CN111459937B - Data table association method, device, server and storage medium - Google Patents

Data table association method, device, server and storage medium

Info

Publication number
CN111459937B
Authority
CN
China
Prior art keywords
data table
association
data
stage
amount
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010227679.3A
Other languages
Chinese (zh)
Other versions
CN111459937A (en)
Inventor
徐翔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010227679.3A
Publication of CN111459937A
Application granted
Publication of CN111459937B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24558 Binary matching operations
    • G06F16/2456 Join operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application relate to the technical field of data processing and provide a data table association method, device, server and storage medium. The method comprises: when a calculation request is received, determining a plurality of association phases in the instruction set corresponding to the calculation request; identifying a first data table and a second data table in a first association phase; if the data amount of at least one of the first data table and the second data table is smaller than a preset data threshold, associating the two data tables by broadcast; when the association task of a non-first association phase is executed, counting the data amount of the associated data table obtained in the first association phase; and if that data amount is smaller than the data threshold, broadcasting the associated data table obtained in the first association phase to a target data table for association. This reduces the amount of data transmitted over the network, improves program execution efficiency, and to a certain extent mitigates program memory overflow caused by data skew.

Description

Data table association method, device, server and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data table association method, a data table association device, a server, and a storage medium.
Background
Apache Spark, often simply called Spark, is a fast, general-purpose computing engine designed for large-scale data processing. As data volumes grow, Spark is becoming a widely used computing framework in the big data field. SparkSQL, a component of the Spark big data framework, supports reading and writing data with standard SQL queries and HiveQL. It can be used for distributed processing of structured data and for SQL-like queries on Spark data, helping developers create and run Spark programs faster.
The main performance bottleneck when processing big data with SparkSQL is the Shuffle, whose performance directly affects the performance and throughput of the whole program. Shuffle refers to the part of the process in which data output by one node is transferred to another node. It includes data preparation at the output stage and data copying at the input stage, that is, the Shuffle Write at the output end and the Shuffle Read at the input end. In general, when data is processed by Shuffle, the input end has to pull data output on other nodes across nodes, which consumes network resources, memory and disk I/O.
For example, when two data tables are associated, if the data of the large table is pulled to the small table, more data has to be pulled, so the amount of data transmitted over the network increases, program execution efficiency suffers, and program memory overflow caused by data skew may even occur.
Disclosure of Invention
In view of this, the embodiments of the present application provide a data table association method, device, server and storage medium, to solve the prior-art problem that pulling the data of a large table to a small table involves a large data volume, which lowers program execution efficiency and may even cause program memory overflow due to data skew.
A first aspect of an embodiment of the present application provides a data table association method, including:
when a calculation request for a data table is received, acquiring an instruction set corresponding to the calculation request;
determining a plurality of association phases in the instruction set, the association phases including a first association phase and non-first association phases adjacent to the first association phase;
when the association task corresponding to the first association phase is executed, identifying a first data table and a second data table to be associated in the first association phase;
if the data amount of at least one of the first data table and the second data table is smaller than a preset data threshold, associating the first data table and the second data table by broadcast to obtain the associated data table of the first association phase;
when the association task corresponding to the non-first association phase is executed, counting the data amount of the associated data table obtained in the first association phase; and
if the data amount of the associated data table obtained in the first association phase is smaller than the data threshold, broadcasting that associated data table to a target data table and associating it with the target data table by the same key value.
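As a non-authoritative illustration, the steps of the first aspect can be sketched in plain Python. The sketch only plans a strategy from table sizes and does not call Spark. The names `Table`, `pick_strategy` and `plan_two_phases`, and the threshold value, are hypothetical and do not come from the application.

```python
from dataclasses import dataclass

# Hypothetical broadcast threshold in bytes (an assumption for illustration).
THRESHOLD = 200 * 1024 * 1024


@dataclass
class Table:
    name: str
    size: int  # data amount in bytes


def pick_strategy(left: Table, right: Table, threshold: int = THRESHOLD) -> str:
    """Broadcast if at least one side is below the threshold, else Shuffle."""
    return "broadcast" if min(left.size, right.size) < threshold else "shuffle"


def plan_two_phases(first, second, target, joined_size, threshold=THRESHOLD):
    """Plan the first phase and the adjacent non-first phase.

    joined_size stands in for the counted data amount of the associated
    data table produced by the first phase.
    """
    first_phase = f"{first.name} JOIN {second.name}"
    plan = [(first_phase, pick_strategy(first, second, threshold))]
    # Non-first phase: re-check the intermediate result against the same
    # threshold instead of defaulting to Shuffle.
    second_strategy = "broadcast" if joined_size < threshold else "shuffle"
    plan.append((f"({first_phase}) JOIN {target.name}", second_strategy))
    return plan
```

The point of the sketch is the second check: the intermediate result is measured and compared with the same threshold before the next association is planned.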
A second aspect of an embodiment of the present application provides a data table associating apparatus, including:
an instruction set acquisition module, configured to acquire an instruction set corresponding to a calculation request when the calculation request for a data table is received;
an association phase determination module, configured to determine a plurality of association phases in the instruction set, the association phases including a first association phase and non-first association phases adjacent to the first association phase;
a data table identification module, configured to identify a first data table and a second data table to be associated in the first association phase when the association task corresponding to the first association phase is executed;
an initial association module, configured to associate the first data table and the second data table by broadcast to obtain the associated data table of the first association phase if the data amount of at least one of the first data table and the second data table is smaller than a preset data threshold;
a data amount statistics module, configured to count the data amount of the associated data table obtained in the first association phase when the association task corresponding to the non-first association phase is executed; and
a target association module, configured to broadcast the associated data table obtained in the first association phase to a target data table if the data amount of that associated data table is smaller than the data threshold, and to associate it with the target data table by the same key value.
A third aspect of an embodiment of the present application provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the data table association method described in the first aspect when executing the computer program.
A fourth aspect of an embodiment of the present application provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the data table association method described in the first aspect.
A fifth aspect of an embodiment of the present application provides a computer program product, which when run on a server, causes the server to perform the data table association method of the first aspect.
Compared with the prior art, the embodiment of the application has the following advantages:
According to the embodiments of the application, when a calculation request for a data table is received, the instruction set corresponding to the calculation request is acquired, and the association operations are divided into different association phases according to the instruction set. Before each association is processed, the data amounts of the two data tables involved are counted and checked. If one of them is an associated data table smaller than the preset data threshold, it is sent by broadcast to the data node where the target data table is located, and the association is performed on the node holding the larger target data table. This reduces the amount of data transmitted over the network, improves program execution efficiency, and to a certain extent mitigates program memory overflow caused by data skew.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following will briefly introduce the drawings that are required to be used in the embodiments or the description of the prior art. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flowchart of the steps of a data table association method according to an embodiment of the present application;
FIG. 2 is a flowchart of the steps of another data table association method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data table association device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a server according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In the prior art, the Spark computing framework provides a BroadCast mechanism that broadcasts a small table to avoid the Shuffle. With a threshold specified by the user, a small table whose data amount is smaller than the threshold is identified and broadcast to the node where the large table is located, where the data tables are associated locally. However, the original Spark execution plan does not broadcast again a data table formed by an association. For example, after two small tables are associated by broadcast, the result is registered as a temporary small table; when this temporary small table is then associated with a large table, the Spark program does not broadcast it, even if its data amount is smaller than the broadcast threshold set by the user, but performs the association by Shuffle. This execution mode increases the amount of data transmitted over the network, reduces program execution efficiency, and may even cause program memory overflow due to data skew.
To solve the above problem, the core idea of this embodiment is that, each time data tables are associated, the data amount of the data table obtained by the previous association is checked, and if it is smaller than the broadcast threshold, the association continues to be performed by broadcasting the small table (BroadCast). By adding this dynamic check to the execution process, the embodiment reduces the amount of data transmitted over the network, improves program execution efficiency, and to a certain extent avoids program memory overflow caused by data skew.
The technical scheme of the application is described below through specific examples.
Referring to FIG. 1, a flowchart of the steps of a data table association method according to an embodiment of the present application is shown. The method may specifically include the following steps:
S101, when a calculation request for a data table is received, acquiring an instruction set corresponding to the calculation request;
It should be noted that the method can be applied to a server. That is, the execution subject of this embodiment is a server, and the association of the data tables is realized by the processing of the server. The server in this embodiment may be a cloud server or a server cluster composed of a plurality of computing devices, and the specific type of the server is not limited in this embodiment.
In this embodiment, the calculation request may be a SparkSQL calculation request, generated when computation is to be performed on data in a distributed cluster. Specifically, the request may be sent by a staff member through a terminal device, according to a certain calculation rule, to the server where the distributed cluster is located, or it may be generated automatically according to a set trigger condition.
The data in the distributed clusters are generally organized in the form of tables, and in the process of realizing the service requirements, the data in different data tables are generally subjected to association calculation so as to meet the corresponding service requirements.
In this embodiment, when a calculation request for a data table is received, an instruction set corresponding to the calculation request, that is, a set of a plurality of instructions to be executed in a subsequent calculation process, may be acquired first. The calculation process required by the SparkSQL calculation request described above may be accomplished by executing each instruction in the instruction set.
S102, determining a plurality of association phases in the instruction set, wherein the association phases comprise a first association phase and non-first association phases adjacent to the first association phase;
Typically, the result data required by the business is not stored in a single data table but is obtained by association calculation over several data tables. The association calculation may be logically divided into a plurality of phases, such as a first association phase, a second association phase, a third association phase, and so on.
In this embodiment, any two adjacent association phases may be treated as a first association phase and a non-first association phase. That is, once a given association process is called the first association phase, the subsequent association process adjacent to it is called the non-first association phase.
For example, suppose that, according to the SparkSQL calculation request, data table A must first be associated with data table B, and the resulting data table C must then be associated with another data table D to obtain the data table E finally required by the business. In this process, associating data table A with data table B may be called the first association phase, and associating the resulting data table C with data table D may be called the second association phase, that is, the non-first association phase.
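The pairing of adjacent phases described above can be sketched as follows. This is only an illustration; the function name `adjacent_phase_pairs` and the string labels for the phases are hypothetical.

```python
def adjacent_phase_pairs(phases):
    """Pair every association phase with the phase that directly follows it.

    In each pair, the first element plays the role of the "first association
    phase" and the second element the adjacent "non-first association phase".
    """
    return list(zip(phases, phases[1:]))


# Hypothetical labels for the A/B/C/D/E example in the text.
phases = ["A JOIN B -> C", "C JOIN D -> E"]
pairs = adjacent_phase_pairs(phases)
```

With three or more phases, every phase except the last serves as the "first" phase of one pair, which matches the statement that any two adjacent phases can be treated this way.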
S103, identifying a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
In the instruction set corresponding to the SparkSQL calculation request, the objects of each calculation are written in the form of instructions, indicating which data or objects need to be processed during the calculation.
For the association of data tables, two data tables to be associated in each association operation can also be directly obtained from the instruction. That is, when executing a certain associated operation task in the computing instruction, it may be determined directly from the instruction which two data tables need to be associated.
For example, if policy information is stored in data table A, personal information of the applicants is stored in data table B, and the business requires the policy information and the applicants' personal information to be collected in the same data table, then data table A and data table B need to be associated using SparkSQL. In this case, data table A and data table B are the two initial data tables to be associated in the first association phase; data table A may be regarded as the first data table to be associated, and data table B as the second.
S104, if the data amount of at least one data table in the first data table and the second data table is smaller than a preset data threshold, associating the first data table and the second data table in a broadcasting mode to obtain an associated data table in the first association stage;
In this embodiment, the association processing may be performed on two initial data tables by executing a preset SparkSQL instruction.
In a specific implementation, when the SparkSQL instruction associates two different initial data tables, it may first determine whether they satisfy the condition set by a data threshold. The data threshold may be compared against the data amount recorded in the metadata of a data table, and its specific value may be set according to the actual situation of the system, for example 200M. That is, when the data amount recorded in the metadata of a data table does not exceed 200M, the data table may be associated by broadcast.
Therefore, it may first be determined whether either of the two initial data tables to be associated has a data amount that does not exceed the data threshold, and if so, that data table may be broadcast to the data node where the other data table is located.
For example, if the data amount of data table A is smaller than the preset data threshold and the data amount of data table B is larger than it, data table A may be broadcast to the data node where data table B is located; if the data amount of data table A is larger than the preset data threshold and the data amount of data table B is smaller than it, data table B may be broadcast to the data node where data table A is located; and if the data amounts of both data table A and data table B are smaller than the preset data threshold, the data table with the smaller data amount is broadcast to the data node where the data table with the larger data amount is located.
Of course, if the data amounts of data table A and data table B are both larger than the preset data threshold, the data tables cannot be associated by broadcast, and the association between data table A and data table B has to be achieved through a Shuffle operation.
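The four cases above can be condensed into one small decision function. The following is a minimal sketch; the function name `table_to_broadcast` is hypothetical, and breaking a tie in favor of table A when both tables are equally small is an assumption of this sketch.

```python
from typing import Optional


def table_to_broadcast(size_a: int, size_b: int, threshold: int) -> Optional[str]:
    """Return which table to broadcast, or None if a Shuffle is required."""
    a_small = size_a < threshold
    b_small = size_b < threshold
    if a_small and b_small:
        # Both qualify: broadcast the smaller one (ties go to A, an assumption).
        return "A" if size_a <= size_b else "B"
    if a_small:
        return "A"
    if b_small:
        return "B"
    # Neither table is below the threshold: fall back to Shuffle.
    return None
```

Returning `None` rather than a third label keeps the Shuffle fallback explicit for the caller.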
In this embodiment, the association processing of the two initial data tables may be implemented by the same key (key) value.
For example, for data table A storing the policy information and data table B storing the personal information of the applicants, the key of both data tables may be the client number. When the two data tables are associated, rows having the same client number are associated together to obtain one associated data table.
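To make association by the same key value concrete, the following plain-Python sketch imitates a broadcast association without using Spark. The small table is turned into an in-memory lookup that every partition of the large table can consult locally. The function name `broadcast_join`, the field names and the sample rows are all hypothetical, and the sketch keeps only rows whose key appears in both tables, which is an assumption.

```python
from collections import defaultdict


def broadcast_join(small_rows, large_partitions, key):
    """Associate a broadcast small table with a partitioned large table."""
    # Build the lookup once; in a cluster this is what would be broadcast.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    # Each partition stands for the part of the large table held by one node.
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical sample data: applicants (small) and policies (large, 2 partitions).
applicants = [{"client_no": 1, "name": "Li"}, {"client_no": 2, "name": "Wang"}]
policy_partitions = [
    [{"client_no": 1, "policy": "P-100"}],
    [{"client_no": 2, "policy": "P-200"}, {"client_no": 3, "policy": "P-300"}],
]
result = broadcast_join(applicants, policy_partitions, "client_no")
```

No row of the large table leaves its partition here; only the small table is copied, which is the property that reduces network transmission.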
S105, counting the data volume of the association data table obtained in the first association stage when the association task corresponding to the non-first association stage is executed;
In this embodiment, the instructions of the non-first association phase may be set in the SparkSQL program. That is, when several data tables need to be associated, the program may specify which two data tables are associated first and with which other data table the resulting data table is then associated.
Taking the second association phase as the non-first association phase, for example, its previous association phase is the first association phase. In the prior art, SparkSQL does not perform the second association by broadcast but by Shuffle. To reduce the amount of data transmitted over the network and improve program execution efficiency, the execution mode may be preset in the SparkSQL instructions so that, before the two data tables of the non-first association phase are associated, the data amount of the associated data table obtained in the previous phase is compared with the preset data threshold.
In a specific implementation, a detection condition may be set in the SparkSQL instructions: when a task of a non-first association phase is detected, it is determined whether an instruction has been triggered that associates the data table obtained in the previous association phase with another data table. If so, an operation instruction that compares the data amount of a data table with the preset data threshold may be called; executing it counts the data amount of the associated data table of the previous phase and determines whether that amount is smaller than the data threshold.
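The detection condition can be pictured with the following sketch, which is not the application's implementation. An association instruction is modeled as a dictionary with an `inputs` tuple; `needs_size_check`, `strategy_for`, `previous_outputs` and `counted_sizes` are hypothetical names.

```python
def needs_size_check(instruction, previous_outputs):
    """True if the association consumes a table produced by an earlier phase."""
    return any(name in previous_outputs for name in instruction["inputs"])


def strategy_for(instruction, previous_outputs, counted_sizes, threshold):
    """Choose the strategy for a non-first phase; None for other instructions."""
    if not needs_size_check(instruction, previous_outputs):
        return None  # not a non-first phase; handled by the first-phase rule
    # Find the input that is the associated data table of the previous phase.
    intermediate = next(
        name for name in instruction["inputs"] if name in previous_outputs
    )
    # Compare its counted data amount with the threshold.
    return "broadcast" if counted_sizes[intermediate] < threshold else "shuffle"
```

For the example in the text, the instruction associating C with D has `inputs` of `("C", "D")`, and C is in `previous_outputs` because it was produced by associating A with B.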
S106, if the data amount of the associated data table obtained in the first association phase is smaller than the data threshold, broadcasting the associated data table obtained in the first association phase to a target data table, and associating the associated data table with the target data table by the same key value.
If the data amount of the associated data table obtained in the previous association phase, that is, the first association phase, is smaller than the data threshold, the association task of the current phase can continue to be executed by broadcasting the small table (BroadCast): the associated data table is broadcast to the target data table for association, the target data table being the other data table involved in the association task of the current phase.
Similarly, the association processing of the association data table and the target data table can be performed by the same key value.
In the embodiments of the application, when a calculation request for a data table is received, the instruction set corresponding to the calculation request is acquired, and the association operations are divided into different association phases according to the instruction set. Before each association is processed, the data amounts of the two data tables involved are counted and checked. If one of them is an associated data table smaller than the preset data threshold, it is sent by broadcast to the data node where the target data table is located, and the association is performed on the node holding the larger target data table. This reduces the amount of data transmitted over the network, improves program execution efficiency, and to a certain extent mitigates program memory overflow caused by data skew.
Referring to FIG. 2, a flowchart of the steps of another data table association method according to an embodiment of the present application is shown. The method may specifically include the following steps:
S201, when a calculation request for a data table is received, acquiring an instruction set corresponding to the calculation request;
The execution subject of this embodiment is a server, and the association of the data tables is realized through the processing of the server.
The instruction set in this embodiment is a set of multiple instructions that need to be executed in the subsequent computing process. The computation required by the computation request described above may be accomplished by executing each instruction in the instruction set.
S202, identifying a plurality of association instructions contained in the instruction set, wherein each association instruction corresponds to an association stage;
In general, one association operation on any two data tables belongs to one association phase. After the instruction set corresponding to the calculation request is acquired, the association instructions can be identified from it. Generally, each association instruction corresponds to its own association phase. If the instruction set contains a plurality of association instructions, the calculation request needs to complete a plurality of association operations.
S203, determining the execution order of the association instructions, identifying the first association instruction, which is in the first position of the execution order, and determining the association phase corresponding to the first association instruction as the first association phase;
In this embodiment, to determine how many association phases the whole process contains, the execution order of the association instructions may be determined from the execution order of all instructions.
Generally, the association instruction in the first position of the execution order is identified as the first association instruction, and the association phase corresponding to it is called the first association phase. The next association instruction executed after the first association phase is the second association instruction, and the association phase corresponding to it is the second association phase. In this way, the execution order of the association phases is determined one by one.
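Steps S202 and S203 amount to filtering the association instructions out of the instruction set and numbering them by execution order. The sketch below is a hedged illustration: the instruction representation (dictionaries with `op`, `name` and `order` fields) and the function name `number_phases` are hypothetical.

```python
def number_phases(instructions):
    """Map phase number (1 = first association phase) to association name."""
    # Keep only association (join) instructions, then order them by execution.
    joins = sorted(
        (ins for ins in instructions if ins["op"] == "join"),
        key=lambda ins: ins["order"],
    )
    return {index: ins["name"] for index, ins in enumerate(joins, start=1)}


# Hypothetical instruction set: one scan and two associations, listed out of order.
instructions = [
    {"op": "scan", "name": "read A", "order": 1},
    {"op": "join", "name": "C JOIN D", "order": 4},
    {"op": "join", "name": "A JOIN B", "order": 3},
]
phase_numbers = number_phases(instructions)
```

Phase 1 is the first association phase; every later number is a non-first association phase relative to the one before it.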
S204, identifying a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
For association of data tables, two data tables to be associated in each association operation can be directly obtained from the instruction. That is, when executing a certain association instruction, it may be determined directly from the instruction which two data tables need to be associated.
S205, if the data amount of at least one data table in the first data table and the second data table is smaller than a preset data threshold, associating the first data table and the second data table in a broadcasting mode to obtain an associated data table in the first association stage;
In this embodiment, when the first association instruction is executed, it may first be determined whether the first data table and the second data table to be associated meet the condition set by the data threshold. To that end, the first data amount of the first data table and the second data amount of the second data table may be counted respectively.
In general, the size of the data volume of a data table can be determined directly by looking up metadata information recorded in the data table.
Therefore, in this embodiment, the data table information of the first data table and the second data table may be acquired respectively, and then by reading the first data amount of the first data table and the second data amount of the second data table recorded in the data table information, it is determined whether or not there is a case where the data amount of one data table is smaller than the preset data threshold value in the first data table and the second data table.
If at least one of the first data table and the second data table is smaller than the preset data threshold and the first data amount is larger than the second data amount, then at least the second data amount is smaller than the data threshold. In this case, the second data table may be broadcast to the first data node, where the first data table is located, and the two data tables are associated at the first data node to obtain the associated data table of the first association phase.
If at least one of the first data table and the second data table is smaller than the preset data threshold and the first data amount is smaller than or equal to the second data amount, then at least the first data amount is smaller than the data threshold. In this case, the first data table may be broadcast to the second data node, where the second data table is located, and the two data tables are associated at the second data node to obtain the associated data table of the first association phase.
Of course, if the data amounts of the first data table and the second data table are both greater than the preset data threshold, the data table association cannot be performed in a broadcast manner. At this time, association of the two needs to be achieved by a Shuffle operation.
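The three cases above amount to a small decision function. The following is a minimal sketch under stated assumptions: the `Strategy` enum and the function name are hypothetical, and a data amount exactly equal to the threshold is treated as "not smaller than the threshold" (the text does not address that boundary case).

```python
from enum import Enum


class Strategy(Enum):
    BROADCAST_SECOND_TO_FIRST = "broadcast second table to the first table's node"
    BROADCAST_FIRST_TO_SECOND = "broadcast first table to the second table's node"
    SHUFFLE = "shuffle"


def choose_first_stage_strategy(first_amount, second_amount, threshold):
    """Pick how to associate two tables in the first association stage."""
    # Neither table is smaller than the threshold: broadcasting is not possible.
    # (Treating equality as "not smaller" is an assumption of this sketch.)
    if first_amount >= threshold and second_amount >= threshold:
        return Strategy.SHUFFLE
    # At least one table is below the threshold; always move the smaller one.
    if first_amount > second_amount:
        return Strategy.BROADCAST_SECOND_TO_FIRST
    return Strategy.BROADCAST_FIRST_TO_SECOND
```

Whenever broadcasting is allowed, the smaller table is the one that travels, so the larger table never leaves its data node.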
S206, when the association task corresponding to the non-first association stage is executed, determining the memory size occupied by the association data table obtained in the first association stage, and taking the memory size as the data size of the association data table obtained in the first association stage;
Because the associated data table obtained in each stage results from associating two different data tables, it is not persisted as a physical table; it exists only as a temporary table within the program. Its data amount therefore cannot be determined directly by looking up metadata information recorded for a data table. Instead, the data amount of the associated data table can be determined from the memory size it occupies.
The previous association stage adjacent to a non-first association stage refers to the association stage immediately preceding the current association stage. For example, if the current association stage is the second association stage, the adjacent previous association stage is the first association stage.
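As a rough illustration of using occupied memory as the data amount of a temporary table, the sketch below sums Python object sizes for a table held as a list of tuples. The function name and the row representation are assumptions for the example; a real engine would use its own size accounting, and `sys.getsizeof` gives only an approximate, interpreter-specific figure.

```python
import sys


def estimate_memory_size(rows):
    """Roughly estimate the bytes occupied by an in-memory table (list of tuples)."""
    total = sys.getsizeof(rows)  # the list container itself
    for row in rows:
        total += sys.getsizeof(row)  # the tuple holding one record
        total += sum(sys.getsizeof(value) for value in row)  # the field values
    return total


# Hypothetical temporary result of the first association stage.
temp_table = [(1, "alice", "north"), (2, "bob", "south")]
temp_table_amount = estimate_memory_size(temp_table)
```

The estimate counts shared objects more than once, so it is best read as an upper-leaning approximation to compare against the threshold rather than an exact byte count.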
S207, if the data volume of the associated data table obtained in the first association stage is smaller than the data threshold, broadcasting the associated data table obtained in the first association stage to a target data node where a target data table is located;
After the data amount of the associated data table has been determined, it may then be determined whether that data amount is smaller than the preset data threshold. If the data amount of the associated data table is smaller than the data threshold, the association can be performed by broadcasting the small table.
In a specific implementation, an associated data table whose data amount is smaller than the preset data threshold may be broadcast to the target data node where the target data table is located. The target data table is the data table that, in the current stage, needs to be associated with the associated data table of the previous stage.
S208, associating, at the target data node, the associated data table obtained in the first association stage with the data corresponding to the same key value in the target data table.
When the target data node performs association processing on the association data table and the target data table, the association processing can be realized through the same key value. The same key value may refer to a field that exists in both the associated data table and the target data table and can uniquely identify a piece of data.
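Association through the same key value at the target data node can be sketched as a hash join in which the broadcast (small) table is indexed by key and the target table is probed row by row. This is a minimal sketch assuming rows are dictionaries and the association is an inner join; the function name and sample data are hypothetical.

```python
def broadcast_join(broadcast_rows, target_rows, key):
    """Inner-join two lists of dict rows on `key`, indexing the broadcast side."""
    # Build a hash index over the small, broadcast table.
    index = {}
    for row in broadcast_rows:
        index.setdefault(row[key], []).append(row)

    # Probe the index with each row of the locally stored target table.
    joined = []
    for target_row in target_rows:
        for match in index.get(target_row[key], []):
            joined.append({**match, **target_row})
    return joined


# Hypothetical data: the associated table from the previous stage and the target table.
associated = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
target = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}]
result = broadcast_join(associated, target, "id")
```

Only rows whose key appears in both tables survive, and the target table is read in place without being moved across the network.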
It should be noted that this embodiment describes the association task of a non-first association stage using only a single association operation as an example. Those skilled in the art should understand that, when each association task is executed, a corresponding data detection process may be run through an instruction to count the data amount of the associated data table to be associated in the current stage. If that data amount is smaller than the preset data threshold, the associated data table is broadcast to the data node where the target data table of the current stage is located, which reduces the amount of data transmitted over the network and improves program execution efficiency.
In the embodiment of the application, when each data table association task is executed, it is determined whether the data amount of the associated data table obtained in the previous stage is smaller than the preset data threshold. If so, the associated data table can be sent to the data node where the target data table is located, and the association is performed on that node. This embodiment adds a dynamic check to the execution process: the data amount of the associated data table is evaluated each time an association instruction is executed, so that data tables can be associated by broadcasting whenever the data threshold requirement is met. This reduces the number of Shuffle operations and the amount of data transmitted over the network, improves program execution efficiency, and minimizes the risk of data skew during data transmission.
It should be noted that the sequence numbers of the steps in the above embodiment do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiment of the present application in any way.
Referring to fig. 3, a schematic diagram of a data table association device according to an embodiment of the present application is shown. The device may specifically include the following modules:
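The per-stage dynamic check can be illustrated with a small simulation in which the intermediate result is re-measured before every subsequent association. This is a sketch only: it uses the row count as a stand-in for the data amount, joins dictionary rows in memory, and records the chosen mode instead of actually moving data; all names are hypothetical.

```python
def run_stages(first_result, targets, key, threshold_rows):
    """Join `first_result` with each target in turn, re-checking its size every stage."""
    current = first_result
    decisions = []
    for target in targets:
        # Dynamic check: measure the previous stage's result before this stage runs.
        decisions.append("broadcast" if len(current) < threshold_rows else "shuffle")
        # Both modes yield the same joined rows; they differ only in how data moves.
        index = {}
        for row in current:
            index.setdefault(row[key], []).append(row)
        current = [{**m, **t} for t in target for m in index.get(t[key], [])]
    return current, decisions


# Hypothetical inputs: a two-row first-stage result and two further target tables.
first_result = [{"id": 1, "a": 1}, {"id": 2, "a": 2}]
stage2_target = [{"id": 1, "b": i} for i in range(4)]
stage3_target = [{"id": 1, "c": 0}]
final_rows, decisions = run_stages(first_result, [stage2_target, stage3_target], "id", 3)
```

In this run the second stage can broadcast because its input has two rows, while the third stage falls back to a shuffle because the intermediate result has grown to four rows, which is the kind of change a one-time static plan would miss.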
An instruction set obtaining module 301, configured to obtain, when a calculation request for a data table is received, an instruction set corresponding to the calculation request;
An association stage determination module 302 configured to determine a plurality of association stages in the instruction set, the association stages including a first association stage and non-first association stages adjacent to the first association stage;
The data table identifying module 303 is configured to identify a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
An initial association module 304, configured to, if at least one data amount of the first data table and the second data table is smaller than a preset data threshold, associate the first data table and the second data table in a broadcast manner, and obtain an associated data table in the first association stage;
A data amount statistics module 305, configured to count the data amount of the associated data table obtained in the first association stage when the association task corresponding to the non-first association stage is executed;
A target association module 306, configured to, if the data amount of the associated data table obtained in the first association stage is smaller than the data threshold, broadcast the associated data table obtained in the first association stage to a target data table, and associate the associated data table with the target data table through the same key value.
In the embodiment of the present application, the association phase determining module 302 may specifically include the following sub-modules:
An association instruction identification sub-module, configured to identify a plurality of association instructions contained in the instruction set, each association instruction corresponding to one association stage.
In an embodiment of the present application, the association phase determining module 302 may further include the following sub-modules:
An execution sequence determining sub-module, configured to determine the execution sequence of the association instructions;
A first association stage identification sub-module, configured to identify the association instruction in the first position of the execution sequence as the first association instruction, and to determine the association stage corresponding to that instruction as the first association stage.
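The work of these sub-modules (identify the association instructions, order them, and take the first one as the first association stage) can be sketched as below. The `Instruction` record, its `kind` values, and the sample instruction set are assumptions made for illustration; the application does not specify how instructions are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """Hypothetical representation of one instruction in an instruction set."""
    order: int  # position in the execution sequence
    kind: str   # e.g. "scan", "join", "filter"; "join" marks an association instruction
    text: str


def association_stages(instruction_set):
    """Return the association instructions sorted by execution order, one per stage."""
    joins = [ins for ins in instruction_set if ins.kind == "join"]
    return sorted(joins, key=lambda ins: ins.order)


def first_association_stage(instruction_set):
    """Return the association instruction in the first position, or None if absent."""
    stages = association_stages(instruction_set)
    return stages[0] if stages else None


# Hypothetical instruction set, deliberately listed out of order.
instruction_set = [
    Instruction(3, "join", "t1_t2 JOIN t3"),
    Instruction(1, "scan", "SCAN t1"),
    Instruction(2, "join", "t1 JOIN t2"),
]
stages = association_stages(instruction_set)
first_stage = first_association_stage(instruction_set)
```

Every stage after `first_stage` is a non-first association stage whose input is the temporary result of the stage before it.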
In the embodiment of the present application, the initial association module 304 may specifically include the following sub-modules:
The data volume statistics sub-module is used for respectively counting the first data volume of the first data table and the second data volume of the second data table;
A first association sub-module, configured to, if at least one of the first data table and the second data table has a data amount smaller than a preset data threshold and the first data amount is larger than the second data amount, broadcast the second data table to a first data node where the first data table is located, and associate the first data table with the second data table at the first data node to obtain the associated data table of the first association stage;
A second association sub-module, configured to, if at least one of the first data table and the second data table has a data amount smaller than a preset data threshold and the first data amount is smaller than or equal to the second data amount, broadcast the first data table to a second data node where the second data table is located, and associate the first data table with the second data table at the second data node to obtain the associated data table of the first association stage.
In the embodiment of the present application, the data volume statistics submodule may specifically include the following units:
A data table information acquisition unit, configured to acquire the data table information of the first data table and of the second data table separately;
A data amount reading unit, configured to read the first data amount of the first data table and the second data amount of the second data table recorded in the data table information.
In the embodiment of the present application, the data amount statistics module 305 may specifically include the following sub-modules:
A memory statistics sub-module, configured to determine, when the association task corresponding to the non-first association stage is executed, the memory size occupied by the associated data table obtained in the first association stage, and to take that memory size as the data amount of the associated data table obtained in the first association stage.
In the embodiment of the present application, the target association module 306 may specifically include the following sub-modules:
An associated data table broadcasting sub-module, configured to broadcast the associated data table obtained in the first association stage to the target data node where the target data table is located;
A target data table association sub-module, configured to associate, at the target data node, the associated data table obtained in the first association stage with the data corresponding to the same key value in the target data table.
When a calculation request for a data table is received, the embodiment of the application acquires the instruction set corresponding to the calculation request and divides the association operations into different association stages according to the instruction set. Before each association, the data amounts of the two data tables are counted and checked. If an associated data table smaller than the preset data threshold exists, it is broadcast to the data node where the target data table is located, and the association is performed on the node holding the target data table with the larger data amount. This reduces the amount of data transmitted over the network, improves program execution efficiency, and to some extent alleviates program memory overflow caused by data skew.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference should be made to the description of the method embodiments.
Referring to fig. 4, a schematic diagram of a server according to an embodiment of the application is shown. As shown in fig. 4, the server 400 of the present embodiment includes: a processor 410, a memory 420, and a computer program 421 stored in the memory 420 and executable on the processor 410. When executing the computer program 421, the processor 410 implements the steps of the embodiments of the data table association method described above, such as steps S101 to S106 shown in fig. 1. Alternatively, when executing the computer program 421, the processor 410 performs the functions of the modules/units in the device embodiments described above, for example, the functions of the modules 301 to 306 shown in fig. 3.
Illustratively, the computer program 421 may be partitioned into one or more modules/units that are stored in the memory 420 and executed by the processor 410 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing particular functions, which may be used to describe the execution of the computer program 421 in the server 400. For example, the computer program 421 may be divided into an instruction set acquisition module, an association phase determination module, a data table identification module, an initial association module, a data amount statistics module, and a target association module, each of which specifically functions as follows:
The instruction set acquisition module is used for acquiring an instruction set corresponding to a calculation request when the calculation request for a data table is received;
An association phase determination module configured to determine a plurality of association phases in the instruction set, the association phases including a first association phase and non-first association phases adjacent to the first association phase;
The data table identification module is used for identifying a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
The initial association module is used for associating the first data table with the second data table in a broadcast mode to obtain an association data table in the first association stage if the data amount of at least one data table in the first data table and the second data table is smaller than a preset data threshold value;
The data quantity statistics module is used for counting the data quantity of the association data table obtained in the first association stage when the association task corresponding to the non-first association stage is executed;
The target association module is used for broadcasting the associated data table obtained in the first association stage to a target data table if the data amount of the associated data table obtained in the first association stage is smaller than the data threshold, and associating the associated data table with the target data table through the same key value.
The server 400 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The server 400 may include, but is not limited to, the processor 410 and the memory 420. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the server 400 and does not limit it; the server 400 may include more or fewer components than shown, combine certain components, or use different components. For example, the server 400 may further include input and output devices, network access devices, buses, and the like.
The processor 410 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 420 may be an internal storage unit of the server 400, such as a hard disk or internal memory of the server 400. The memory 420 may also be an external storage device of the server 400, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the server 400. Further, the memory 420 may include both an internal storage unit and an external storage device of the server 400. The memory 420 is used to store the computer program 421 as well as other programs and data required by the server 400. The memory 420 may also be used to temporarily store data that has been output or is to be output.
The above embodiments are intended only to illustrate the technical solution of the present application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, and some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (7)

1. A method of table association, comprising:
When a calculation request aiming at a data table is received, acquiring an instruction set corresponding to the calculation request;
determining a plurality of association phases in the instruction set, the association phases including a first association phase and non-first association phases adjacent to the first association phase;
When the association task corresponding to the first association stage is executed, a first data table and a second data table to be associated in the first association stage are identified;
Respectively counting a first data amount of the first data table and a second data amount of the second data table; if at least one data table of the first data table and the second data table has a data amount smaller than a preset data threshold value and the first data amount is larger than the second data amount, broadcasting the second data table to a first data node where the first data table is located, and associating the first data table with the second data table at the first data node to obtain an associated data table in the first association stage; if at least one data table in the first data table and the second data table has a data amount smaller than a preset data threshold and the first data amount is smaller than or equal to the second data amount, broadcasting the first data table to a second data node where the second data table is located, and associating the first data table with the second data table at the second data node to obtain an associated data table in the first association stage;
When the association task corresponding to the non-first association stage is executed, determining the memory size occupied by the association data table obtained in the first association stage, and taking the memory size as the data size of the association data table obtained in the first association stage;
And if the data quantity of the association data table obtained in the first association stage is smaller than the data threshold value, broadcasting the association data table obtained in the first association stage to a target data node where a target data table is located, and associating the association data table obtained in the first association stage with data corresponding to the same key value in the target data table at the target data node.
2. The method of claim 1, wherein the determining a plurality of association phases in the instruction set comprises:
a plurality of associated instructions contained in the instruction set are identified, each associated instruction corresponding to a respective associated stage.
3. The method of claim 2, further comprising, after said identifying a plurality of associated instructions contained in said instruction set:
determining the execution sequence of each associated instruction;
And identifying the association instruction in the first position of the execution sequence as a first association instruction, and determining the association stage corresponding to the first association instruction as the first association stage.
4. The method of claim 1, wherein the separately counting the first data amount of the first data table and the second data amount of the second data table comprises:
respectively acquiring data table information of the first data table and the second data table;
And reading the first data amount of the first data table and the second data amount of the second data table recorded in the data table information.
5. A data table associating apparatus, comprising:
the instruction set acquisition module is used for acquiring an instruction set corresponding to a calculation request when the calculation request aiming at a data table is received;
An association phase determination module configured to determine a plurality of association phases in the instruction set, the association phases including a first association phase and non-first association phases adjacent to the first association phase;
The data table identification module is used for identifying a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
The initial association module is used for respectively counting the first data volume of the first data table and the second data volume of the second data table; if at least one data table of the first data table and the second data table has a data amount smaller than a preset data threshold value and the first data amount is larger than the second data amount, broadcasting the second data table to a first data node where the first data table is located, and associating the first data table with the second data table at the first data node to obtain an associated data table in the first association stage; if at least one data table in the first data table and the second data table has a data amount smaller than a preset data threshold and the first data amount is smaller than or equal to the second data amount, broadcasting the first data table to a second data node where the second data table is located, and associating the first data table with the second data table at the second data node to obtain an associated data table in the first association stage;
the data volume statistics module is used for determining the memory size occupied by the association data table obtained in the first association stage when the association task corresponding to the non-first association stage is executed, and taking the memory size as the data volume of the association data table obtained in the first association stage;
And the target association module is used for broadcasting the association data table obtained in the first association stage to a target data node where the target data table is located if the data amount of the association data table obtained in the first association stage is smaller than the data threshold value, and associating the association data table obtained in the first association stage with data corresponding to the same key value in the target data table at the target data node.
6. A server comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the data table association method of any of claims 1 to 4 when the computer program is executed.
7. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the data table association method according to any one of claims 1 to 4.
CN202010227679.3A 2020-03-27 2020-03-27 Data table association method, device, server and storage medium Active CN111459937B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010227679.3A CN111459937B (en) 2020-03-27 2020-03-27 Data table association method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010227679.3A CN111459937B (en) 2020-03-27 2020-03-27 Data table association method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN111459937A CN111459937A (en) 2020-07-28
CN111459937B true CN111459937B (en) 2024-06-07

Family

ID=71682481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010227679.3A Active CN111459937B (en) 2020-03-27 2020-03-27 Data table association method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN111459937B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181831A (en) * 2020-09-28 2021-01-05 中国平安财产保险股份有限公司 Script performance verification method, device and equipment based on keywords and storage medium
CN112732715B (en) * 2020-12-31 2023-08-25 星环信息科技(上海)股份有限公司 Data table association method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268586A (en) * 2017-09-22 2018-07-10 广东神马搜索科技有限公司 Across the data processing method of more tables of data, device, medium and computing device
CN108897796A (en) * 2018-06-12 2018-11-27 平安科技(深圳)有限公司 A kind of operation system calls method, storage medium and the server of influxdb database
WO2019033519A1 (en) * 2017-08-17 2019-02-21 平安科技(深圳)有限公司 User permission data query method and apparatus, electronic device, and medium

Also Published As

Publication number Publication date
CN111459937A (en) 2020-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant