CN110851452B

CN110851452B - Data table connection processing method and device, electronic equipment and storage medium

Info

Publication number: CN110851452B
Application number: CN202010044898.8A
Authority: CN
Inventors: 费伟
Original assignee: Yidu Cloud Beijing Technology Co Ltd
Current assignee: Yidu Cloud Beijing Technology Co Ltd
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2020-09-04
Anticipated expiration: 2040-01-16
Also published as: CN110851452A

Abstract

The disclosure relates to a data table connection processing method and device, electronic equipment, and a computer-readable storage medium. It belongs to the technical field of big data processing and can be applied to scenarios in which it must be decided whether to perform hash connection (hash join) processing on two data tables. The data table connection processing method comprises the following steps: acquiring a data table to be processed, the data table to be processed comprising a reference data table and a data table to be connected; determining data table information of the data table to be processed, the data table information comprising the data volume of the reference data table and the data line number of the reference data table; judging whether the data volume is smaller than a data volume threshold, and judging whether the data line number is smaller than a line number threshold; and if the data volume is smaller than the data volume threshold and the data line number is smaller than the line number threshold, performing data table connection processing on the reference data table and the data table to be connected in a preset connection mode. The method and the device provide a judgment strategy for deciding whether to perform hash connection on the data table to be processed, so as to reduce problems such as memory leakage.

Description

Data table connection processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of big data processing technologies, and in particular, to a data table connection processing method, a data table connection processing apparatus, an electronic device, and a computer-readable storage medium.

Background

In the big data era, big data analysis and mining are increasingly applied to daily life. For example, information of one person is generally distributed in different fields, and even if the information is distributed in the same field, there may be information corresponding to different scenes. Therefore, aggregating information of different domains and scenes is a frequent operation that may be performed in the big data calculation process.

Apache Spark is a fast, general-purpose computing engine designed specifically for large-scale data processing, supporting Structured Query Language (SQL) queries for table data in a data warehouse. When two tables are connected, whether hash connection (namely hash join) is started or not can be decided according to the size of data.

When data tables are connected using the prior art, data density is not taken into account. For data of the same size, differences in compression degree or data type mean that the corresponding Spark table connection task may run unstably and is prone to memory leakage.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure is directed to a data table connection processing method, a data table connection processing apparatus, an electronic device, and a computer-readable storage medium, so as to overcome, at least to some extent, a problem that a memory leak may occur due to factors such as data density not being considered in a process of performing hash connection on a data table.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the invention.

According to a first aspect of the present disclosure, there is provided a data table connection processing method, including: acquiring a data table to be processed; the data table to be processed comprises a reference data table and a data table to be connected; determining data table information of a data table to be processed; the data table information comprises the data quantity of a reference data table and the data line number of the reference data table; judging whether the data quantity of the reference data table is smaller than a data quantity threshold value or not, and judging whether the data line number of the reference data table is smaller than a line number threshold value or not; and if the data volume of the reference data table is less than the data volume threshold and the data line number of the reference data table is less than the line number threshold, performing data table connection processing on the reference data table and the data table to be connected in a preset connection mode.

Optionally, before determining the data table information of the data table to be processed, the method further includes: judging whether the reference data table is a compressed data table or not; if the reference data table is a compressed data table, determining the file type of the reference data table; and acquiring the file compression ratio, and performing pre-estimation processing on the real data volume of the reference data table according to the file compression ratio and the file type to determine the pre-estimated real data volume of the reference data table.

Optionally, after determining the estimated real data amount of the reference data table, the method further includes: acquiring a configuration file, and determining configuration parameters corresponding to a to-be-processed data table from the configuration file; and backfilling the configuration parameters according to the estimated real data volume to generate a system file corresponding to the data table to be processed.

Optionally, determining the data table information of the data table to be processed includes: acquiring metadata information corresponding to a to-be-processed data table, and judging whether the metadata information comprises a preset data table reference field or not; if the metadata information comprises a data table reference field, acquiring a first data table field based on the metadata information; determining the data table information according to the first data table field.

Optionally, determining the data table information of the data table to be processed further includes: if the metadata information does not comprise the data table reference field, acquiring a system file corresponding to the data table to be processed; determining a second data table field corresponding to the data table to be processed according to the system file; determining the data table information from the second data table field.

Optionally, the preset connection mode includes a hash connection mode, and the data table connection processing is performed on the reference data table and the data table to be connected through the preset connection mode, including: determining a hash object corresponding to the reference data table; broadcasting the hash object to a plurality of computing nodes in a data broadcasting mode; and querying data in the hash object by the data table to be connected according to the target keyword so as to perform data table connection processing in a hash connection mode.

Optionally, determining the hash object corresponding to the reference data table includes: determining a reference data table and a data table to be connected from the data table to be processed; determining a target keyword; the target keywords comprise keywords corresponding to data table connection processing of the reference data table and the data table to be connected; and performing object construction processing on the reference data table according to the target keyword to generate a hash object.
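The hash connection steps above (constructing a hash object from the reference data table by the target keyword, broadcasting it, and letting the data table to be connected query it) can be illustrated with a minimal single-process sketch. The function names `build_hash_object` and `hash_join`, and the use of plain Python lists to stand in for computing nodes, are assumptions made for illustration only; the disclosure does not prescribe this code.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def build_hash_object(reference_rows: Iterable[Row], key: str) -> Dict[Any, List[Row]]:
    """Group the rows of the reference (small) table by the target keyword."""
    hash_object: Dict[Any, List[Row]] = defaultdict(list)
    for row in reference_rows:
        hash_object[row[key]].append(row)
    return dict(hash_object)


def hash_join(reference_rows: Iterable[Row],
              partitions: List[List[Row]],
              key: str) -> List[Row]:
    """Inner join: every partition probes the same (broadcast) hash object."""
    hash_object = build_hash_object(reference_rows, key)
    joined: List[Row] = []
    # Each partition stands in for one computing node that received the
    # broadcast hash object; a real engine would run these in parallel.
    for partition in partitions:
        for row in partition:
            for match in hash_object.get(row[key], []):
                joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    small_table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    big_table_partitions = [[{"id": 1, "v": 10}], [{"id": 3, "v": 30}, {"id": 2, "v": 20}]]
    for joined_row in hash_join(small_table, big_table_partitions, "id"):
        print(joined_row)
```

Because the whole hash object is copied to every node, its in-memory size rather than its on-disk size is what matters, which is why the method checks the reference data table before choosing this connection mode.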

According to a second aspect of the present disclosure, there is provided a data table connection processing apparatus including: the data table acquisition module is used for acquiring a data table to be processed; the data table to be processed comprises a reference data table and a data table to be connected; the information determining module is used for determining data table information of the data table to be processed; the data table information comprises the data quantity of a reference data table and the data line number of the reference data table; the judging module is used for judging whether the data volume of the reference data table is smaller than a data volume threshold value or not and judging whether the data line number of the reference data table is smaller than a line number threshold value or not; and the connection processing module is used for performing data table connection processing on the reference data table and the data table to be connected in a preset connection mode if the data amount of the reference data table is less than the data amount threshold and the data line number of the reference data table is less than the line number threshold.

Optionally, the data table connection processing apparatus further includes a data amount estimation module, configured to determine whether the reference data table is a compressed data table; if the reference data table is a compressed data table, determining the file type of the reference data table; and acquiring the file compression ratio, and performing pre-estimation processing on the real data volume of the reference data table according to the file compression ratio and the file type to determine the pre-estimated real data volume of the reference data table.

Optionally, the data table connection processing apparatus further includes a file backfill module, configured to obtain a configuration file, and determine a configuration parameter corresponding to the data table to be processed from the configuration file; and backfilling the configuration parameters according to the estimated real data volume to generate a system file corresponding to the data table to be processed.

Optionally, the information determining module includes a first information determining unit, configured to obtain metadata information corresponding to the to-be-processed data table, and determine whether the metadata information includes a preset data table reference field; if the metadata information comprises a data table reference field, acquiring a first data table field based on the metadata information; determining the data table information according to the first data table field.

Optionally, the information determining module includes a second information determining unit, configured to obtain a system file corresponding to the to-be-processed data table if the metadata information does not include the data table reference field; determining a second data table field corresponding to the data table to be processed according to the system file; determining the data table information from the second data table field.

Optionally, the connection processing module includes a connection processing unit, configured to determine a hash object corresponding to the reference data table; broadcasting the hash object to a plurality of computing nodes in a data broadcasting mode; and querying data in the hash object by the data table to be connected according to the target keyword so as to perform data table connection processing in a hash connection mode.

Optionally, the connection processing unit includes an object generation subunit, configured to determine a reference data table and a to-be-connected data table from the to-be-processed data table; determining a target keyword; the target keywords comprise keywords corresponding to data table connection processing of the reference data table and the data table to be connected; and performing object construction processing on the reference data table according to the target keyword to generate a hash object.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory having computer readable instructions stored thereon, the computer readable instructions when executed by the processor implementing the data table connection processing method according to any one of the above.

According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data table connection processing method according to any one of the above.

The technical scheme provided by the disclosure can comprise the following beneficial effects:

the data table connection processing method in the exemplary embodiment of the disclosure acquires a data table to be processed; the data table to be processed comprises a reference data table and a data table to be connected; determining data table information of a data table to be processed; the data table information comprises the data quantity of a reference data table and the data line number of the reference data table; judging whether the data quantity of the reference data table is smaller than a data quantity threshold value or not, and judging whether the data line number of the reference data table is smaller than a line number threshold value or not; and if the data volume of the reference data table is less than the data volume threshold and the data line number of the reference data table is less than the line number threshold, performing data table connection processing on the reference data table and the data table to be connected in a preset connection mode. According to the data table connection processing method, on one hand, whether the data table to be processed is subjected to table connection processing in a preset connection mode can be judged by combining the data quantity and the data line number of the reference data table, so that the judgment strategy for the data table connection processing is more reasonable. On the other hand, the reference data table meeting the preset condition is subjected to table connection processing in a preset connection mode, so that the occurrence probability of memory leakage can be greatly reduced in the table connection task execution process, and the task execution efficiency is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:

FIG. 1 schematically illustrates a flow chart of a data table join processing method according to an exemplary embodiment of the present disclosure;

FIG. 2 schematically illustrates a process diagram for determining an estimated real data amount of a reference data table according to an exemplary embodiment of the present disclosure;

FIG. 3 schematically illustrates a process diagram of backfilling according to estimated real data volume of a reference data table according to an exemplary embodiment of the present disclosure;

FIG. 4 schematically illustrates a process diagram for obtaining data table information when complete statistics are contained in the metadata information according to an exemplary embodiment of the present disclosure;

FIG. 5 schematically illustrates a process diagram for obtaining data table information when complete statistics are not contained in the metadata information according to an exemplary embodiment of the present disclosure;

FIG. 6 schematically illustrates a process diagram for hash join processing of a table of data to be processed according to an exemplary embodiment of the present disclosure;

FIG. 7 schematically illustrates a process diagram for determining a hash object corresponding to a reference data table according to an exemplary embodiment of the present disclosure;

FIG. 8 schematically illustrates a block diagram of a data table connection handling apparatus according to an exemplary embodiment of the present disclosure;

FIG. 9 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure;

FIG. 10 schematically illustrates a schematic diagram of a computer-readable storage medium according to an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.

The Spark calculation engine supports SQL queries on table data in the data warehouse tool Hive. When two tables are connected, whether hash join is enabled can be decided according to the size of the data. If hash join is enabled, the data of one table is built into a HashMap structure from the keyword (key) used for the connection and the other columns that are used; this data structure is stored in the memory of the driver node (Driver) and then broadcast to the memory of every executor node (executor). However, data density is not considered in this broadcasting process. For example, the same 10 MB of data may contain different numbers of lines depending on the data type: 100,000, 1,000,000, or even 10,000,000. With 10,000,000 lines, even simple data is amplified many times when it is turned into objects. If the data itself is compressed, the data density is even less stable, since different data types yield different numbers of data lines for the same data size. The direct consequence of performing data table connection processing without considering data density is therefore that the Spark task itself runs unstably, and that hash join frequently causes memory leakage in Driver memory and executor memory.
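The data density argument can be made concrete with a small calculation. The sketch below compares the same 10 MB of data at the three line counts mentioned above; the fixed per-line object overhead of 48 bytes is a purely hypothetical figure chosen for illustration and is not taken from the disclosure or from any measurement.

```python
TEN_MB = 10 * 1024 * 1024  # the 10 MB figure used in the text


def bytes_per_line(data_size_bytes: int, line_count: int) -> float:
    """Average payload carried by one data line."""
    return data_size_bytes / line_count


def estimated_object_bytes(data_size_bytes: int,
                           line_count: int,
                           overhead_per_line: int = 48) -> int:
    """Payload plus an assumed, fixed per-line object overhead (hypothetical)."""
    return data_size_bytes + line_count * overhead_per_line


if __name__ == "__main__":
    for lines in (100_000, 1_000_000, 10_000_000):
        total = estimated_object_bytes(TEN_MB, lines)
        print(f"{lines:>10} lines: {bytes_per_line(TEN_MB, lines):8.2f} B/line, "
              f"about {total / TEN_MB:.1f}x the original size as objects")
```

Under this assumption the amplification grows with the line count while the data size stays fixed, which is the reason a size threshold alone is an unreliable guard.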

Based on this, the present exemplary embodiment first provides a data table connection processing method, which may be implemented by a server or by a terminal device. The terminal described in the present disclosure may include a mobile terminal such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, or a Personal Digital Assistant (PDA), and a fixed terminal such as a desktop computer. FIG. 1 schematically illustrates the flow of a data table connection processing method according to some embodiments of the present disclosure. Referring to FIG. 1, the data table connection processing method may include the following steps:

Step S110, acquiring a data table to be processed; the data table to be processed comprises a reference data table and a data table to be connected.

Step S120, determining data table information of a data table to be processed; the data table information comprises the data quantity of the reference data table and the data line number of the reference data table.

Step S130, determining whether the data amount of the reference data table is smaller than the data amount threshold, and determining whether the data line number of the reference data table is smaller than the line number threshold.

Step S140, if the data amount of the reference data table is smaller than the data amount threshold and the data line number of the reference data table is smaller than the line number threshold, performing data table connection processing on the reference data table and the data table to be connected in a preset connection manner.

According to the data table connection processing method in the embodiment of the present invention, on one hand, whether to perform table connection processing on the data table to be processed in the preset connection mode can be determined by combining the data amount and the data line number of the reference data table, so that the determination policy for performing the table connection processing is more reasonable. On the other hand, the reference data table meeting the preset condition is subjected to table connection processing in a preset connection mode, so that the occurrence probability of memory leakage can be greatly reduced in the table connection task execution process, and the task execution efficiency is improved.
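Steps S130 and S140 reduce to a conjunction of two strict comparisons. The following is a minimal sketch of that judgment; the `TableInfo` type and the function name `should_use_preset_join` are hypothetical, and the default thresholds of 10 MB and 500,000 lines are the example defaults mentioned in this description.

```python
from dataclasses import dataclass

DEFAULT_SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB example default
DEFAULT_LINE_THRESHOLD = 500_000                 # 500,000 lines example default


@dataclass(frozen=True)
class TableInfo:
    """Data table information of the reference data table."""
    data_size_bytes: int
    line_count: int


def should_use_preset_join(info: TableInfo,
                           size_threshold: int = DEFAULT_SIZE_THRESHOLD_BYTES,
                           line_threshold: int = DEFAULT_LINE_THRESHOLD) -> bool:
    """Both conditions must hold: data volume AND data line number below threshold."""
    return info.data_size_bytes < size_threshold and info.line_count < line_threshold


if __name__ == "__main__":
    dense = TableInfo(data_size_bytes=5 * 1024 * 1024, line_count=10_000_000)
    sparse = TableInfo(data_size_bytes=5 * 1024 * 1024, line_count=100_000)
    print(should_use_preset_join(dense), should_use_preset_join(sparse))
```

A table that passes the size check but holds too many lines is rejected, which is the case a size-only strategy would miss.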

Next, a data table join processing method in the present exemplary embodiment will be further described.

In step S110, a to-be-processed data table is acquired; the data table to be processed comprises a reference data table and a data table to be connected.

In some exemplary embodiments of the present disclosure, aggregating information from different fields and scenarios is an operation that occurs frequently during big data calculation. The data table to be processed may be a data table on which data table connection processing is to be performed. In a typical big data processing flow, various libraries and tables may be constructed with the data warehouse tool Hive, which also provides an SQL language for query analysis of the data. The data table to be processed may be a Hive data table. The present disclosure explains data table connection processing in detail by taking the connection of two data tables as an example. The data table connection mode may include a hash connection (hash join) mode, a nested loop connection (nested loop join) mode, and a sort merge connection (sort merge join) mode. For example, when data table connection processing is performed on the data table to be processed in the hash join mode, the reference data table may be the hash table of the hash join, also referred to as the small table; the data table to be connected may be the other data table.

Before the table connection processing is performed, the reference data table and the data table to be connected may be acquired to perform the data table connection processing on the reference data table and the data table to be connected.

In step S120, determining data table information of the data table to be processed; the data table information comprises the data quantity of the reference data table and the data line number of the reference data table.

In some exemplary embodiments of the present disclosure, the data table information may be information reflecting the data size, the data line number, and similar properties of the data table to be processed. The data volume of the reference data table may represent how much data the reference data table holds. The data line number of the reference data table may represent the number of data lines contained in the reference data table.

According to some exemplary embodiments of the present disclosure, it is determined whether the reference data table is a compressed data table; if the reference data table is a compressed data table, the file type of the reference data table is determined; and the file compression ratio is acquired, and the real data volume of the reference data table is estimated according to the file compression ratio and the file type to determine the estimated real data volume of the reference data table. The compressed data table may be a data table that has undergone compression processing. Because some data is very large in a big data scenario, a data table may be compressed to generate a compressed data table, on which the related data processing operations are then performed. The file type may be the type of the specific file corresponding to the reference data table; for example, the file type of the data table may include an Optimized Row Columnar (ORC) file, a Lempel-Ziv-Oberhumer (LZO) file, a gzip (gz) compressed file as used on UNIX systems, and the like. The file compression ratio may be the compression ratio that applies to data tables of a given file type when data compression processing is performed. For example, the file compression ratio of an ORC file may be more than 10 times, the file compression ratio of an LZO file may be more than 3 times, and so on. The real data volume may be the size of the data actually held by the reference data table. The estimated real data volume may be the data volume obtained by estimating the data volume corresponding to the reference data table.

Referring to FIG. 2, FIG. 2 schematically shows a process diagram for determining the estimated real data volume of the reference data table. In step S210, when the reference data table is acquired, it may be determined whether the reference data table is a compressed data table formed by compression processing. In step S220, if the reference data table is a compressed data table, the file type corresponding to the reference data table may be determined. In step S230, after the file type is determined, the file compression ratio corresponding to that file type may be obtained, and the real data volume of the reference data table may be estimated according to the obtained file compression ratio and the file type; the estimation result is taken as the estimated real data volume of the reference data table.
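Steps S210 to S230 can be sketched as a lookup followed by a multiplication. In the sketch below, the ratios 10 (ORC) and 3 (LZO) are the example figures given above and are lower bounds ("more than"), so the result is a conservative estimate; the function name, the `COMPRESSION_RATIOS` table, and the choice to raise an error for an unlisted file type are assumptions for illustration.

```python
from typing import Optional

# Example ratios from the description; treated here as lower-bound multipliers.
COMPRESSION_RATIOS = {"orc": 10.0, "lzo": 3.0}


def estimate_real_data_size(stored_size_bytes: int,
                            is_compressed: bool,
                            file_type: Optional[str] = None) -> int:
    """Estimate the uncompressed data volume of the reference data table."""
    # S210: an uncompressed table needs no estimation.
    if not is_compressed:
        return stored_size_bytes
    # S220: the file type selects the compression ratio.
    if file_type is None or file_type.lower() not in COMPRESSION_RATIOS:
        raise ValueError(f"no compression ratio configured for file type {file_type!r}")
    ratio = COMPRESSION_RATIOS[file_type.lower()]
    # S230: stored size multiplied by the ratio gives the estimated real size.
    return int(stored_size_bytes * ratio)


if __name__ == "__main__":
    print(estimate_real_data_size(2 * 1024 * 1024, True, "ORC"))
```

In this sketch a 2 MB ORC file is estimated at 20 MB or more of real data, so it would not pass a 10 MB data volume threshold even though its stored size would.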

According to another exemplary embodiment of the present disclosure, a configuration file is obtained, and a configuration parameter corresponding to a to-be-processed data table is determined from the configuration file; and backfilling the configuration parameters according to the estimated real data volume to generate a system file corresponding to the data table to be processed. The configuration file may be a file that may store a threshold corresponding to the data table during the data table connection process. The configuration parameters may be data table parameters corresponding to the data table to be processed. The backfilling process may be a process of performing assignment operation on some configuration parameters in the configuration file according to the determined data information corresponding to the data table. The system file may be a configuration file after the backfill process.

In the data broadcasting phase of the data table connection processing, a configuration file may be set in the Spark calculation engine. The configuration file may include a line number threshold to be compared with the data line number of the reference data table and a data volume threshold to be compared with the data volume of the reference data table. For example, the configuration file may be named SparkConf, the line number threshold may be named spark.sql.autoBroadcastJoinRowNumberThreshold, and the data volume threshold may be named spark.sql.autoBroadcastJoinThreshold; the line number threshold may default to 500,000, and the data volume threshold may default to 10 megabytes (MB).
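How the two thresholds might be read from such a configuration can be sketched without any Spark dependency by treating the configuration as a plain string-to-string mapping. The key strings below are assumptions modeled on the parameter names in the text (the line number key in particular is specific to this disclosure and its exact spelling is assumed); `Thresholds` and `load_thresholds` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping

# Assumed key spellings, modeled on the parameter names given in the text.
SIZE_KEY = "spark.sql.autoBroadcastJoinThreshold"
LINE_KEY = "spark.sql.autoBroadcastJoinRowNumberThreshold"


@dataclass(frozen=True)
class Thresholds:
    size_threshold_bytes: int
    line_threshold: int


def load_thresholds(conf: Mapping[str, str]) -> Thresholds:
    """Read both thresholds, falling back to the example defaults."""
    size = int(conf.get(SIZE_KEY, 10 * 1024 * 1024))  # default: 10 MB
    lines = int(conf.get(LINE_KEY, 500_000))          # default: 500,000 lines
    return Thresholds(size_threshold_bytes=size, line_threshold=lines)


if __name__ == "__main__":
    print(load_thresholds({}))
    print(load_thresholds({LINE_KEY: "100000"}))
```

Keeping both values in one immutable object makes it harder to apply the size check without the line number check.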

Further, a data table real data size parameter may be contained in the configuration file, and the parameter may be named as rawDataSize. Referring to fig. 3, fig. 3 schematically shows a process diagram of the backfill process according to the estimated real data amount of the reference data table. In step S310, after the configuration file is obtained from the Spark calculation engine, the configuration parameters may be determined from the configuration file. In step S320, the configuration parameters are backfilled according to the estimated real data amount, for example, the rawDataSize parameters may be backfilled according to the estimated real data amount, and after the backfilling is completed, a system file corresponding to the to-be-processed data table may be generated.
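The backfilling of steps S310 and S320 amounts to assigning the estimated value to the `rawDataSize` parameter and emitting the result. The sketch below represents the configuration as a dictionary and the system file as JSON text; both representations, and the function names, are assumptions for illustration, since the disclosure does not specify a file format.

```python
import json
from typing import Any, Dict


def backfill_raw_data_size(config: Dict[str, Any],
                           estimated_real_size: int) -> Dict[str, Any]:
    """Return a copy of the configuration with rawDataSize backfilled."""
    system_file = dict(config)  # leave the original configuration untouched
    system_file["rawDataSize"] = estimated_real_size
    return system_file


def render_system_file(system_file: Dict[str, Any]) -> str:
    """Serialize the backfilled parameters (JSON is an assumed format)."""
    return json.dumps(system_file, sort_keys=True)


if __name__ == "__main__":
    config = {"numRows": 120_000, "totalSize": 2_097_152, "rawDataSize": None}
    print(render_system_file(backfill_raw_data_size(config, 20_971_520)))
```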

According to another exemplary embodiment of the present disclosure, metadata information corresponding to the data table to be processed is obtained, and it is judged whether the metadata information includes a preset data table reference field; if the metadata information includes the data table reference field, a first data table field is acquired based on the metadata information; and the data table information is determined according to the first data table field. The metadata information may be information reflecting data-volume-related properties of the data table. The preset data table reference field is the statistical information on the basis of which the connection mode used for table connection processing of the data table to be processed can be chosen. For example, the data table reference field may include the number of data lines, the compressed data volume of the data table, the real data volume of the data table, and the like. It should be noted that the data table reference field may be set according to the application scenario, and the present disclosure places no special limitation on it. Before connection processing is performed on the data table to be processed, it may be judged whether the metadata information contains all the preset data table reference fields, and the data table connection processing is then performed. The first data table field may be a data table field obtained from the metadata information of the data table to be processed when the metadata information includes the complete data table reference fields.

Referring to fig. 4, fig. 4 schematically shows a process of acquiring the data table information when the metadata information includes complete statistical information. In step S410, when Spark SQL is started, the metadata information of the data table to be processed that exists in Hive may be loaded, such as the number of data lines (numRows) of the data table, the compressed data size (totalSize) of the data table, the real data size (rawDataSize) of the data table, and the like. After the metadata information is acquired, it may be judged whether the metadata information contains all the preset data table reference fields. In step S420, if the metadata information includes all the data table reference fields, the statistical information in the metadata information may be considered complete, and the first data table field may be obtained from the metadata information. In step S430, the corresponding data table information is determined according to the first data table field, and whether to perform the hash connection processing operation on the data table to be processed is determined according to the determined data table information.

According to still another exemplary embodiment of the present disclosure, if the metadata information does not include the data table reference field, acquiring a system file corresponding to the to-be-processed data table; determining a second data table field corresponding to the data table to be processed according to the system file; determining the data table information from the second data table field. The second data table field may be a data table field obtained from the system file when the metadata information of the to-be-processed data table does not include a complete data table reference field.

Referring to fig. 5, fig. 5 schematically shows a process of acquiring the data table information when the metadata information does not include complete statistical information. In step S510, if the metadata information does not include the data table reference field, the relevant data table statistical information may be considered missing from the metadata information, and therefore the system file corresponding to the data table to be processed may be acquired. In step S520, a second data table field corresponding to the data table to be processed may be determined from the system file. For example, during task execution, the number of data lines of the data table to be processed may be obtained, and the missing data table field information may be supplemented by reading information such as the file size from the Hadoop Distributed File System (HDFS). In step S530, the data table information may be determined according to the supplemented data table field, so as to determine whether to perform the hash connection processing operation on the data table to be processed according to the data table information.
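The two acquisition paths of fig. 4 and fig. 5 can be sketched as a single lookup with a fallback. This is only an illustrative sketch: it assumes the metadata information and the system file are both available as mappings keyed by the field names given in the description, and the helper names `has_complete_statistics` and `resolve_table_info` are hypothetical.

```python
from typing import Dict, Optional

# Preset data table reference fields named in the description.
REFERENCE_FIELDS = ("numRows", "totalSize", "rawDataSize")


def has_complete_statistics(metadata: Dict[str, Optional[int]]) -> bool:
    """True only if every preset reference field is present and not None."""
    return all(metadata.get(field) is not None for field in REFERENCE_FIELDS)


def resolve_table_info(metadata: Dict[str, Optional[int]],
                       system_file: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
    """Use the metadata when its statistics are complete (fig. 4);
    otherwise fall back to the system file (fig. 5)."""
    source = metadata if has_complete_statistics(metadata) else system_file
    return {"data_amount": source["rawDataSize"], "line_count": source["numRows"]}


complete = {"numRows": 10, "totalSize": 100, "rawDataSize": 300}
partial = {"numRows": 10}                       # statistics missing
fallback = {"numRows": 12, "rawDataSize": 360}  # assumed system file content
```

In this sketch a field counts as missing when it is absent or `None`; a real metastore may mark missing statistics differently.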

In step S130, it is determined whether the data amount of the reference data table is smaller than the data amount threshold, and it is determined whether the number of data lines of the reference data table is smaller than the number of lines threshold.

In some exemplary embodiments of the present disclosure, the data amount threshold may be a threshold against which the data amount of the reference data table is compared. The line number threshold may be a threshold against which the number of data lines of the reference data table is compared.

After the data amount and the data line number of the reference data table are obtained, the data amount may be compared with the data amount threshold and the data line number with the line number threshold, so that whether to perform hash connection processing on the data table to be processed is determined according to the comparison results.

In step S140, if the data amount of the reference data table is smaller than the data amount threshold and the data line number of the reference data table is smaller than the line number threshold, performing data table connection processing on the reference data table and the data table to be connected in a preset connection manner.

In some exemplary embodiments of the present disclosure, the preset connection manner may be a hash connection manner, and the table connection processing may be performed on the data table to be processed through the hash connection manner. And if the data quantity of the reference data table is smaller than the data quantity threshold value and the data line number of the reference data table is smaller than the line number threshold value, performing data table connection processing on the reference data table and the data table to be connected in a hash connection mode.

For example, when determining whether to perform a hash join (hashjoin) on a table, the determination may be made according to the following improved policy: the data amount threshold may be set to 10 MB and the line number threshold to 100,000 lines. If the data amount of the fields participating in the calculation is less than 10 MB and the number of data lines is less than the set value of 100,000, then according to the calculation model it can be roughly judged that the data used for the hashjoin occupies less than 30% of the available memory of the Driver during task execution, which guarantees memory stability during task execution. The Driver node may perform process services such as task starting, initialization, task scheduling, and resource application for the Spark task.
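The judgment of steps S130 and S140 with these example thresholds can be sketched as follows. The sketch assumes that 10 MB means 10 × 1024 × 1024 bytes and that both comparisons are strict; the function name `should_use_hash_join` is hypothetical.

```python
# Example thresholds from the description; the byte interpretation of
# "10 MB" is an assumption made for this sketch.
DATA_AMOUNT_THRESHOLD_BYTES = 10 * 1024 * 1024
LINE_NUMBER_THRESHOLD = 100_000


def should_use_hash_join(data_amount_bytes: int,
                         line_count: int,
                         data_amount_threshold: int = DATA_AMOUNT_THRESHOLD_BYTES,
                         line_number_threshold: int = LINE_NUMBER_THRESHOLD) -> bool:
    """Hash join is chosen only when BOTH values are below their thresholds."""
    return (data_amount_bytes < data_amount_threshold
            and line_count < line_number_threshold)
```

A table of 5 MB with 50,000 lines passes both checks, while a table of 5 MB with 10 million lines is rejected on the line number alone, which is the case a data-amount-only check would miss.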

It should be noted that the proportion of the Driver's available memory that the hashjoin data may occupy during task execution may take the above value, or may be set manually to a chosen proportion value, so as to ensure the stability of the memory when the task is executed.
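A configurable proportion of this kind can be expressed as a simple budget check. The sketch below is illustrative only: the function name `within_driver_budget`, its parameters, and the way the Driver's available memory is obtained are assumptions, with 0.30 used as the default proportion from the example above.

```python
def within_driver_budget(hash_data_bytes: int,
                         driver_available_bytes: int,
                         max_fraction: float = 0.30) -> bool:
    """Check that the hash join data stays below the allowed share of
    the Driver's available memory."""
    if not 0.0 < max_fraction <= 1.0:
        raise ValueError("max_fraction must be in the interval (0, 1]")
    return hash_data_bytes < driver_available_bytes * max_fraction


ONE_GIB = 1024 ** 3
ok = within_driver_budget(200 * 1024 ** 2, ONE_GIB)        # 200 MiB of 1 GiB
too_big = within_driver_budget(400 * 1024 ** 2, ONE_GIB)   # 400 MiB of 1 GiB
relaxed = within_driver_budget(400 * 1024 ** 2, ONE_GIB, max_fraction=0.5)
```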

Data density may refer to the following: when text files of the same size are stored in rows, different reference data tables have different total numbers of rows because their row sizes differ. Because of this difference in data density, the hash objects obtained by converting reference data tables may differ greatly in size. For example, suppose the number of data lines of reference data table A is 100,000 and that of reference data table B is 10 million. In converting a reference data table into a hash object, each data line may be converted into one object, and each object has a fixed data object size, so the data amount of the two reference data tables may be amplified many times during conversion. Therefore, before the table connection processing is performed on the data table to be processed, the above determination process may be performed.
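The amplification effect can be made concrete with a small calculation. The sketch assumes a simple model in which the hash object size is the raw data amount plus a fixed per-line object overhead; the 64-byte overhead and the 8,000,000-byte file size are hypothetical values chosen only for illustration.

```python
def estimate_hash_object_bytes(raw_bytes: int,
                               line_count: int,
                               per_line_overhead_bytes: int = 64) -> int:
    """Raw data plus a fixed object overhead for every data line."""
    return raw_bytes + line_count * per_line_overhead_bytes


# Two tables with the same raw size but different data density.
table_a = estimate_hash_object_bytes(8_000_000, 100_000)     # 14,400,000 bytes
table_b = estimate_hash_object_bytes(8_000_000, 10_000_000)  # 648,000,000 bytes
```

Under these assumptions table B grows to 45 times the size of table A after conversion even though both start from the same data amount, which is why the line number is checked in addition to the data amount.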

According to some exemplary embodiments of the present disclosure, a hash object corresponding to the reference data table is determined; the hash object is broadcast to a plurality of computing nodes in a data broadcasting mode; and the data table to be connected queries data in the hash object according to the target keyword, so as to perform data table connection processing in a hash connection mode. The hash object may be a data object generated by performing object construction processing on the data lines in the reference data table. The data broadcasting mode may be a mode in which the hash object is broadcast to all the computing nodes, so that the computing nodes can perform data calculation according to the hash object. The target keyword may be a keyword used when data table connection processing is performed on the reference data table and the data table to be connected. The hash connection mode may be a processing algorithm used by a database for multi-table connection. For two tables, a hash join may build a hash table from the smaller of the two tables (called S) according to the connection conditions and place it in memory, then scan each line of data of the other table (called M) and map it into the hash table; in this way, the lines of the S table that match the M table can be obtained quickly.
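The build-and-probe behaviour of a hash join described above can be sketched in a few lines. This is a generic in-memory inner join on one key, given only for illustration; it is not the Spark implementation, and the row layout (one mapping per data line) is an assumption.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def hash_join(small_rows: List[Row], large_rows: List[Row], key: str) -> List[Row]:
    """Inner join: build a hash table on S, then probe it with each line of M."""
    # Build phase: the small table S is held in memory, grouped by join key.
    table: Dict[object, List[Row]] = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)

    # Probe phase: scan the other table M once and look up matching S lines.
    joined: List[Row] = []
    for row in large_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


s_table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
m_table = [{"id": 1, "v": 10}, {"id": 3, "v": 30}, {"id": 1, "v": 11}]
result = hash_join(s_table, m_table, "id")
```

Each line of M costs one dictionary lookup, which is why the approach is fast as long as the hash table built from S fits in memory.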

Referring to fig. 6, fig. 6 schematically shows a process of hash join processing for a data table to be processed. In step S610, after the reference data table is acquired, object construction processing may be performed on the data lines in the reference data table to generate the hash object corresponding to the reference data table. For data tables with the same data size, the hash objects corresponding to the reference data tables differ in size because of factors such as different data types and data columns. In step S620, the generated hash object is broadcast to all the computing nodes in a data broadcasting mode. In the Spark calculation engine, all work in Spark can be abstracted into subtasks, namely tasks, and a task may be the smallest unit of computational logic. A computing node, also known as an Executor node, may be responsible for starting tasks. In step S630, the data table to be connected that is stored on a computing node may query the data of the reference data table according to the target keyword; that is, the M table looks up the data of the small table as if consulting a dictionary and associates the data, so that table connection processing is performed on the reference data table and the data table to be connected in a hash connection mode.
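Steps S620 and S630 can be simulated without a cluster. In the sketch below each inner list stands in for the part of the data table to be connected that is stored on one Executor node, and copying the hash object stands in for the broadcast; the function name `probe_on_nodes` and the data layout are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def probe_on_nodes(hash_object: Dict[object, List[Row]],
                   node_partitions: List[List[Row]],
                   key: str) -> List[List[Row]]:
    """Give every simulated node its own copy of the hash object (S620)
    and let it join its local lines against that copy (S630)."""
    results: List[List[Row]] = []
    for partition in node_partitions:
        local_copy = dict(hash_object)  # stands in for the broadcast copy
        joined: List[Row] = []
        for row in partition:
            for match in local_copy.get(row[key], []):
                joined.append({**match, **row})
        results.append(joined)
    return results


hash_object = {1: [{"id": 1, "name": "a"}]}
partitions = [
    [{"id": 1, "v": 10}],
    [{"id": 2, "v": 20}, {"id": 1, "v": 30}],
]
per_node = probe_on_nodes(hash_object, partitions, "id")
```

Because every node holds a full copy, the memory cost of the hash object is paid once per node, which is why its size is checked before this connection mode is chosen.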

With this judgment strategy applied to the data table to be processed, the task is executed with due account taken of the Driver memory and execution memory that the data table actually occupies after data broadcasting, so that the stability of the service process is ensured to a great extent, memory overflow is avoided, and the execution success rate of the task is improved.

According to another exemplary embodiment of the present disclosure, a reference data table and a to-be-connected data table are determined from the to-be-processed data table; determining a target keyword; the target keywords comprise keywords corresponding to data table connection processing of the reference data table and the data table to be connected; and performing object construction processing on the reference data table according to the target keyword to generate a hash object. The object building process may be a process of processing data rows in the reference data table to abstract into a hash object. The target keyword, i.e., the key, may be a keyword corresponding to the data table connection processing performed on the reference data table and the data table to be connected, and may determine a connection key used in data connection from the reference data table as the target keyword.

Referring to fig. 7, fig. 7 schematically shows a process diagram for determining a hash object corresponding to a reference data table. In step S710, in acquiring the to-be-processed data table, a reference data table and a to-be-connected data table may be respectively determined from the to-be-processed data table. In step S720, a target keyword used when the reference data table and the to-be-connected data table are subjected to the table connection processing is determined. In step S730, an object construction process is performed on the reference data table according to the determined target keyword and the field data that needs to be used, so as to construct a hash object corresponding to the reference data table.
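The object construction of step S730 can be sketched as grouping the reference data table by the target keyword while keeping only the fields that are needed. The function name `build_hash_object` and the representation of a data line as a mapping are assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def build_hash_object(reference_rows: List[Row],
                      target_key: str,
                      needed_fields: Sequence[str]) -> Dict[object, List[Row]]:
    """Group the reference data table by the target keyword, storing only
    the fields that the connection actually uses."""
    hash_object: Dict[object, List[Row]] = {}
    for row in reference_rows:
        projected = {field: row[field] for field in needed_fields}
        hash_object.setdefault(row[target_key], []).append(projected)
    return hash_object


reference_table = [
    {"id": 1, "name": "a", "note": "x"},
    {"id": 1, "name": "b", "note": "y"},
    {"id": 2, "name": "c", "note": "z"},
]
hash_obj = build_hash_object(reference_table, "id", ["name"])
```

Dropping unused fields such as `note` keeps the hash object smaller, which matters because the object is held in memory on every node that receives it.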

In conclusion, a data table to be processed is obtained, the data table to be processed comprising a reference data table and a data table to be connected; data table information of the data table to be processed is determined, the data table information comprising the data amount of the reference data table and the data line number of the reference data table; it is judged whether the data amount of the reference data table is smaller than a data amount threshold and whether the data line number of the reference data table is smaller than a line number threshold; and if the data amount of the reference data table is smaller than the data amount threshold and the data line number of the reference data table is smaller than the line number threshold, data table connection processing is performed on the reference data table and the data table to be connected in a preset connection mode. According to the data table connection processing method of the present disclosure, first, whether to apply the hash connection mode to the data table to be processed is judged by jointly considering the data amount, the data line number, and the like, so that the scheme for determining the data table connection mode is more reasonable. Second, the user can configure parameters such as the data amount and the data line number of the data table according to the actual data density, so that the configuration of the data table is more flexible and applicable. Third, the reference data table meeting the preset condition is subjected to table connection processing in the preset connection mode, so that the probability of memory leakage can be greatly reduced and the task execution efficiency can be improved during execution of the table connection task. Fourth, in some extreme data scenarios, specific tables can be controlled so that they are connected without using the hash connection mode, which can improve the execution efficiency of data table connection.

It is noted that although the steps of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

In addition, in the present exemplary embodiment, a data table connection processing apparatus is also provided. Referring to fig. 8, the data table connection processing apparatus 800 may include: a data table obtaining module 810, an information determining module 820, a judging module 830 and a connection processing module 840.

Specifically, the data table obtaining module 810 may be configured to obtain a data table to be processed; the data table to be processed comprises a reference data table and a data table to be connected; the information determining module 820 may be configured to determine the data table information of the data table to be processed; the data table information comprises the data quantity of a reference data table and the data line number of the reference data table; the determining module 830 may be configured to determine whether the data amount of the reference data table is smaller than the data amount threshold, and determine whether the number of data lines of the reference data table is smaller than the number of line threshold; the connection processing module 840 may be configured to perform data table connection processing on the reference data table and the data table to be connected in a preset connection manner if the data amount of the reference data table is smaller than the data amount threshold and the data line number of the reference data table is smaller than the line number threshold.

The data table connection processing apparatus 800 can determine a reference data table and a data table to be connected from the data table to be processed, compare the data amount of the determined reference data table with a data amount threshold, compare the data line number of the reference data table with a line number threshold, and determine, according to the comparison results, whether to perform table connection processing on the reference data table and the data table to be connected in a preset connection mode. This can greatly reduce the memory leakage that is easily caused during table connection processing by factors such as unstable data density of the data table to be processed, and the data table connection processing apparatus is therefore effective.

In an exemplary embodiment of the present disclosure, the data table connection processing apparatus further includes a data amount estimation module for determining whether the reference data table is a compressed data table; if the reference data table is a compressed data table, determining the file type of the reference data table; and acquiring the file compression ratio, and performing pre-estimation processing on the real data volume of the reference data table according to the file compression ratio and the file type to determine the pre-estimated real data volume of the reference data table.

In an exemplary embodiment of the present disclosure, the data table connection processing apparatus further includes a file backfill module, configured to obtain a configuration file, and determine a configuration parameter corresponding to the data table to be processed from the configuration file; and backfilling the configuration parameters according to the estimated real data volume to generate a system file corresponding to the data table to be processed.

In an exemplary embodiment of the present disclosure, the information determining module includes a first information determining unit, configured to obtain metadata information corresponding to a to-be-processed data table, and determine whether the metadata information includes a preset data table reference field; if the metadata information comprises a data table reference field, acquiring a first data table field based on the metadata information; determining the data table information according to the first data table field.

In an exemplary embodiment of the present disclosure, the information determining module includes a second information determining unit, configured to acquire a system file corresponding to the to-be-processed data table if the metadata information does not include the data table reference field; determining a second data table field corresponding to the data table to be processed according to the system file; determining the data table information from the second data table field.

In an exemplary embodiment of the present disclosure, the connection processing module includes a connection processing unit for determining a hash object corresponding to the reference data table; broadcasting the hash object to a plurality of computing nodes in a data broadcasting mode; and querying data in the hash object by the data table to be connected according to the target keyword so as to perform data table connection processing in a hash connection mode.

In an exemplary embodiment of the present disclosure, the connection processing unit includes an object generation subunit configured to determine a reference data table and a to-be-connected data table from the to-be-processed data table; determining a target keyword; the target keywords comprise keywords corresponding to data table connection processing of the reference data table and the data table to be connected; and performing object construction processing on the reference data table according to the target keyword to generate a hash object.

The specific details of each module of the data table connection processing apparatus have already been described in detail in the corresponding data table connection processing method, and are therefore not repeated here.

It should be noted that although several modules or units of the data table connection processing apparatus are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."

An electronic device 900 according to such an embodiment of the invention is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.

As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of the electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.

Wherein the storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform steps according to various exemplary embodiments of the present invention described in the above section "exemplary methods" of the present specification.

The storage unit 920 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 921 and/or a cache memory unit 922, and may further include a read only memory unit (ROM) 923.

Storage unit 920 may include a program/utility 924 having a set (at least one) of program modules 925, such program modules 925 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 930 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 900 may also communicate with one or more external devices 970 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.

Referring to fig. 10, a program product 1000 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims

1. A data table connection processing method is characterized by comprising the following steps:

acquiring a data table to be processed; the data table to be processed comprises a reference data table and a data table to be connected;

determining data table information of the data table to be processed according to metadata information or system files corresponding to the data table to be processed; wherein the data table information includes the data amount of the reference data table and the data line number of the reference data table; the data amount of the reference data table is determined according to the file type and the file compression rate of the reference data table;

judging whether the data volume of the reference data table is smaller than a data volume threshold value or not, and judging whether the data line number of the reference data table is smaller than a line number threshold value or not;

if the data volume of the reference data table is smaller than a data volume threshold value and the data line number of the reference data table is smaller than a line number threshold value, performing data table connection processing on the reference data table and the data table to be connected in a preset connection mode;

wherein determining the data table information according to the metadata information corresponding to the data table to be processed includes: acquiring the metadata information corresponding to the data table to be processed, and judging whether the metadata information comprises a preset data table reference field; if the metadata information comprises the data table reference field, acquiring a first data table field based on the metadata information; and determining the data table information according to the first data table field;

wherein determining the data table information according to the system file corresponding to the data table to be processed includes: if the metadata information does not comprise the data table reference field, acquiring the system file corresponding to the data table to be processed; determining a second data table field corresponding to the data table to be processed according to the system file; and determining the data table information according to the second data table field;

the system file is generated by the following method:

estimating the real data volume of the reference data table according to the file compression rate and the file type to determine the estimated real data volume of the reference data table; acquiring a configuration file, and determining configuration parameters corresponding to the data table to be processed from the configuration file; and backfilling the configuration parameters according to the estimated real data volume to generate a system file corresponding to the data table to be processed.
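By way of non-limiting illustration, the judgment flow of claim 1 might be sketched as follows. The field names (`total_size`, `num_rows`), the compression-rate table, and every function name are hypothetical and do not come from the claims; the sketch also assumes that, when the metadata carries the reference fields, its data volume is already an estimate of the real (uncompressed) volume.

```python
from dataclasses import dataclass

# Hypothetical compression rates (compressed size / real size) per file type.
# The claims do not specify concrete values.
ASSUMED_COMPRESSION_RATE = {"text": 1.0, "orc": 0.25, "parquet": 0.5}

# Hypothetical names for the "preset data table reference fields".
REFERENCE_FIELDS = ("total_size", "num_rows")


@dataclass
class TableInfo:
    data_volume: int  # estimated real data volume, in bytes
    row_count: int    # data line number of the reference data table


def estimate_real_volume(file_bytes: int, file_type: str) -> int:
    """Estimate the real data volume from on-disk size, file type and compression rate."""
    return int(file_bytes / ASSUMED_COMPRESSION_RATE[file_type])


def generate_system_file(file_bytes: int, file_type: str, row_count: int, config: dict) -> dict:
    """Backfill the configuration parameters with the estimated real data volume."""
    system_file = dict(config)  # copy so the configuration itself is not modified
    system_file["total_size"] = estimate_real_volume(file_bytes, file_type)
    system_file["num_rows"] = row_count
    return system_file


def determine_table_info(metadata: dict, system_file: dict) -> TableInfo:
    """Use the metadata if it contains the reference fields, otherwise the system file."""
    has_fields = all(field in metadata for field in REFERENCE_FIELDS)
    source = metadata if has_fields else system_file
    return TableInfo(source["total_size"], source["num_rows"])


def should_use_preset_join(info: TableInfo, volume_threshold: int, row_threshold: int) -> bool:
    """Both values must be strictly below their thresholds."""
    return info.data_volume < volume_threshold and info.row_count < row_threshold
```

In this sketch a reference data table that fails either comparison is simply not joined in the preset connection mode; what happens instead is outside the scope of the example.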

2. The data table connection processing method according to claim 1, wherein the preset connection mode includes a hash connection mode, and performing data table connection processing on the reference data table and the data table to be connected in the preset connection mode includes:

determining a hash object corresponding to the reference data table;

broadcasting the hash object to a plurality of computing nodes in a data broadcasting mode;

and querying, for the data table to be connected, data in the hash object according to the target keyword, so as to perform the data table connection processing in the hash connection mode.
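By way of non-limiting illustration, the hash connection of claim 2 might be sketched in a single process as follows. The "computing nodes" are simulated by a list of partitions of the data table to be connected, and broadcasting is simulated by giving each partition its own copy of the hash object; all function names are hypothetical, and an inner join is assumed because the claims do not state the join type.

```python
from collections import defaultdict


def build_hash_object(reference_rows, key):
    """Index the rows of the reference data table by the target keyword."""
    hash_object = defaultdict(list)
    for row in reference_rows:
        hash_object[row[key]].append(row)
    return dict(hash_object)


def probe_partition(partition, hash_object, key):
    """Work done on one node: look up each row of the table to be connected."""
    joined = []
    for row in partition:
        # Rows without a match are dropped (inner join, assumed).
        for match in hash_object.get(row[key], []):
            joined.append({**match, **row})
    return joined


def broadcast_hash_join(reference_rows, partitions, key):
    """Build the hash object once, hand a copy to every partition, then probe."""
    hash_object = build_hash_object(reference_rows, key)
    results = []
    for partition in partitions:
        local_copy = dict(hash_object)  # stands in for the broadcast copy on a node
        results.extend(probe_partition(partition, local_copy, key))
    return results
```

Because every node holds the whole hash object, the data table to be connected does not need to be redistributed by key, which is why the memory footprint of the reference data table matters.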

3. The data table connection processing method according to claim 2, wherein the determining a hash object corresponding to the reference data table includes:

determining the reference data table and the data table to be connected from the data table to be processed;

determining a target keyword, wherein the target keyword comprises a keyword on which the data table connection processing of the reference data table and the data table to be connected is performed;

and performing object construction processing on the reference data table according to the target keyword to generate the hash object.
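By way of non-limiting illustration, the object construction of claim 3 might be sketched as follows for a target keyword made up of several columns. Treating the table with fewer rows as the reference data table is an assumption of this sketch, not something the claim states, and all function names are hypothetical.

```python
def split_tables(tables):
    """Assumption: of the two tables, the one with fewer rows is the reference table."""
    reference, to_connect = sorted(tables, key=len)
    return reference, to_connect


def target_key(row, key_columns):
    """Combine the join columns of one row into a single hashable key."""
    return tuple(row[column] for column in key_columns)


def build_hash_object(reference_rows, key_columns):
    """Group the reference rows under their target keyword."""
    hash_object = {}
    for row in reference_rows:
        hash_object.setdefault(target_key(row, key_columns), []).append(row)
    return hash_object
```

Rows that share the same key are kept together in a list, so duplicate keys in the reference data table are preserved rather than overwritten.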

4. A data table connection processing apparatus, comprising:

the data table acquisition module is used for acquiring a data table to be processed; the data table to be processed comprises a reference data table and a data table to be connected;

the information determining module is used for determining the data table information of the data table to be processed according to the metadata information or the system file corresponding to the data table to be processed; wherein the data table information includes the data volume of the reference data table and the data line number of the reference data table; the data volume of the reference data table is determined according to the file type and the file compression rate of the reference data table;

the judging module is used for judging whether the data volume of the reference data table is smaller than a data volume threshold value or not and judging whether the data line number of the reference data table is smaller than a line number threshold value or not;

the connection processing module is used for performing data table connection processing on the reference data table and the data table to be connected in a preset connection mode if the data volume of the reference data table is smaller than a data volume threshold and the data line number of the reference data table is smaller than a line number threshold;

the information determining module comprises a first information determining unit, a second information determining unit and a processing unit, wherein the first information determining unit is used for acquiring metadata information corresponding to the to-be-processed data table and judging whether the metadata information comprises a preset data table reference field or not; if the metadata information comprises the data table reference field, acquiring a first data table field based on the metadata information; determining the data table information according to the first data table field;

the information determining module comprises a second information determining unit, which is used for acquiring a system file corresponding to the data table to be processed if the metadata information does not comprise the data table reference field; determining a second data table field corresponding to the data table to be processed according to the system file; and determining the data table information according to the second data table field;

the information determining module further comprises a system file generating unit, which is used for estimating the real data volume of the reference data table according to the file compression rate and the file type so as to determine the estimated real data volume of the reference data table; acquiring a configuration file, and determining configuration parameters corresponding to the data table to be processed from the configuration file; and backfilling the configuration parameters according to the estimated real data volume to generate a system file corresponding to the data table to be processed.

5. An electronic device, comprising:

a processor; and

a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the data table connection processing method according to any one of claims 1 to 3.

6. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data table connection processing method according to any one of claims 1 to 3.