CN116932528A

CN116932528A - Data table splicing method, device, equipment and storage medium

Info

Publication number: CN116932528A
Application number: CN202210330901.1A
Authority: CN
Inventors: 曹路洋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2023-10-24

Abstract

The application discloses a splicing method, device and equipment of a data table and a storage medium, and belongs to the technical field of computers. The method comprises the following steps: in the processing process of stream data, a first data table and k second data tables are obtained; splitting the first data table according to p time periods to obtain p first sub-data tables; according to the p time periods, splicing the data belonging to the same object in the p first sub-data tables and the k second data tables to obtain p x k third data tables; and splicing the p x k third data tables. Splitting the first data table through p time periods, and then splicing the splitting result of the first data table with k second data tables according to the p time periods, so that the data tables can be spliced in a multithreading manner. Therefore, the scale of splicing calculation of the data table is reduced, and the problems of memory overflow, overtime and the like in the splicing process can be avoided. The success rate and the stability of the splicing calculation of the data table are improved.

Description

Data table splicing method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for splicing data tables.

Background

In fields such as machine learning, deep learning algorithm training, and data processing, a large amount of data (big data) stored in the form of a large number of data tables is generally involved, and the data tables need to be spliced before the big data is used.

In the related art, a Join operator in Apache Spark is generally adopted to perform a splicing calculation on at least two data tables, so that the at least two data tables are spliced. For example, the computer device performs a Left Join calculation on the first data table and the k second data tables, thereby obtaining an oversized wide table. Specifically, the computer device performs the splicing calculation on the first data table and the 1st second data table, then performs the splicing calculation on the result and the 2nd second data table, and repeats this process until the splicing calculation with the kth second data table is completed. The resulting data table can be used for subsequent machine learning and deep learning algorithm training or inference.

Splicing data tables in this manner generates large-scale calculations, which can cause problems such as memory overflow and timeout on the executor (Executor) nodes that execute the splicing task in the computer device, and can therefore cause the splicing calculation to fail.

Disclosure of Invention

The application provides a splicing method, a device, equipment and a storage medium of a data table, which can improve the success rate and the stability of splicing calculation of the data table. The technical scheme is as follows:

according to an aspect of the present application, there is provided a method for splicing data tables, the method comprising:

in the processing process of stream data, a first data table and k second data tables are acquired, wherein the first data table and the second data tables store data of the same object acquired according to a time dimension, and k is a positive integer;

splitting the first data table according to p time periods to obtain p first sub-data tables, wherein p is a positive integer greater than 1;

splicing the data belonging to the same object in the p first sub-data tables and the k second data tables according to the p time periods to obtain p x k third data tables;

and splicing the p x k third data tables.

According to another aspect of the present application, there is provided an apparatus for splicing data tables, the apparatus including:

the acquisition module is used for acquiring a first data table and k second data tables in the processing process of the streaming data, wherein the first data table and the second data tables store data of the same object acquired according to the time dimension, and k is a positive integer;

The splitting module is used for splitting the first data table according to p time periods to obtain p first sub data tables, wherein p is a positive integer greater than 1;

the splicing module is used for splicing the data belonging to the same object in the p first sub-data tables and the k second data tables according to the p time periods to obtain p x k third data tables;

and the splicing module is further used for splicing the p x k third data tables.

In an optional design, the splitting module is configured to split each second data table in the k second data tables according to the p time periods to obtain k second sub-data tables corresponding to each time period in the p time periods;

the splicing module is used for splicing the target first sub-data table corresponding to the ith period in the p periods and the k target second sub-data tables corresponding to the mth period in the p periods, i and m are positive integers not greater than p, and i and m are identical or different.

In an alternative design, the splicing module is configured to:

and respectively splicing each target first sub-data table corresponding to the ith time period in the p time periods and each target second sub-data table in the k target second sub-data tables corresponding to the mth time period in the p time periods.

In an alternative design, the apparatus further comprises:

the compression module is used for respectively carrying out compression calculation on the p first sub-data tables to obtain p first compressed data tables;

the compression module is further used for respectively performing compression calculation on the k second data tables to obtain k second compressed data tables;

and the splicing module is used for splicing the data belonging to the same object in the p first compression data tables and the k second compression data tables according to the p time periods.

In an alternative design, the compression module is configured to:

according to the data type of the data in the target first sub-data table corresponding to the ith period in the p periods, performing compression calculation on the target first sub-data table to obtain a first compression data table corresponding to the ith period in the p periods, wherein i is a positive integer not more than p;

and respectively performing compression calculation on the second data tables spliced with the target first sub data in the k second data tables by using the same compression calculation mode as the target first sub data table.

In an alternative design, the compression module is configured to:

under the condition that the data type is an integer type or a large integer type, processing the target first sub-data table by using an efficient compressed bitmap (Roaring bitmap) to obtain a first compressed data table corresponding to the ith period in the p periods;

or,

and under the condition that the data type is a character string type, using a bloom filter to process the target first sub-data table to obtain a first compressed data table corresponding to the ith period in the p periods.

In an alternative design, the splicing module is configured to:

determining a target connection operator from a plurality of connection operators according to the characteristics of the first target sub-data table and the k second target sub-data tables;

and splicing the first target sub-data table with the k second target sub-data tables by using the target connection operator.

In an alternative design, the apparatus further comprises:

the storage module is used for storing splicing check point data in the process of splicing the p first sub-data tables and the k second data tables;

the splicing check point data are used for reflecting current splicing progress records of the p first sub-data tables and the k second data tables, and the splicing progress records are used for recovering the current splicing progress of the p first sub-data tables and the k second data tables under the condition of splicing interruption.

In an alternative design, the apparatus further comprises:

the processing module is used for carrying out data distribution consistency processing on the third data table, and the data distribution consistency processing is used for carrying out data alignment and data sequencing of different data tables;

and the splicing module is used for splicing the processed third data table.

In an alternative design, the policy for processing the data distribution consistency includes at least one of:

data sorting;

data bucketing;

number of stored files.

In an alternative design, the components for performing data table stitching include an upstream component and a downstream component, where the upstream component is a component for processing to generate the third data table, and the downstream component is a component for stitching the third data table; the splicing module is used for:

transmitting metadata of the policy to a downstream component;

splicing the processed third data table through the downstream component;

wherein the metadata of the policy is used by the downstream component to read the processed third data table using the policy.

In an alternative design, the splicing module is configured to:

And in the k second sub-data tables corresponding to each of the p time periods, if the first sub-data table and the second sub-data table with the mapping relation exist, simultaneously splicing the first sub-data table and the second sub-data table with the mapping relation.

In an alternative design, the first data table is a sample table and the second data table is a feature table;

the sample table is used for storing classification of sample objects, and the feature table is used for storing features of the sample objects.

According to another aspect of the present application there is provided a computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set loaded and executed by the processor to implement a method of splicing a data table as described in the above aspect.

According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes or a set of instructions, the at least one instruction, the at least one program, the set of codes or the set of instructions being loaded and executed by a processor to implement a method of splicing a data table as described in the above aspect.

According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the method of splicing the data tables provided in various alternative implementations of the above aspects.

The technical scheme provided by the application has the beneficial effects that at least:

splitting the first data table through p time periods, and then splicing the splitting result of the first data table with k second data tables according to the p time periods, so that the data tables can be spliced in a multithreading manner. Therefore, the scale of splicing calculation of the data table is reduced, and the problems of memory overflow, overtime and the like in the splicing process can be avoided. The success rate and the stability of the splicing calculation of the data table are improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a process for concatenating a data table provided by an exemplary embodiment of the application;

FIG. 2 is a schematic diagram of a framework for implementing data table stitching provided in accordance with an exemplary embodiment of the present application;

FIG. 3 is a flow chart of a method for splicing data tables according to an exemplary embodiment of the present application;

FIG. 4 is a flow chart of a method for splicing data tables according to an exemplary embodiment of the present application;

FIG. 5 is a schematic diagram of a process for stitching a data table provided in accordance with an exemplary embodiment of the present application;

FIG. 6 is a diagram illustrating a compression and concatenation process for a data table according to an exemplary embodiment of the present application;

FIG. 7 is a schematic diagram of a data table processing and stitching process provided by an exemplary embodiment of the present application;

FIG. 8 is a schematic diagram of a process for stitching a data table provided in accordance with an exemplary embodiment of the present application;

FIG. 9 is a schematic diagram of a splicing apparatus for data tables according to an exemplary embodiment of the present application;

FIG. 10 is a schematic diagram of a splicing apparatus for data tables according to an exemplary embodiment of the present application;

FIG. 11 is a schematic diagram of a splicing apparatus for data tables according to an exemplary embodiment of the present application;

FIG. 12 is a schematic diagram of a splicing apparatus for data tables according to an exemplary embodiment of the present application;

FIG. 13 is a schematic structural view of a computer device according to an exemplary embodiment of the present application.

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.

First, the terms related to the present application will be described:

Spark: Spark, also known as Apache Spark, is a fast and general-purpose computing engine designed for large-scale data processing.

Hive (a data warehouse tool): hive is a Hadoop (a distributed system infrastructure) based data warehouse tool for data extraction, transformation, and loading. Hive provides a mechanism by which large-scale data stored in Hadoop-based systems can be stored, queried, and analyzed.

Join (connect): Join in structured query language (Structured Query Language, SQL) is used to query data from two or more data tables according to the relationships between columns in those tables. Columns in a data table are fields, and rows are the values of the fields. When a complete query result has to be obtained from two or more data tables, a Join needs to be performed.

Checkpoint: used to store the metadata needed to resume a run from a breakpoint and to store cached data records, so that the checkpoint can be used for recovery when a process becomes abnormal.

Left Join (Left Join): during data analysis and processing, Join (Spark Join) operations are often used to associate/splice data tables. For example, a Join operation takes two data tables A and B as input, compares each record (piece of data) in data table A with each record in data table B, and returns a new record each time a record satisfying the condition is found. The fields in the new record may come only from data table A, only from data table B, or partly from each of data tables A and B. Thus, the data table obtained after the Join operation can represent a combination of the records in the two data tables. Left Join is one kind of outer join (Outer Join) in Spark Join; outer joins also include Right Join. Left Join joins table 2 to table 1, the left table of the two data tables. That is, the left table 1 is taken as the basis and the data of table 2 is associated with it, so the output includes all data of the left table together with the data of the right table that intersects the left table. Right Join takes the right table as the basis. Spark Join also includes the inner join (Inner Join), which takes the intersection of the two tables as the basis: the output is the part of the two tables where an intersection exists, and the rest is not associated.

Shuffle (a process): Shuffle in Spark describes the process from the output of the Map tasks to the input of the Reduce tasks. Shuffle is the bridge connecting Map and Reduce: the data output by the Map stage passes through the Shuffle to the corresponding Reduce, and the Reduce stage is responsible for pulling data from the Map side and performing the computation.

Executor: is the execution unit of a task in Spark, which is actually a collection of computing resources (including processor cores and memory).

Roaring bitmap: an evolution of the bitmap, i.e., a compressed bitmap. Each bit in a bitmap stores one state, which suits cases where the data is large in scale but the number of possible states is small. It is typically used to determine whether a piece of data exists.

Bloom filter: a very space-efficient probabilistic data structure that uses a bit array to represent a set very compactly and can determine whether an element belongs to the set.
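As a rough illustration of the Bloom filter described above, the following Python sketch uses a bit array and several bit positions derived from SHA-256. The class name `BloomFilter`, its parameters, and the hashing scheme are assumptions made for this example and are not taken from the application.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch (hypothetical helper, not from the application)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bloom = BloomFilter()
for user_id in ["user_a", "user_b", "user_c"]:
    bloom.add(user_id)
```

A membership test of this kind can return a false positive but never a false negative, so such a filter can only be relied on to discard elements that are certainly not in the set.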

Introducing the splicing of the data table:

In fields such as machine learning, deep learning algorithm training, and data processing, a large amount of data (big data) stored in the form of a large number of data tables is generally involved, and the data tables need to be spliced before the big data is used. In the related art, the data can be spliced using Spark Join. Specifically, the computer device performs the Left Join splicing calculation on the first data table and second data tables 1 to k, and the result is stored in an ultra-large wide table (a table with many columns), so that it can be used for subsequent machine learning and deep learning algorithm training or inference.

Illustratively, FIG. 1 is a schematic diagram of a process for splicing data tables according to an exemplary embodiment of the present application. As shown in FIG. 1, the computer device splices a first data table 101 with k second data tables 102. In this process, the computer device performs the splicing calculation on the first data table 101 and the 1st second data table 102, then performs the splicing calculation on the result and the 2nd second data table 102, and repeats this process until the splicing calculation with the kth second data table 102 is completed, thereby obtaining a splicing result 103, which is usually an oversized wide table. The splicing calculation uses the operator that implements the Left Join.
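The related-art loop of FIG. 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: tables are lists of dictionaries, the key column `id`, the helper `left_join`, and the sample values are hypothetical, and an unmatched left row simply lacks the right-hand columns rather than carrying nulls as Spark would.

```python
from typing import Dict, List

Row = Dict[str, object]


def left_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Left join: keep every left row, add right columns where the key matches."""
    index: Dict[object, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        matches = index.get(row[key])
        if not matches:
            joined.append(dict(row))  # no match: the left row is kept unchanged
            continue
        for match in matches:
            joined.append({**row, **match})
    return joined


first_table = [{"id": 1, "label": 0}, {"id": 2, "label": 1}]
second_tables = [
    [{"id": 1, "f1": 0.5}],
    [{"id": 1, "f2": 7}, {"id": 2, "f2": 9}],
]

# Related-art loop: join the running result with each second table in turn.
wide_table = first_table
for second in second_tables:
    wide_table = left_join(wide_table, second, "id")
```

Because every iteration joins against the full, ever-widening intermediate result, the size of each join grows with both the number of rows and the number of second data tables.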

The above-described scheme may cause a series of problems during the splicing calculation. First, for an ultra-high-dimensional dense data structure (the second data tables), the excessive number of columns and rows makes it difficult for distributed Spark to support the operation. The main reason is that the metadata used for allocating tasks is difficult to manage, which causes the entire computing task to fail or time out. Second, the tens to hundreds of data tables in Hive hold a huge volume of data, so the Shuffle and Sort steps of the splicing calculation consume a huge amount of computing power, network transmission resources, and the like, which easily causes problems such as memory overflow and timeout on the executor (Executor) nodes that execute the splicing task, and thus causes the splicing calculation to fail. Third, since the data tables in Hive are stored on HDFS (a distributed file system), one table may have thousands of data blocks (Blocks); a query may fail to acquire a block because of some unstable factor, and an anomaly in a single block may cause the entire task to fail, which harms the stability of the overall splicing calculation. Fourth, the data calculation consumes a lot of resources and the overall splicing calculation is slow, which may affect the iteration efficiency of the model algorithm; for example, the splicing may take far longer than the training of the model algorithm.

Compared with the original Spark Join approach, the method provided by the embodiments of the present application provides a new framework oriented to general scenarios. The whole data splicing process improves efficiency and operational stability by means such as filter-based acceleration and caching of checkpoint data, and at the same time allows data splicing to support larger-scale data. By solving the above problems, the Spark/Hive-based data splicing framework can be applied to real service scenarios, and efficient and stable spliced data is provided for the algorithms in these scenarios.

FIG. 2 is a schematic diagram of a framework for implementing data table stitching in accordance with an exemplary embodiment of the present application. As shown in fig. 2, in the process of processing streaming data, the computer device performs concatenation of a first data table and k second data tables, where the first data table and the second data table store data of the same object acquired according to a time dimension. In the data layer 201, the computer device performs data normalization processing on the first data table and the second data table. And splitting the first data table into p first sub-data tables according to p time periods, and splitting each second data table in the k second data tables according to p time periods to obtain k second sub-data tables corresponding to each time period in the p time periods. And generating a splicing task for a target first sub-data table corresponding to the ith period in the p periods and k target second sub-data tables corresponding to the mth period in the p periods, so as to realize the generation of a multithreading task. Wherein i and m are positive integers not greater than p, and i and m are the same or different.

In the execution layer 202, the computer device performs compression calculation on the target first sub-data table according to the data type of the data in the target first sub-data table, to obtain a first compressed data table. The computer device then performs compression calculation on each of the k target second sub-data tables using the same compression calculation mode as for the target first sub-data table, to obtain k second compressed data tables. A target connection operator is adaptively determined from a plurality of connection operators according to the characteristics of the first compressed data table and the k second compressed data tables. The first compressed data table is spliced with each of the k second compressed data tables using the target connection operator, so that p x k third data tables are obtained; in this way the data tables are spliced using a connection operator based on a filter (compression).
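One possible reading of the filter-based step for integer keys is sketched below. It is an assumption of this example, not a statement of the application's implementation, that the compressed representation is used to discard rows of a second sub-data table whose keys cannot match before the join runs. A plain, uncompressed bitmap held in a Python integer stands in for a Roaring bitmap, and the names `build_bitmap`, `prefilter`, and the column `id` are hypothetical.

```python
from typing import Dict, Iterable, List

Row = Dict[str, int]


def build_bitmap(keys: Iterable[int]) -> int:
    """Set bit k for every non-negative integer key k (plain bitmap, not Roaring)."""
    bitmap = 0
    for k in keys:
        bitmap |= 1 << k
    return bitmap


def prefilter(rows: List[Row], key: str, bitmap: int) -> List[Row]:
    """Keep only rows whose key bit is set in the bitmap."""
    return [row for row in rows if (bitmap >> row[key]) & 1]


first_sub_table = [{"id": 3, "label": 1}, {"id": 8, "label": 0}]
second_sub_table = [
    {"id": 1, "f": 10},
    {"id": 3, "f": 20},
    {"id": 8, "f": 30},
    {"id": 9, "f": 40},
]

# Keys present in the first sub-table decide which second-table rows survive.
bitmap = build_bitmap(row["id"] for row in first_sub_table)
candidates = prefilter(second_sub_table, "id", bitmap)
```

Only the rows in `candidates` would then need to take part in the join, which is the sense in which a filter can reduce the scale of the splicing calculation.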

In the storage layer 203, the computer device performs processing of data distribution consistency on the third data table, and stores metadata of policies of the processing of data distribution consistency. Optionally, the processing of data distribution consistency is used to perform data alignment and data ordering of different data tables. And then, during splicing, reading the processed third data table by using the strategy indicated by the metadata, and splicing the processed third data table, so that a final splicing result is obtained.
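The idea of processing a third data table with a distribution policy and handing the policy's metadata to the reader can be sketched as follows. The bucketing rule (key modulo the bucket count), the metadata field names, and the helper `bucket_and_sort` are assumptions chosen for illustration only.

```python
from typing import Dict, List

Row = Dict[str, int]


def bucket_and_sort(rows: List[Row], key: str, num_buckets: int):
    """Bucket rows by key modulo num_buckets and sort each bucket by key."""
    buckets: List[List[Row]] = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda r: r[key])
    # Metadata describing the policy, to be handed to the downstream reader.
    metadata = {"bucket_key": key, "num_buckets": num_buckets, "sorted_by": key}
    return buckets, metadata


third_table = [
    {"id": 5, "f": 1},
    {"id": 2, "f": 2},
    {"id": 4, "f": 3},
    {"id": 1, "f": 4},
]
buckets, metadata = bucket_and_sort(third_table, "id", num_buckets=2)
```

If every third data table is written with the same bucket count and sort key, rows for the same object land in the same bucket position across tables, so a downstream step that knows the metadata can splice bucket by bucket.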

Splitting the first data table through p time periods, and then splicing the splitting result of the first data table with k second data tables according to the p time periods, so that the data tables can be spliced in a multithreading manner. Therefore, the scale of splicing calculation of the data table is reduced, and the problems of memory overflow, overtime and the like in the splicing process can be avoided. The success rate and the stability of splicing calculation are improved. The framework provided by the embodiment of the application reduces the development cost of algorithm engineers. In addition, due to the uniformity and standardization of the framework, a certain accuracy guarantee is provided for the calculation result of the data splicing.

Fig. 3 is a flowchart of a method for splicing data tables according to an exemplary embodiment of the present application. The method may be used with a computer device or a client on a computer device. As shown in fig. 3, the method includes:

step 302: in the processing process of the streaming data, a first data table and k second data tables are acquired.

The first data table and the second data table store data of the same object acquired according to the time dimension, and k is a positive integer. Optionally, the fields related to the data stored in the first data table and the second data table include financial fields, social fields, online shopping fields, instant messaging fields, game fields, video fields, mobile payment fields, and the like. The first data table is a sample table, and the second data table is a feature table. The sample table is used for storing the classification of the sample object, and the feature table is used for storing the features of the sample object.

Illustratively, the first data table and the second data table store data of the financial domain. Wherein, the first data table stores data reflecting whether the sample object has fraudulent activity. The second data table stores data of the sample object such as consumption, loan and the like.

Optionally, the computer device obtains the first data table and the second data table through locally stored data. Alternatively, the computer device obtains the first data table and the second data table through other computer devices. Alternatively, the computer device obtains the first data table and the second data table via locally stored data and other computer devices.

Step 304: and splitting the first data table according to p time periods to obtain p first sub-data tables.

p is a positive integer greater than 1. The periods are set in the computer device, for example, manually. Illustratively, the periods are set by date, for example one period per day or one period per week. Optionally, each piece of data in the first data table corresponds to an acquisition time, and the computer device splits the first data table by the p time periods according to the acquisition time, so as to obtain the p first sub-data tables.

For example, the first data table includes 10 pieces of data. When dividing the first data table by 5 periods, the computer device determines that the 1st, 2nd, and 3rd pieces of data belong to period 1, the 4th and 5th pieces of data belong to period 2, the 6th and 7th pieces of data belong to period 3, the 8th piece of data belongs to period 4, and the 9th and 10th pieces of data belong to period 5. The splitting result is that first sub-data table 1 comprises the 1st, 2nd, and 3rd pieces of data, first sub-data table 2 comprises the 4th and 5th pieces of data, first sub-data table 3 comprises the 6th and 7th pieces of data, first sub-data table 4 comprises the 8th piece of data, and first sub-data table 5 comprises the 9th and 10th pieces of data.
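The 10-row example above can be reproduced with a short Python sketch. The representation of each row as a (row number, period) pair and the helper name `split_by_period` are assumptions of this example; in practice the period would be derived from each row's acquisition time.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (row number, period) pairs mirroring the 10-row example above.
first_table: List[Tuple[int, int]] = [
    (1, 1), (2, 1), (3, 1), (4, 2), (5, 2),
    (6, 3), (7, 3), (8, 4), (9, 5), (10, 5),
]


def split_by_period(rows: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Group row numbers into one first sub-data table per period."""
    sub_tables: Dict[int, List[int]] = defaultdict(list)
    for row_number, period in rows:
        sub_tables[period].append(row_number)
    return dict(sub_tables)


first_sub_tables = split_by_period(first_table)
```

Here `first_sub_tables` maps each of the 5 periods to the row numbers of the corresponding first sub-data table.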

Step 306: and according to the p time periods, splicing the data belonging to the same object in the p first sub-data tables and the k second data tables to obtain p x k third data tables.

The computer equipment respectively splices the first sub-data table of the ith period in the p periods with the k second data tables, so that k third data tables corresponding to the ith period can be obtained. According to the method, the p first sub-data tables and the k second data tables are spliced, so that k third data tables corresponding to each of the p time periods can be obtained, and p x k third data tables are obtained.

Optionally, the computer device further splits each second data table in the k second data tables according to p time periods to obtain k second sub-data tables corresponding to each time period in the p time periods. And then splicing the first sub-data table corresponding to the ith period in the p periods and the k second sub-data tables corresponding to the mth period in the p periods, so as to obtain p x k third data tables. Wherein i and m are positive integers not greater than p, and i and m are the same or different. Optionally, the process of splicing the first data table and the second data table by the computer device is implemented by a Join operator.

Step 308: and splicing the p x k third data tables.

After obtaining the p×k third data tables, the computer device splices the p×k third data tables, so as to finish splicing the first data table and the k second data tables. Optionally, the process of splicing the third data table by the computer device is implemented by a Join operator.

Optionally, the method provided by the embodiments of the present application can be applied to a distributed system built from a computing engine for large-scale data processing and a distributed data storage system, for example, a distributed system running the Spark/Hive-based data splicing framework. Specifically, the method is executed by nodes in the system. The distributed system comprises a plurality of nodes, each node comprising at least one server (computer device), and the nodes are connected wirelessly or by wire. Different nodes cooperate to execute the method, so that the data tables are spliced. For example, in the process of splicing the first data table and the k second data tables in the above manner, p x k splicing tasks for obtaining the third data tables are generated, and the p x k splicing tasks are distributed to different nodes for execution, so as to obtain the p x k third data tables. The p x k third data tables are then gathered and spliced, thereby obtaining the final splicing result.
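The generation and parallel execution of the p x k splicing tasks can be sketched in a single process, with worker threads standing in for the nodes of the distributed system. The helper `join_task`, the key column `id`, the assumption that each second data table holds at most one row per `id`, and the final merge by `id` are all choices made for this illustration; the application states only that the final splicing may be implemented by a Join operator.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def join_task(first_sub, second):
    """One of the p x k tasks: left-join a first sub-table with one second table."""
    lookup = {row["id"]: row for row in second}  # assumes one row per id
    return [{**row, **lookup.get(row["id"], {})} for row in first_sub]


first_sub_tables = [  # p = 2 periods
    [{"id": 1, "label": 0}],
    [{"id": 2, "label": 1}],
]
second_tables = [  # k = 2 feature tables
    [{"id": 1, "f1": 0.1}, {"id": 2, "f1": 0.2}],
    [{"id": 1, "f2": 5}, {"id": 2, "f2": 6}],
]

# One (period index, second-table index) pair per splicing task.
pairs = list(product(range(len(first_sub_tables)), range(len(second_tables))))
with ThreadPoolExecutor(max_workers=4) as pool:
    third_tables = list(
        pool.map(lambda ij: join_task(first_sub_tables[ij[0]], second_tables[ij[1]]), pairs)
    )

# Final splice: merge the partial rows of each object by id.
merged = {}
for table in third_tables:
    for row in table:
        merged.setdefault(row["id"], {}).update(row)
final_result = [merged[key] for key in sorted(merged)]
```

Each task touches only one first sub-data table and one second data table, so a failed task can be retried on its own without repeating the other p x k - 1 tasks.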

In summary, according to the method provided by the embodiment, the first data table is split through p time periods, and then the splitting result of the first data table is spliced with k second data tables according to the p time periods, so that the data table can be spliced in a multithreading manner. Therefore, the scale of splicing calculation of the data table is reduced, and the problems of memory overflow, overtime and the like in the splicing process can be avoided. The success rate and the stability of the splicing calculation of the data table are improved.

Fig. 4 is a flowchart of a method for splicing data tables according to an exemplary embodiment of the present application. The method may be used with a computer device or a client on a computer device. As shown in fig. 4, the method includes:

Step 402: In the course of processing streaming data, acquire a first data table and k second data tables.

The first data table and the second data table store data of the same object acquired according to the time dimension, and k is a positive integer. Optionally, the first data table is a sample table and the second data table is a feature table. The sample table is used for storing the classification of the sample object, and the feature table is used for storing the features of the sample object.

Step 404: Split the first data table according to p time periods to obtain p first sub-data tables.

Step 406: Split each of the k second data tables according to the p time periods to obtain, for each of the p time periods, k corresponding second sub-data tables.

In the process of splitting the second data tables, the computer device splits the nth second data table according to the p time periods to obtain p second sub-data tables corresponding to the nth second data table, one per time period, where n is a positive integer not greater than k. By splitting each second data table in this manner, k second sub-data tables corresponding to each of the p time periods are obtained.
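The splitting in steps 404 and 406 can be sketched as follows. This is a minimal sketch under stated assumptions: rows are plain dictionaries, the timestamp column `ts`, the half-open period boundaries and the sample rows are all hypothetical and are not specified by the embodiment.

```python
def split_by_period(rows, periods):
    """Split rows into one sub-table per (start, end) period; end is exclusive."""
    sub_tables = [[] for _ in periods]
    for row in rows:
        for idx, (start, end) in enumerate(periods):
            if start <= row["ts"] < end:
                sub_tables[idx].append(row)
                break
    return sub_tables


periods = [(0, 10), (10, 20)]  # p = 2 hypothetical time periods
first_table = [{"id": 1, "ts": 3, "label": 0}, {"id": 2, "ts": 12, "label": 1}]
second_tables = [  # k = 2 hypothetical second data tables
    [{"id": 1, "ts": 4, "age": 30}, {"id": 2, "ts": 15, "age": 41}],
    [{"id": 1, "ts": 5, "city": "A"}, {"id": 2, "ts": 11, "city": "B"}],
]

# Step 404: p first sub-data tables.
first_subs = split_by_period(first_table, periods)

# Step 406: each second data table yields p sub-tables (k x p in total) ...
second_subs = [split_by_period(t, periods) for t in second_tables]

# ... regrouped so that every period maps to its k second sub-data tables.
by_period = [
    [second_subs[n][j] for n in range(len(second_tables))]
    for j in range(len(periods))
]
```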

Step 408: Splice the target first sub-data table corresponding to the ith period of the p periods with the k target second sub-data tables corresponding to the mth period of the p periods, to obtain p×k third data tables.

Here i and m are positive integers not greater than p, and i and m may be the same or different. When the first sub-data tables are spliced with the second sub-data tables, the splicing relationship between the p first sub-data tables and the k second sub-data tables corresponding to each of the p time periods is determined by the computer device, for example set manually on the computer device. That is, i and m are determined by the computer device; i may or may not be equal to m. Where no mapping exists between i and m, the computer device needs to determine a specific rule for the splice.

Optionally, in the process of splicing, the computer device splices the target first sub-data table corresponding to the ith period of the p periods with each of the k target second sub-data tables corresponding to the mth period of the p periods. For example, the computer device splices the target first sub-data table with the 1st of the k target second sub-data tables corresponding to the mth period, then with the 2nd, and so on, until it has spliced the target first sub-data table with the kth of the k target second sub-data tables. At that point, the splicing calculation between the target first sub-data table and the k target second sub-data tables corresponding to the mth period is considered complete.
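The one-by-one splicing described above can be sketched as follows. This is only an illustrative sketch: it assumes i equals m for every period, uses a simple hash-based inner join on a hypothetical key `id`, and the sample rows are invented for the example.

```python
def hash_join(left, right, key="id"):
    """Inner join two lists of dict rows on `key` using a hash index."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    joined = []
    for l in left:
        for r in index.get(l[key], []):
            # Columns of both rows are merged; the left row wins on shared names.
            joined.append({**r, **l})
    return joined


def splice_period(target_first_sub, target_second_subs):
    """Splice one first sub-data table with each of its k second sub-data tables."""
    return [hash_join(target_first_sub, s) for s in target_second_subs]


# p = 2 periods, k = 2 second sub-data tables per period (hypothetical rows).
first_subs = [[{"id": 1, "label": 0}], [{"id": 2, "label": 1}]]
second_by_period = [
    [[{"id": 1, "age": 30}], [{"id": 1, "city": "A"}]],
    [[{"id": 2, "age": 41}], [{"id": 3, "city": "B"}]],
]

# Assumption: i == m, i.e. each period is spliced with its own second sub-tables.
third_tables = [
    table
    for i in range(len(first_subs))
    for table in splice_period(first_subs[i], second_by_period[i])
]
```

With p = 2 and k = 2 the sketch yields four third data tables, one of which is empty because no object is shared between the two inputs.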

Optionally, in the process of splicing the p first sub-data tables with the k second data tables, the computer device may also store splicing checkpoint data. The splicing checkpoint data reflects the current splicing progress record of the p first sub-data tables and the k second data tables, and the splicing progress record is used to restore the current splicing progress of the p first sub-data tables and the k second data tables if splicing is interrupted. When the target first sub-data table is spliced with the k target second sub-data tables in the above manner, the computer device stores the splicing checkpoint data of the target first sub-data table and the k target second sub-data tables. For example, it stores splicing checkpoint data for the 3rd splicing task between the target first sub-data table and the k target second sub-data tables.
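One possible way to keep such a splicing progress record is sketched below. The embodiment does not specify the checkpoint format; the JSON file, the task identifiers `(i, n)` and the `fail_at` parameter used to simulate an interruption are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of task ids already recorded as finished."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {tuple(t) for t in json.load(f)}


def save_checkpoint(path, done):
    """Write the progress record atomically (write to temp file, then rename)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def run_with_checkpoint(tasks, work, path, fail_at=None):
    """Run tasks in order, skipping those already recorded in the checkpoint."""
    done = load_checkpoint(path)
    executed = []
    for task in tasks:
        if task in done:
            continue
        if task == fail_at:  # hypothetical hook that simulates an interruption
            raise RuntimeError("simulated interruption")
        work(task)
        executed.append(task)
        done.add(task)
        save_checkpoint(path, done)
    return executed


tasks = [(i, n) for i in range(2) for n in range(3)]  # p = 2, k = 3
path = os.path.join(tempfile.mkdtemp(), "splice_checkpoint.json")

try:
    run_with_checkpoint(tasks, lambda t: None, path, fail_at=(1, 0))
except RuntimeError:
    pass  # the first three tasks are already recorded

# Restarting resumes from the recorded progress instead of starting over.
resumed = run_with_checkpoint(tasks, lambda t: None, path)
```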

Fig. 5 is a schematic diagram illustrating a splicing process of a data table according to an exemplary embodiment of the present application. As shown in fig. 5, the computer device splits the first data table 501 according to p time periods to obtain p first sub-data tables. The computer device splits the k second data tables 502 according to the p time periods to obtain k second sub-data tables corresponding to each of the p time periods. The computer device then generates multi-threaded tasks, each of which splices the target first sub-data table corresponding to the ith period of the p periods with the k target second sub-data tables corresponding to the mth period of the p periods; the number of such tasks is p. By applying divide and conquer to the splicing tasks, the computer device can make full use of computing resources for concurrent execution, which avoids the resource waste of multiple nodes waiting while tasks are queued. Under the divide-and-conquer framework, data storage is also improved: storage is divided vertically (by column-type field) and horizontally (by time dimension), so that full-scale expansion can be supported. During the splicing process, the computer device also stores the splicing checkpoint data. Through the above splicing, p×k third data tables are obtained. A third data table may also be referred to as a Mini Table.

Compression filtering for splicing process:

In the process of splicing the first data table and the k second data tables, the computer device also performs compression calculation on the data tables. When very-large-scale data tables are spliced, speed becomes a problem: the data volume in the first data table and the second data tables is large, so the time taken by the splicing calculation grows by a linear multiple, which severely affects splicing efficiency. The main reasons for the inefficiency are that the amount of calculation is large and that most of the data involved is irrelevant. For example, the first data table and the second data tables contain data that has no association relationship for the same object, such as data of completely different data types. Performing compression calculation on the data tables reduces the scale of the splicing calculation and thereby improves splicing efficiency. For example, compression calculation is performed on the p first sub-data tables respectively to obtain p first compressed data tables, and on the k second data tables respectively to obtain k second compressed data tables. Then, according to the p time periods, the data belonging to the same object in the p first compressed data tables and the k second compressed data tables is spliced. When the target first sub-data table is spliced with the k target second sub-data tables in the above manner, the computer device performs compression calculation on the target first sub-data table and the k target second sub-data tables in the same way.

Optionally, the computer device performs compression calculation on the target first sub-data table corresponding to the ith period of the p periods according to the data type of the data in that table, so as to obtain a first compressed data table corresponding to the ith period of the p periods. Using the same compression calculation manner as for the target first sub-data table, the computer device performs compression calculation on each of the second data tables, among the k second data tables, that are to be spliced with the target first sub-data table. For example, the computer device performs compression calculation on the target first sub-data table according to the data type of its data, and then applies the same compression calculation manner to each of the k target second sub-data tables.

Optionally, in the case that the data type is an integer type (Int) or a large integer type (BigInt), the computer device processes the target first sub-data table using a Roaring Bitmap (an efficiently compressed bitmap) to obtain a first compressed data table corresponding to the ith period of the p periods. Alternatively, in the case that the data type is a String type, the computer device processes the target first sub-data table using a Bloom Filter to obtain a first compressed data table corresponding to the ith period of the p periods. For example, when the data is used for model inference and the user list data, such as user accounts, is of the BigInt type, the computer device may select Filter Join (the Roaring Bitmap processing manner) for the compression calculation. In an exemplary process of splicing the data tables, the computer device compresses the data table according to the data format of its data, dynamically obtains the filter data, pushes the filter data down to the data source (Data Source) side, and then performs the splicing calculation. Selecting different processing manners for different data types allows the data to be processed flexibly. Processing the target first sub-data table and the target second sub-data tables in the same manner ensures that the data corresponds. During execution, the computer device reuses the processing manner and the processing result to further improve efficiency.
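The type-dependent choice of filter can be sketched as follows. This is a simplified sketch rather than the embodiment's implementation: a plain Python `set` stands in for the Roaring Bitmap used for integer keys, the small `BloomFilter` class is written only for illustration, and the key column `uid` and sample rows are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_key_filter(rows, key):
    """Pick the filter structure from the key's data type."""
    keys = [r[key] for r in rows]
    if all(isinstance(v, int) for v in keys):
        return set(keys)  # assumption: stand-in for a Roaring Bitmap
    bloom = BloomFilter()
    for v in keys:
        bloom.add(v)
    return bloom


def prefilter(rows, key, key_filter):
    """Drop rows whose key cannot match before the splicing calculation."""
    return [r for r in rows if r[key] in key_filter]


first_sub = [{"uid": 7, "label": 1}]
second_sub = [{"uid": 7, "f": 0.5}, {"uid": 9, "f": 0.1}]
filtered = prefilter(second_sub, "uid", build_key_filter(first_sub, "uid"))

string_filter = build_key_filter([{"uid": "u1"}], "uid")
```

Because a Bloom filter can report false positives, the filtered second sub-data table may still contain a few non-matching rows; the subsequent join removes them.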

Selection of a stitching operator:

In the process of splicing the target first sub-data table and the k target second sub-data tables, the computer device determines a target connection operator from a plurality of connection operators according to the features of the target first sub-data table and the k target second sub-data tables, and splices the target first sub-data table with the k target second sub-data tables using the target connection operator. Optionally, the features include the splicing scenario, the data size, the collision rate, the primary key type of the data, the filtering efficiency and the like, and the computer device selects different Join operators according to different features. Illustratively, a sort-merge Join operator is selected for a scenario in which the filtering efficiency is below 85%.
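The operator selection can be illustrated with the following sketch. Only the rule that sort-merge Join is chosen below 85% filtering efficiency comes from the description; falling back to a hash join otherwise is an assumption made for this example, as are the function names, the key `id` and the sample rows.

```python
def sort_merge_join(left, right, key="id"):
    """Inner join by sorting both sides on `key` and merging them."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left row with every right row sharing its key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                out.append({**right[j_end], **left[i]})
                j_end += 1
            i += 1
    return out


def hash_join(left, right, key="id"):
    """Inner join using a hash index built on the right side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**r, **l} for l in left for r in index.get(l[key], [])]


def choose_join_operator(filter_efficiency, threshold=0.85):
    """Sort-merge below the threshold; the hash-join fallback is an assumption."""
    return sort_merge_join if filter_efficiency < threshold else hash_join


left = [{"id": 2, "label": 1}, {"id": 1, "label": 0}]
right = [{"id": 1, "age": 30}, {"id": 2, "age": 41}, {"id": 3, "age": 5}]

operator = choose_join_operator(filter_efficiency=0.5)
spliced = operator(left, right)
```

Both operators return the same rows; they differ only in how the match is computed, which is why the choice can be made adaptively per task.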

Illustratively, fig. 6 is a schematic diagram of a compression and splicing process for a data table provided by an exemplary embodiment of the present application. As shown in fig. 6, in the process of splicing the p first sub-data tables 601 with the k second sub-data tables 602 corresponding to each of the p periods, the computer device determines the compression calculation manner of the target first sub-data table, performs compression calculation on the target first sub-data table, and performs compression calculation in the same manner on the target second sub-data table to be spliced with it. The computer device then performs the splicing calculation on the first compressed data table 603 and the second compressed data table 604, adaptively selecting a Join operator in the process.

Step 410: Splice the p×k third data tables.

Optionally, in the process of splicing the third data table, the computer device performs a data distribution consistency process on the third data table, where the data distribution consistency process is used for performing data alignment and data ordering of different data tables. And then splicing the processed third data table. The policy for processing data distribution consistency includes at least one of:

ordering of data;

data bucketing;

recording the number of stored files.

Optionally, the components for performing data table splicing include an upstream component and a downstream component, where the upstream component is the component that generates the third data tables and the downstream component is the component that splices the third data tables. The computer device transmits the metadata of the policy across components to the downstream component, and the processed third data tables are spliced by the downstream component. The metadata of the policy is used by the downstream component to read the processed third data tables according to the policy. This manner may be referred to as Bucket Join. Optionally, the computer device can splice the third data tables through a hash Join, so that a final result, namely a single table, is obtained.

The reason for the above processing is that the downstream component otherwise needs to spend a lot of time processing the result tables generated by the upstream component, and the main cost lies in ensuring that the partition data distribution is consistent among tables. Preprocessing the data tables avoids repeated execution by downstream components. Optionally, the computer device may store the metadata of the processing policy in HDFS. When the downstream component is used, it reads the data tables according to the policy so as to execute the Bucket Join. The larger the amount of data processed, the more significant the improvement in processing efficiency.
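A minimal sketch of the upstream and downstream roles is given below. It is illustrative only: the metadata fields, the use of `key % num_buckets` as the bucketing function, the JSON string standing in for metadata stored in HDFS, and the sample rows are all assumptions.

```python
import json


def write_bucketed(rows, key, num_buckets):
    """Upstream side: bucket and sort rows, and describe the layout as metadata."""
    buckets = [[] for _ in range(num_buckets)]
    for r in rows:
        buckets[r[key] % num_buckets].append(r)  # assumes integer keys
    for b in buckets:
        b.sort(key=lambda r: r[key])
    metadata = json.dumps({"key": key, "num_buckets": num_buckets, "sorted": True})
    return buckets, metadata


def bucket_join(left_buckets, left_meta, right_buckets, right_meta):
    """Downstream side: join bucket by bucket when both layouts agree."""
    lm, rm = json.loads(left_meta), json.loads(right_meta)
    if (lm["key"], lm["num_buckets"]) != (rm["key"], rm["num_buckets"]):
        raise ValueError("bucket layouts differ; the tables must be redistributed")
    key = lm["key"]
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        index = {r[key]: r for r in rb}  # only one bucket is held at a time
        out.extend({**index[l[key]], **l} for l in lb if l[key] in index)
    return out


third_a = [
    {"id": 1, "label": 0, "age": 30},
    {"id": 2, "label": 1, "age": 41},
    {"id": 3, "label": 0, "age": 25},
]
third_b = [{"id": 3, "city": "C"}, {"id": 1, "city": "A"}, {"id": 2, "city": "B"}]

a_buckets, a_meta = write_bucketed(third_a, "id", num_buckets=2)
b_buckets, b_meta = write_bucketed(third_b, "id", num_buckets=2)
result = bucket_join(a_buckets, a_meta, b_buckets, b_meta)
```

Because both tables share the same bucketing key and bucket count, matching rows are guaranteed to sit in buckets with the same index, so the downstream component does not need to re-sort or redistribute the data before joining.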

Fig. 7 is a schematic diagram illustrating a data table processing and splicing process according to an exemplary embodiment of the present application. As shown in fig. 7, the computer device processes the first data table and the k second data tables through the upstream component 701 to obtain the third data tables, performs data distribution consistency processing on the third data tables, and transmits the metadata of the processing policy to the downstream component 702. The downstream component 702 reads the third data tables according to the processing policy and splices them in a Bucket Join manner, so as to obtain the final splicing result.

In addition, where a first sub-data table and second sub-data tables having a mapping relationship exist among the p first sub-data tables and the k second sub-data tables corresponding to each of the p time periods, the computer device may splice the first sub-data table and the second sub-data tables having the mapping relationship at the same time. Optionally, the mapping relationship includes a correspondence in the time dimension, a correspondence in data type, and the like. The reason is that a splicing manner without a mapping relationship generates p×k splicing tasks, which incurs a large amount of task scheduling time as well as repeated input/output (I/O) consumption from reading the data tables repeatedly. Establishing a mapping relationship between the first sub-data table and the second sub-data tables allows the data tables to be merged and spliced together. For example, a mapping relationship in the time dimension is constructed from the meta information of the first sub-data table and the second sub-data tables, and the data tables having the mapping relationship are spliced, that is, a conditional join is executed. This reduces the number of splicing tasks and avoids excessive data I/O consumption.

Fig. 8 is a schematic diagram illustrating a splicing process of a data table according to an exemplary embodiment of the present application. As shown in fig. 8, there is a mapping relationship between the sample time 801 of the first sub data table and the characteristic time 802 of the second sub data table. According to the mapping relation, the computer equipment simultaneously splices the first sub-data table with the sample time 801 of 7 and the second sub-data table with the characteristic time 802 of 3, 4 and 5. The first sub-data table with sample time 801 of 8 and the second sub-data tables with characteristic times 802 of 4, 6 and 7 are spliced at the same time. And simultaneously splicing the first sub-data table with the sample time 801 of 9 and the second sub-data tables with the characteristic times 802 of 6, 7 and 8.
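The mapping-based splicing of fig. 8 can be sketched as follows. The mapping values are taken from the figure description; the row layout, the key `id`, the `column@time` naming used to keep features from different feature times apart, and the sample rows are assumptions made for this illustration.

```python
# Sample time -> feature times, as described for fig. 8.
mapping = {7: [3, 4, 5], 8: [4, 6, 7], 9: [6, 7, 8]}


def conditional_join(first_subs, second_subs, mapping, key="id"):
    """Run one splicing task per sample time over all mapped feature times."""
    results = {}
    for sample_time, feature_times in mapping.items():
        # Read each mapped second sub-data table once and index it by object.
        index = {}
        for ft in feature_times:
            for row in second_subs.get(ft, []):
                merged = index.setdefault(row[key], {})
                for col, val in row.items():
                    if col != key:
                        merged[f"{col}@{ft}"] = val  # hypothetical naming scheme
        results[sample_time] = [
            {**index[r[key]], **r}
            for r in first_subs.get(sample_time, [])
            if r[key] in index
        ]
    return results


# Hypothetical sub-data tables keyed by sample time and feature time.
first_subs = {7: [{"id": 1, "label": 0}], 8: [{"id": 1, "label": 1}]}
second_subs = {
    3: [{"id": 1, "clicks": 2}],
    4: [{"id": 1, "clicks": 5}],
    6: [{"id": 2, "clicks": 9}],
}

results = conditional_join(first_subs, second_subs, mapping)
```

In this sketch three tasks are executed, one per sample time, instead of one task per pair of first and second sub-data tables.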

Compared with the traditional scheme of data table splicing, the method provided by the embodiment of the application provides a new framework (shown in fig. 2) oriented to general scenarios in order to improve overall efficiency. The method mainly comprises four core optimization points. The first is framework optimization: a large splicing task is broken down into a plurality of small tasks for execution. The second is improving execution efficiency through compression filtering, optimizing the splicing by dynamically obtaining the filtered data tables. The third is that the storage layer storing the splicing results improves the data processing efficiency of the downstream component in a Bucket manner. The fourth is optimization targeted at the specification of the data-layer data and the business logic. In addition, the divide-and-conquer processing (multi-threaded splicing) adopted by the method avoids large-scale, one-off data processing calculation, so that the calculation can be completed with small-scale computing resources. The requirements on cluster size, number of processors and memory consumption are all reduced. The method provided by the embodiment of the application can also be applied to data splicing in non-machine-learning scenarios, for example multi-table association performed in a data analysis scenario.

According to the method provided by the embodiment, the second data table is split and spliced with the first sub data table, so that the data table is spliced in a multithreading manner, and the efficiency of the data table splicing is improved.

According to the method provided by the embodiment, the target first sub-data table and the k target second sub-data tables are spliced respectively, so that the reasonable splitting of the splicing task is realized, the splicing of the data tables is supported in a multithreading manner, and the efficiency of splicing the data tables is improved.

According to the method provided by the embodiment, the data table is compressed and calculated, so that the calculation scale of the data table splicing calculation can be reduced, and the data table splicing efficiency is improved.

The method provided by the embodiment can ensure the consistency of the data between the spliced data tables by processing the first sub-data table and the second sub-data table in the same way.

The method provided by the embodiment also realizes flexible compression calculation of the data table by determining the mode of compression calculation of the data table according to the data type.

The method provided by the embodiment also splices the data tables by adaptively determining the connection operator, so that the data tables are spliced flexibly according to actual conditions.

According to the method provided by the embodiment, the splicing check point data is stored, so that the splicing progress can be restored to continue splicing under the condition of unexpected interruption of splicing.

According to the method provided by the embodiment, the data distribution consistency is further processed on the data table, so that the efficiency of the follow-up splicing of the data table is improved.

The method provided by the embodiment also ensures the consistency of data distribution in the data table through different processing strategies and realizes flexible processing of the data table.

According to the method provided by the embodiment, the metadata of the strategy is transmitted to the downstream component, so that the downstream component can efficiently read the data table, and the efficiency of splicing the data table is improved.

According to the method provided by the embodiment, the splicing task of the data table can be reduced by splicing the data table according to the mapping relation, so that excessive data I/O consumption is avoided.

The method provided by the embodiment also provides a mode for realizing data table splicing, which can be applied to model training scenes, by splicing the sample table and the feature table.

It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, the data in the first data table and the data in the second data table, etc. related to the present application are acquired under the condition of sufficient authorization.

It should be noted that, the sequence of the steps of the method provided in the embodiment of the present application may be appropriately adjusted, the steps may also be increased or decreased according to the situation, and any method that is easily conceivable to be changed by those skilled in the art within the technical scope of the present disclosure should be covered within the protection scope of the present disclosure, so that no further description is given.

Fig. 9 is a schematic structural diagram of a splicing device for data tables according to an exemplary embodiment of the present application. The apparatus may be used in a computer device. As shown in fig. 9, the apparatus includes:

The acquiring module 901 is configured to acquire a first data table and k second data tables during a processing process of streaming data, where the first data table and the second data tables store data of a same object acquired according to a time dimension, and k is a positive integer.

The splitting module 902 is configured to split the first data table according to p time periods to obtain p first sub data tables, where p is a positive integer greater than 1.

The splicing module 903 is configured to splice, according to the p time periods, data belonging to the same object in the p first sub-data tables and the k second data tables, so as to obtain p×k third data tables.

The splicing module 903 is further configured to splice p×k third data tables.

In an optional design, the splitting module 902 is configured to split each of the k second data tables according to p time periods, to obtain k second sub-data tables corresponding to each of the p time periods. The splicing module 903 is configured to splice the target first sub-data table corresponding to the ith period in the p periods and the k target second sub-data tables corresponding to the mth period in the p periods, where i and m are positive integers not greater than p, and i and m are the same or different.

In an optional design, the stitching module 903 is configured to stitch the target first sub-data table corresponding to the i-th period of the p periods with each of the k target second sub-data tables corresponding to the m-th period of the p periods.

In an alternative design, as shown in fig. 10, the apparatus further comprises:

A compression module 904, configured to perform compression calculation on the p first sub-data tables respectively to obtain p first compressed data tables. The compression module 904 is further configured to perform compression calculation on the k second data tables to obtain k second compressed data tables. The splicing module 903 is configured to splice, according to the p time periods, data belonging to the same object in the p first compressed data tables and the k second compressed data tables.

In an alternative design, compression module 904 is configured to:

Perform compression calculation on the target first sub-data table corresponding to the ith period of the p periods according to the data type of the data in that table, to obtain a first compressed data table corresponding to the ith period of the p periods, where i is a positive integer not greater than p. Using the same compression calculation manner as for the target first sub-data table, perform compression calculation on each of the second data tables, among the k second data tables, that are to be spliced with the target first sub-data table.

In an alternative design, compression module 904 is configured to:

In the case that the data type is an integer type or a large integer type, process the target first sub-data table using a Roaring Bitmap (an efficiently compressed bitmap) to obtain a first compressed data table corresponding to the ith period of the p periods. Alternatively, in the case that the data type is a string type, process the target first sub-data table using a Bloom filter to obtain a first compressed data table corresponding to the ith period of the p periods.

In an alternative design, the stitching module 903 is configured to:

Determine a target connection operator from a plurality of connection operators according to the features of the target first sub-data table and the k target second sub-data tables, and splice the target first sub-data table with the k target second sub-data tables using the target connection operator.

In an alternative design, as shown in fig. 11, the apparatus further comprises:

A storage module 905, configured to store splicing checkpoint data in the process of splicing the p first sub-data tables and the k second data tables. The splicing checkpoint data reflects the current splicing progress record of the p first sub-data tables and the k second data tables, and the splicing progress record is used to restore the current splicing progress of the p first sub-data tables and the k second data tables if splicing is interrupted.

In an alternative design, as shown in fig. 12, the apparatus further comprises:

A processing module 906, configured to perform data distribution consistency processing on the third data tables, where the data distribution consistency processing is used for data alignment and data ordering across different data tables. The splicing module 903 is configured to splice the processed third data tables.

In an alternative design, the policy for handling data distribution consistency includes at least one of:

ordering of data;

data bucketing;

recording the number of stored files.

In an alternative design, the components for performing the data table stitching include an upstream component that is a component for processing to generate the third data table and a downstream component that is a component for stitching the third data table. A splicing module 903, configured to:

Transmit the metadata of the policy to the downstream component, and splice the processed third data tables through the downstream component. The metadata of the policy is used by the downstream component to read the processed third data tables according to the policy.

In an alternative design, the stitching module 903 is configured to:

Where a first sub-data table and second sub-data tables having a mapping relationship exist among the p first sub-data tables and the k second sub-data tables corresponding to each of the p time periods, splice the first sub-data table and the second sub-data tables having the mapping relationship at the same time.

In an alternative design, the first data table is a sample table and the second data table is a feature table. The sample table is used for storing the classification of the sample object, and the feature table is used for storing the features of the sample object.

It should be noted that: the splicing device of the data table provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the splicing device of the data table and the splicing method embodiment of the data table provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the splicing device of the data table are detailed in the method embodiment, which is not described herein again.

Embodiments of the present application also provide a computer device comprising a processor and a memory, wherein at least one instruction, at least one program, a code set or an instruction set is stored in the memory and is loaded and executed by the processor to implement the data table splicing method provided by the above method embodiments.

Optionally, the computer device is a server. Illustratively, fig. 13 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.

The computer apparatus 1300 includes a central processing unit (Central Processing Unit, CPU) 1301, a system Memory 1304 including a random access Memory (Random Access Memory, RAM) 1302 and a Read-Only Memory (ROM) 1303, and a system bus 1305 connecting the system Memory 1304 and the central processing unit 1301. The computer device 1300 also includes a basic Input/Output system (I/O) 1306 to facilitate the transfer of information between various devices within the computer device, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.

The basic input/output system 1306 includes a display 1308 for displaying information, and an input device 1309, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a keyboard, mouse, or electronic stylus, among a plurality of other devices. Similarly, the input output controller 1310 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable storage media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable storage medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.

The computer-readable storage medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state memory devices, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to those described above. The system memory 1304 and mass storage device 1307 described above may be referred to collectively as memory.

The memory stores one or more programs configured to be executed by the one or more central processing units 1301, the one or more programs containing instructions for implementing the above-described method embodiments, the central processing unit 1301 executing the one or more programs to implement the methods provided by the various method embodiments described above.

According to various embodiments of the application, the computer device 1300 may also operate via a remote computer device connected to a network such as the Internet. That is, the computer device 1300 may be connected to the network 1312 via a network interface unit 1311 coupled to the system bus 1305; the network interface unit 1311 may also be used to connect to other types of networks or remote computer device systems (not shown).

The memory further stores one or more programs, and the one or more programs include instructions for performing the steps, executed by the computer device, of the methods provided by the embodiments of the present application.

An embodiment of the application also provides a computer-readable storage medium in which at least one instruction, at least one program, a code set, or an instruction set is stored. When the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor of a computer device, the data table splicing method provided by the foregoing method embodiments is implemented.

The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the splicing method of the data table provided by each method embodiment.
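As a purely illustrative sketch of the kind of computer instructions such a program product might contain, the following Python fragment walks through the flow summarized in the abstract: split the first data table by p time periods, splice each of the p sub-tables with each of the k second data tables on the object they share (p×k third data tables, computed by a thread pool), and then splice the third data tables. The row layout (`obj`, `ts`), the function names, the left-join semantics, and the assumption that `(obj, ts)` uniquely identifies a row of the first table are all hypothetical choices made for this sketch and are not taken from the application.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_by_period(rows, periods):
    """Split the first table into one sub-table per half-open (start, end) period."""
    subs = [[] for _ in periods]
    for row in rows:
        for idx, (start, end) in enumerate(periods):
            if start <= row["ts"] < end:
                subs[idx].append(row)
                break
    return subs


def splice_pair(first_sub, second_table):
    """Left-join one first sub-table with one second table on the object id."""
    by_obj = {r["obj"]: r for r in second_table}
    out = []
    for row in first_sub:
        merged = dict(row)
        extra = by_obj.get(row["obj"], {})
        merged.update({k: v for k, v in extra.items() if k != "obj"})
        out.append(merged)
    return out


def splice_tables(first_table, second_tables, periods, max_workers=4):
    """Return the p*k third tables and the final spliced table."""
    first_subs = split_by_period(first_table, periods)
    jobs = [(i, j) for i in range(len(periods)) for j in range(len(second_tables))]
    # Each (period, second table) pair is an independent, smaller join.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        thirds = list(
            pool.map(lambda ij: splice_pair(first_subs[ij[0]], second_tables[ij[1]]), jobs)
        )
    # Final splice: combine columns of rows that describe the same (obj, ts).
    final = defaultdict(dict)
    for table in thirds:
        for row in table:
            final[(row["obj"], row["ts"])].update(row)
    return thirds, sorted(final.values(), key=lambda r: (r["ts"], r["obj"]))
```

Because each of the p×k joins touches only one time slice of the first table and one second table, no single join has to hold the full tables in memory, which is the effect the abstract attributes to the splitting step.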

It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any modifications, equivalents, and alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.

Claims

1. A method for splicing a data table, the method comprising:

and splicing the p×k third data tables.

2. The method of claim 1, wherein splicing, according to the p time periods, the data belonging to the same object in the p first sub-data tables and the k second data tables to obtain the p×k third data tables comprises:

splitting each second data table in the k second data tables according to the p time periods to obtain k second sub-data tables corresponding to each time period in the p time periods;

and splicing the target first sub-data table corresponding to the ith period in the p periods and the k target second sub-data tables corresponding to the mth period in the p periods, wherein i and m are positive integers not greater than p, and i and m are identical or different.

3. The method of claim 2, wherein concatenating the target first sub-data table corresponding to the i-th period of the p periods with the k target second sub-data tables corresponding to the m-th period of the p periods, comprises:

4. A method according to any one of claims 1 to 3, wherein the method further comprises:

respectively carrying out compression calculation on the p first sub-data tables to obtain p first compression data tables;

respectively carrying out compression calculation on the k second data tables to obtain k second compressed data tables;

and splicing, according to the p time periods, the data belonging to the same object in the p first sub-data tables and the k second data tables to obtain the p×k third data tables comprises:

and according to the p time periods, splicing the data belonging to the same object in the p first compressed data tables and the k second compressed data tables.

5. The method of claim 4, wherein respectively performing compression calculation on the p first sub-data tables to obtain the p first compressed data tables comprises:

and respectively performing compression calculation on the k second data tables to obtain the k second compressed data tables comprises:

6. The method according to claim 5, wherein compressing the target first sub-data table according to the data type of the data in the target first sub-data table corresponding to the i-th period of the p periods, to obtain the first compressed data table corresponding to the i-th period of the p periods, comprises:

or,

7. The method of claim 2, wherein concatenating the target first sub-data table corresponding to the i-th period of the p periods with the k target second sub-data tables corresponding to the m-th period of the p periods, comprises:

8. A method according to any one of claims 1 to 3, wherein the method further comprises:

storing splicing checkpoint data in the process of splicing the p first sub-data tables and the k second data tables;

9. A method according to any one of claims 1 to 3, wherein the splicing of the p×k third data tables comprises:

performing data distribution consistency processing on the third data table, wherein the data distribution consistency processing is used to perform at least one of data alignment and data sorting across different data tables;

and splicing the processed third data table.

10. The method of claim 9, wherein a policy of the data distribution consistency processing comprises at least one of:

data sorting;

data bucketing;

recording the number of stored files.

11. The method of claim 10, wherein the components for performing data table splicing comprise an upstream component and a downstream component, the upstream component being a component that generates the third data table, and the downstream component being a component that splices the third data table;

the splicing the processed third data table comprises the following steps:

transmitting metadata of the policy to a downstream component;

splicing the processed third data table through the downstream component;

12. The method according to claim 2, wherein the method further comprises:

13. A method according to any one of claims 1 to 3, wherein the first data table is a sample table and the second data table is a feature table;

14. A splicing device for a data table, the device comprising:

15. A computer device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement a method of splicing data tables according to any one of claims 1 to 13.

16. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement a method of splicing data tables according to any of claims 1 to 13.

17. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, so that the computer device performs the method of splicing data tables according to any of claims 1 to 13.
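For illustration only, the data distribution consistency processing recited in claims 9 to 11 (sorting, bucketing, recording a file count, and passing the policy metadata to a downstream component) might be sketched in Python as below. The function names, the metadata keys, the CRC32-based bucket assignment, and the one-file-per-bucket assumption are all hypothetical and are not taken from the application; the sketch only shows why tables prepared under the same policy can be spliced bucket by bucket without redistributing rows.

```python
import zlib


def bucket_of(obj_id, num_buckets):
    """Stable bucket index for an object id (CRC32 is an assumption of this sketch)."""
    return zlib.crc32(obj_id.encode("utf-8")) % num_buckets


def upstream_prepare(table, num_buckets, key="obj"):
    """Upstream side: bucket rows by object id, sort each bucket, describe the policy."""
    buckets = [[] for _ in range(num_buckets)]
    for row in table:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda r: r[key])
    # Hypothetical metadata layout; one stored file per bucket is assumed.
    meta = {"sort_key": key, "num_buckets": num_buckets, "num_files": num_buckets}
    return buckets, meta


def downstream_splice(left, left_meta, right, right_meta):
    """Downstream side: splice bucket i of one table only with bucket i of the other."""
    if left_meta != right_meta:
        raise ValueError("inputs do not share a data distribution policy")
    key = left_meta["sort_key"]
    out = []
    for left_bucket, right_bucket in zip(left, right):
        right_by_key = {r[key]: r for r in right_bucket}
        for row in left_bucket:
            # Left-join semantics: keep every left row, add matching right columns.
            out.append({**right_by_key.get(row[key], {}), **row})
    return out
```

Since both inputs use the same bucket function and bucket count, rows for a given object always land in buckets with the same index, so the downstream component never has to compare rows across buckets. A merge join could additionally exploit the per-bucket sort order; the dictionary lookup is used here only to keep the sketch short.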