CN114428786A - Data processing method and device for distributed pipeline and storage medium - Google Patents

Data processing method and device for distributed pipeline and storage medium

Info

Publication number
CN114428786A
CN114428786A (application CN202111537765.5A)
Authority
CN
China
Prior art keywords
data
compression
data processing
task
compression level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111537765.5A
Other languages
Chinese (zh)
Inventor
方祝和
刘奇
黄东旭
崔秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pingkai Star Beijing Technology Co ltd
Original Assignee
Pingkai Star Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pingkai Star Beijing Technology Co ltd filed Critical Pingkai Star Beijing Technology Co ltd
Priority to CN202111537765.5A priority Critical patent/CN114428786A/en
Publication of CN114428786A publication Critical patent/CN114428786A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • H - ELECTRICITY
    • H03 - ELECTRONIC CIRCUITRY
    • H03M - CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 - Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the application provides a data processing method and device for a distributed pipeline, an electronic device, and a storage medium, relates to the technical field of databases, and is suitable for a cross-node multitask data exchange scene in an MPP database. The method comprises the steps of creating an asynchronously executed data processing thread and a data transmission thread; obtaining, through the data transmission thread, compressed data generated by an upstream task; decompressing the compressed data of the upstream task and executing the current task through the data processing thread, determining a current target compression algorithm, compressing the data generated by the current task accordingly, and storing the compressed data generated by the current task to a sending buffer area; and sending the data in the sending buffer to the corresponding downstream task through the data transmission thread. According to the embodiment of the application, the computing power of the nodes can be fully utilized, a dynamic balance between producing data and sending data is achieved, the throughput and the resource utilization rate of the whole pipeline are improved, and the time consumed by query execution is reduced.

Description

Data processing method and device for distributed pipeline and storage medium
Technical Field
The present application relates to the field of database technologies, and in particular, to a data processing method and apparatus for a distributed pipeline, and a storage medium.
Background
In order to analyze big data in real time and extract its value, Massively Parallel Processing (MPP) database systems are widely used, for example, SparkSQL, Impala, Greenplum, etc. An MPP database system runs on a cluster; one cluster includes a plurality of physical machines connected through a network. To take full advantage of CPU and network resources and improve performance, many MPP databases employ distributed pipelining to handle tasks. These tasks involve multiple machines working together across the network: each small piece of data produced by an upstream sub-task is sent through the network to a downstream sub-task for processing, and processing continues in this pipelined fashion until the data from the data source has been fully processed. The distributed pipeline technique uses the CPU and the network simultaneously, avoids materializing intermediate results, and reduces task completion time.
However, with advances in software and hardware, distributed pipelines face the problem of large amounts of idle CPU resources. From a software perspective, MPP databases increasingly use columnar execution engines, i.e., processing a subset of the data one column at a time rather than row by row. A columnar execution engine can make full use of the CPU cache, reduce cache misses, reduce the cost of interpreted code execution, and improve data processing speed. On the hardware side, CPUs have developed in the multi-core and many-core direction to improve computing capability. Together, these two trends reduce the time spent on the computation portion of the distributed pipeline by several times. However, 10-gigabit networks are still widely used in database clusters, and their data transmission speed is only about one tenth of the data processing speed of the CPU. An upstream subtask in the distributed pipeline must wait for the network to finish transmitting its data, and the downstream subtask likewise waits for the network to deliver data before it can process it. While the upstream and downstream subtasks wait for network transmission, the CPU resources can only sit idle, which causes waste.
Disclosure of Invention
Embodiments of the present application provide a data processing method and apparatus for a distributed pipeline, an electronic device, and a storage medium that overcome the above-mentioned problems or at least partially solve them.
In a first aspect, a data processing method for a distributed pipeline is provided, including:
creating an asynchronously executed data processing thread and data transmission thread, wherein the number of data processing threads and the number of data transmission threads are each not less than one;
obtaining compressed data generated by an upstream task through the data transmission thread;
decompressing the compressed data of the upstream task and executing the current task through the data processing thread, determining a current target compression algorithm, compressing the data generated by the current task according to the current target compression algorithm, and storing the compressed data generated by the current task to a sending buffer area;
and sending the data in the sending buffer area to a corresponding downstream task through the data transmission thread.
In a second aspect, there is provided a data processing apparatus comprising:
the thread creating module is used for creating an asynchronously executed data processing thread and data transmission thread, and the number of data processing threads and the number of data transmission threads are each not less than one;
the first transmission module is used for acquiring compressed data generated by an upstream task through the data transmission thread;
the data processing module is used for decompressing the compressed data of the upstream task and executing the current task through the data processing thread, determining a current target compression algorithm, compressing the data generated by the current task according to the current target compression algorithm, and storing the compressed data generated by the current task to a sending buffer area;
and the second transmission module is used for sending the data in the sending buffer area to the corresponding downstream task through the data transmission thread.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the method provided in the first aspect are implemented.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method as provided in the first aspect.
In a fifth aspect, the present application provides a computer program product, which includes computer instructions stored in a computer-readable storage medium, and when a processor of a computer device reads the computer instructions from the computer-readable storage medium, the processor executes the computer instructions, so that the computer device executes the steps of implementing the method as provided in the first aspect.
According to the data processing method, the data processing device, the electronic equipment, and the storage medium, at least one data processing thread and at least one data transmission thread that run asynchronously are created. Compressed data generated by an upstream task is obtained through the data transmission thread; the data processing thread decompresses the compressed data of the upstream task and executes the current task, then compresses the data generated by the current task with a current target compression algorithm and stores the compressed data in a sending buffer; and the data transmission thread sends the data in the sending buffer to the corresponding downstream task. In this way the computing power of the nodes is fully utilized, the production and transmission of data are dynamically balanced, the throughput and resource utilization of the whole pipeline are improved, and the time consumed by query execution is reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a technical architecture diagram of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a data interaction process between nodes according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a target compression algorithm according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a data processing method according to another embodiment of the present application;
fig. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Several terms referred to in this application will first be introduced and explained:
the operation of the Hash aggregation operator in the MPP database is divided into two steps: in the first step, local results are calculated on each machine according to local data, and then the data of the local results are transmitted to different machines according to distribution attributes to calculate final results. The speed of the task for calculating the local result of the local data can reach 300MB/s/cpu core, and if 20 threads are distributed, the total speed of the task on one machine can reach 6 GB/s. But the theoretical speed output over a ten-gigabit network is only 1.25 GB/s. Because the intermediate result is not cached in the pipelined execution, after the task of the local CPU calculating the local result calculates a few small blocks of data, the local CPU waits for the network to transmit the small blocks of data to the target machine, and then the local CPU only stops calculating and waits for the network I/O, so that the distributed local CPU utilization rate is only 1.25/6. The downstream local CPU also needs to wait for the upstream local result to calculate the final result, and has to wait for the network data input to be completed, thereby wasting the downstream local CPU more seriously.
This CPU waste further limits the scalability of the cluster on which the MPP database runs. When facing the challenge of big data, one expects that adding a new machine will reduce query time and that the benefit will grow roughly linearly with the machines added. Although adding a new machine increases computing power, it cannot speed up analytic queries, because the network transmission capability is not improved. On the other hand, as the data volume keeps growing, a new machine must be added once the computing capacity of the current cluster is exceeded, yet the network data transmission speed is not accelerated, so queries over the growing data become slower and slower.
In summary, the query performance and scalability of MPP databases are severely limited by the network transmission speed. To solve this problem, the embodiments of the present application introduce compression into the distributed pipeline, reducing the amount of data transmitted and improving the processing speed of the distributed pipeline.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 1, which schematically shows a technical architecture diagram of a data processing method according to an embodiment of the present application, the MPP database includes machine (also referred to as node) 1, machine 2, ..., machine n, where n is a natural number greater than or equal to 2; the number of nodes may be adjusted according to the computational load of the processing tasks in practical applications, and the embodiment of the present application is not particularly limited in this respect. Each query is split into a plurality of tasks executed on the nodes. It should be noted that upstream and downstream tasks are not necessarily executed by different nodes and may be executed by the same node. A node receives the data processed by the upstream task, processes it to obtain the data of the current task, and then sends the data processed by the current task on to the downstream task of that node.
In the structure shown in fig. 1, a query statement is divided into a plurality of pipelines for execution. Each pipeline is divided into a plurality of plan segments according to a preset rule, for example, whether the data is to be re-partitioned (i.e., distributed to different nodes for processing), and each plan segment is run against the data fragments on a node, which is called a task. Data is exchanged between tasks through the nodes. A node encapsulates data transmission and is further divided into a sending end and a receiving end. Each task obtains input data from the receiving end of the node, performs its computation, and sends the computation result to the downstream task through the sending end.
On the basis of a task scheduling module, a task execution module, and a network service module, the node of the embodiment of the application further introduces a compression module and a machine learning module. The task scheduling module determines the tasks to be executed by the node, for example, determining the upstream and downstream nodes of the node and thereby the objects with which data is exchanged. The task execution module performs the data processing, i.e., processes the data sent by the upstream node and sends the processed data to the downstream node through the network service module. The compression module manages the specific compression algorithms and compression levels, and the machine learning module can recommend a suitable compression algorithm/level for task execution according to the characteristics of the data and of the query statement.
The embodiment of the present application introduces a compression function into the query task. Referring to fig. 2, which exemplarily shows a schematic diagram of the data exchange flow between different nodes for upstream and downstream tasks, data is exchanged between tasks through a data exchange operator via the network. For each of the upstream and downstream tasks, a send/receive buffer is introduced between the work thread and the thread that sends/receives data, so that data processing work and data sending work can proceed asynchronously. After the sending end of the data exchange operator obtains the calculation result of the upstream task, it compresses the data and then puts it into the sending buffer. The sending thread takes a block of data from the sending buffer of the upstream task corresponding to the target address and sends it to the receiving end of the downstream task at that target address. After receiving the data, the receiving end of the downstream task first puts it into a receiving buffer, then decompresses the data taken from the receiving buffer and performs its computation. A node can run multiple such upstream and downstream tasks simultaneously.
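As a non-normative illustration of this asynchronous arrangement, the sketch below models the send buffer as a bounded channel shared between a work goroutine and a send goroutine; names such as compress, decompress, runCurrentTask, and sendToDownstream are placeholder stubs, not the APIs of any particular database:

```go
package main

import "sync"

// Block is one small unit of data exchanged between upstream and downstream tasks.
type Block struct{ Payload []byte }

// --- stubs so the sketch compiles; a real system plugs in actual codecs and I/O ---
func decompress(b []byte) []byte     { return append([]byte(nil), b...) }
func compress(b []byte) []byte       { return append([]byte(nil), b...) }
func runCurrentTask(b []byte) []byte { return b }
func sendToDownstream(Block)         {}

// workThread decompresses upstream blocks, executes the current task on them,
// compresses the result, and places the compressed block into the send buffer.
func workThread(recvBuffer <-chan Block, sendBuffer chan<- Block) {
	for in := range recvBuffer {
		raw := decompress(in.Payload)               // decompress the upstream result
		out := runCurrentTask(raw)                  // operator computation
		sendBuffer <- Block{Payload: compress(out)} // compress before buffering
	}
	close(sendBuffer)
}

// sendThread drains the send buffer and ships blocks to the downstream task,
// so the CPU-bound work above never blocks on the network directly.
func sendThread(sendBuffer <-chan Block, done *sync.WaitGroup) {
	defer done.Done()
	for blk := range sendBuffer {
		sendToDownstream(blk) // stand-in for the network send
	}
}

func main() {
	recvBuffer := make(chan Block, 64) // receive buffer (capacity is an assumption)
	sendBuffer := make(chan Block, 64) // send buffer
	var wg sync.WaitGroup
	wg.Add(1)
	go workThread(recvBuffer, sendBuffer)
	go sendThread(sendBuffer, &wg)

	recvBuffer <- Block{Payload: []byte("upstream block")} // stand-in for network input
	close(recvBuffer)
	wg.Wait()
}
```

The bounded buffer is what lets computation continue while earlier blocks are still in flight; its saturation is also the signal used later to adjust the compression level.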
Referring to fig. 3, a schematic flowchart of a data processing method according to an embodiment of the present application is exemplarily shown, and as shown in fig. 3, the method includes:
s101, establishing asynchronous execution data processing threads and data transmission threads, wherein the number of the data processing threads and the number of the data transmission threads are not more than one.
As can be seen from fig. 2, the node according to the embodiment of the present application may be provided with a data processing thread and a data transmission thread which are processed in parallel and executed asynchronously, and the number of the data processing thread and the number of the data transmission thread are not limited to one.
S102, compressed data generated by an upstream task is obtained through a data transmission thread;
it should be understood that, when the upstream task is executed by another node, the data transmission thread of the current task is in communication connection with the data transmission thread of the other node, the data transmission thread of the other node sends the compressed data generated by the upstream task, and the data transmission thread of the current task receives the compressed data generated by the upstream task. When the upstream task and the current task are both executed by the same node, the data transmission thread acquires compressed data generated by the upstream task from a sending cache region of the node.
S103, decompressing the compressed data of the upstream task and executing the current task through the data processing thread, determining a current target compression algorithm, compressing the data generated by the current task according to it, and storing the compressed data generated by the current task to a sending buffer area.
The data processing thread obtains the compressed data generated by the upstream task from the data transmission thread, first decompresses it, and then processes the decompressed result according to the processing rule of the current task, so as to obtain the data generated by the current task.
Because the computing power (measurable by the CPU resources of the node) required to process different data differs, and both the data processing threads and the data transmission threads consume computing power, the remaining computing power of the node changes dynamically, as does the amount of data in the sending buffer. The embodiment of the present application therefore provides a method for adaptively selecting a target compression algorithm and uses it to compress the data of the current task (i.e., the computation result of the current task), so as to better balance computing power against the sending buffer.
And S104, sending the data in the sending buffer area to the corresponding downstream task through the data transmission thread.
According to the data processing method and device, the data processed by the upstream node is received through the data transmission thread and stored in the receiving buffer; the data processing thread reads the data from the receiving buffer, decompresses it, executes the preset computation task, compresses the processed data, and stores the compressed data in the sending buffer; the sending thread then reads the data from the sending buffer and sends it to the downstream task.
In the embodiment of the present application, the ordering of the compressed data in the sending buffer may be determined by the time at which compression completes. An optional ordering is: the earlier a block finishes compression, the earlier it is placed in the sending buffer, and it may be sent with priority to the node processing the downstream task.
It should be noted that, because the data processing thread and the data transmission thread are processed asynchronously, steps S102 and S103 are executed asynchronously, and there may be no fixed precedence relationship between steps S102 and S103.
Aiming at the problems of CPU waste and expansibility limitation faced by a distributed pipeline, the embodiment of the application introduces data compression in the data block transmission process, and reduces the network transmission quantity. Namely, after calculating the data, each task compresses the data, then transmits the data to the downstream task through the network, and the downstream task decompresses the data after receiving the compressed data block and then continues the following tasks.
According to the data processing method, compressed data generated by an upstream task is obtained through the data transmission thread; the data processing thread decompresses the compressed data of the upstream task and executes the current task, then determines a current target compression algorithm, compresses the data generated by the current task, and stores the compressed data in the sending buffer; and the data transmission thread sends the data in the sending buffer to the corresponding downstream task. In this way, the computing power of the node is fully utilized, the throughput and resource utilization of the whole pipeline are improved, and the time consumed by query execution is reduced.
On the basis of the above embodiments, as an optional embodiment, the determining the current target compression algorithm includes:
if the saturation degree of the sending buffer area does not exceed a preset range, taking a compression algorithm used by the data processing thread to compress the previous block of data as a current target compression algorithm;
if the saturation degree of the sending buffer area exceeds a preset range, determining a current target compression algorithm according to a pre-established compression level list and a compression algorithm used by the data processing thread for compressing the previous block of data;
wherein the compression level list includes compression algorithms for at least two compression levels of the current task.
It should be understood that, a data processing thread may continuously obtain data generated by an upstream task, perform calculation, and determine a target compression algorithm to compress after calculation, so that, each time the target compression algorithm is determined, the saturation level of the sending buffer may be used as a basis, if the saturation level of the sending buffer does not exceed a preset range, the compression algorithm used for processing the previous data is continuously used, and if the saturation level exceeds the preset range, the current target compression algorithm is obtained by combining the compression level list based on the compression algorithm for processing the previous data.
The compression level list of the embodiment of the application comprises compression algorithms of at least two compression levels for the current task. It should be understood that different compression levels may have different degrees of compression on the data, and in an alternative embodiment of the present application, the higher the compression level, the higher the degree of compression on the data by the corresponding compression level.
It can be understood that when the node starts to operate, the sending buffer of the node is certainly not saturated, and at this time, the initial compression algorithm is continuously used, and as the data processing amount of the node increases, the sending buffer is gradually saturated, so that the compression algorithm is continuously adjusted according to the saturation degree of the sending buffer and the compression level list.
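A minimal sketch of this decision, assuming saturation is reported as a fraction of the send buffer's capacity; the threshold values and the one-level step are illustrative configuration choices, not values fixed by the embodiments:

```go
package compression

// Level indexes an entry in the pre-established compression level list;
// a higher level compresses harder and costs more CPU.
type Level int

// ChooseTargetLevel keeps the previous level while the send buffer's saturation
// stays inside [lower, upper]; otherwise it steps the level by one, staying
// within the bounds of the level list. Thresholds such as 0.10 and 0.95 are
// illustrative assumptions.
func ChooseTargetLevel(saturation, lower, upper float64, prev Level, levelCount int) Level {
	switch {
	case saturation > upper && int(prev) < levelCount-1:
		return prev + 1 // buffer filling up: compress harder to cut network traffic
	case saturation < lower && prev > 0:
		return prev - 1 // buffer draining: spend less CPU on compression
	default:
		return prev // within the preset range: keep the previous algorithm/level
	}
}
```

The compression algorithm actually applied to the next block would then be the entry of the compression level list indexed by the returned level.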
In selecting a target compression algorithm and the corresponding compression level, the embodiments of the present application address the following three problems:
(1) the compression function should improve data compression while using as little computing power as possible, because compressing data consumes extra computing power and competes with the original computation tasks in the MPP database; if compression lowers the overall speed at which a task produces data so that the network bandwidth cannot be fully used, introducing compression only has a negative effect;
(2) how to select a suitable compression algorithm: different queries access different data types with different characteristics, so a suitable compression algorithm needs to be chosen according to the data accessed by the current query in order to improve the data compression rate;
(3) tasks running at different computation speeds must also be accommodated: the production speed of tasks in an MPP database varies with the node, the data type, and the data distribution, so the compression effort cannot be constant. In an extreme case, a computing task may fully occupy the CPU while the data it produces can be transmitted in time, in which case data compression need not be introduced.
On the basis of the foregoing embodiments, as an optional embodiment, determining a current target compression algorithm according to a pre-established compression level list and the compression algorithm used by the data processing thread to compress the previous block of data includes:
determining a reference compression level corresponding to a compression algorithm used by the data processing thread to compress the last block of data in the compression level list;
and determining the adjusted compression level from the compression level list as a current target compression level according to the saturation degree of the sending buffer area and the reference compression level, and determining the current target compression algorithm according to the current target compression level.
Specifically, if the saturation degree of the sending buffer is lower than the lower limit value of the preset range, determining, from the compression level list, a compression level at which the data compression degree is lower than the reference compression level as the adjusted compression level;
and if the saturation degree of the sending buffer area is higher than the upper limit value of the preset range, determining a compression level of which the data compression degree is higher than the reference compression level from the compression level list as the adjusted compression level.
It should be noted that the compression level and the compression degree in the embodiment of the present application are proportional: the higher the compression level, the greater the degree to which the data is compressed, and the more computing power is required. When increasing the compression level, it may be raised by one level at a time or by several levels at a time.
According to the embodiment of the application, multiple compression levels can be preset according to the computing power required for compression and/or the degree to which the data is compressed, predetermined compression algorithms are classified into the different compression levels, and different compression algorithms are measured by a uniform compression level standard. Thus, after the target compression level is determined, one of the compression algorithms corresponding to that level can be selected as the current target compression algorithm.
For columnar data, there are lightweight compression and heavyweight compression specifically:
the lightweight compression is based on different types of data characteristics, and is further divided into logic compression and physical compression. The logic compression is compression according to the logic semanteme of data, such as dictionary compression; physical compression is further compression from the binary of the physical representation of the data.
Heavyweight compression is based on binary string compression.
For example, the first compression level is logical compression or physical compression; the second compression level is further physical compression after logical compression, further logical compression after physical compression, or heavyweight compression; and the third compression level further adds heavyweight compression on top of the second compression level. This progressive relationship can be represented by a uniform compression level, and the compression levels can be modeled according to characteristics such as data type and data distribution and stored in the compression module. It can be understood that the higher the compression level, the higher the compression rate, but the more computing power the compression consumes.
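As one concrete illustration of the lightweight logical compression named above, the sketch below dictionary-encodes a low-cardinality string column; bit-packing the resulting codes would be a further physical pass, and running Gzip over the packed bytes would correspond to a heavyweight pass. It is an example of the technique, not the encoder of any particular engine:

```go
package compression

// DictEncode replaces each value of a low-cardinality string column with a small
// integer code plus a shared dictionary, a typical lightweight logical compression.
// Example: ["CN","CN","US","CN"] -> dict ["CN","US"], codes [0 0 1 0].
func DictEncode(column []string) (dict []string, codes []int) {
	index := make(map[string]int)
	for _, v := range column {
		code, ok := index[v]
		if !ok {
			code = len(dict)
			index[v] = code
			dict = append(dict, v)
		}
		codes = append(codes, code)
	}
	return dict, codes
}
```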
On the basis of the foregoing embodiments, as an optional embodiment, determining, from the compression level list, a compression level at which the data compression degree is lower than the reference compression level includes:
reducing the compression level step by step, and judging whether the saturation degree of the sending buffer area is lower than the lower limit value of a preset range after the preset time after the compression level is reduced; and if the saturation degree of the sending buffer area is lower than the lower limit value of the preset range, continuing to reduce the compression level until the saturation degree of the sending buffer area is not lower than the lower limit value of the preset range. And if the saturation degree of the sending buffer zone is not lower than the lower limit value of the preset range, keeping the current compression level unchanged.
Determining a compression level from the list of compression levels at which the degree of data compression is higher than the reference compression level, comprising:
increasing the compression level step by step, and judging whether the saturation degree of the sending buffer area is higher than the upper limit value of the preset range after the preset time after the compression level is increased; and if the saturation degree of the sending buffer area is higher than the upper limit value of the preset range, continuously increasing the compression level until the saturation degree of the sending buffer area is not higher than the upper limit value of the preset range. And if the saturation degree of the sending buffer zone is not higher than the upper limit value of the preset range, keeping the current compression level unchanged.
Compared with increasing or decreasing the compression level by several levels at a time, adjusting it step by step, one level at a time, reduces the amplitude of fluctuations in computing power.
On the basis of the foregoing embodiments, as an alternative embodiment, the method further includes the step of creating the compression level list:
determining the data characteristics and the query characteristics of the current task, inputting the data characteristics and the query characteristics into a pre-trained neural network model, and obtaining a compression level list of the current task;
the neural network model is trained by taking data characteristics and query characteristics of a sample task as training samples and at least taking a compression level list of the sample task as a training label.
First, a certain number of sample data processing logs are collected; each log records the data characteristics, the query characteristics, and the compression level list of the processed data, where the query characteristics include the data source and the query statement corresponding to the task. The data characteristics and query characteristics are then used as training samples and the corresponding compression level list as the training label to train an initial model, yielding a compression algorithm prediction model. The initial model may be a single neural network model or a combination of several neural network models. It should be noted that the sample data processing logs selected in the embodiment of the present application may be those whose recorded sending buffer saturation meets a preset condition, for example between 70% and 90%.
The appropriate compression is selected by machine learning based on the production rate of the data computation portion and the available idle computing power, while taking into account data characteristics such as whether the data is sorted, its type, length, and distribution, and further considering whether sorting the data would suit a particular compression algorithm. That is to say, in the process of adjusting the compression level, a more suitable compression algorithm may be substituted, i.e., computing power consumption is reduced while the compression rate is maintained.
In addition, the compression algorithm prediction model (machine learning module) can map data characteristics to a specific data table or intermediate result and select an appropriate compression algorithm. In short, the machine learning approach learns rules from historical operating conditions and sets an appropriate compression algorithm/level for different scenarios. Even if the compression algorithm/level set in a new environment is not as expected, the compression algorithm is still dynamically adjusted according to the sending buffer, and the adjustment can be fed back to the machine learning module to improve the accuracy of the machine learning model.
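One way to picture the interface of such a machine learning module is sketched below; the struct fields and method names are assumptions chosen to mirror the features listed in this description, not an API defined by the embodiments:

```go
package compression

// DataFeatures and QueryFeatures mirror the feature categories described above;
// the exact fields are illustrative, not a fixed schema.
type DataFeatures struct {
	DataType       string // integer, floating point, string, date, ...
	Sorted         bool   // whether the column is ordered
	DistinctCount  int    // number of different values (repetition degree)
	MinLen, MaxLen int    // value length range
}

type QueryFeatures struct {
	Statement string // the query statement
	BaseTable string // accessed base table or intermediate result
}

// Advisor abstracts the machine learning module: it maps features to a compression
// level list plus an initial level, and accepts feedback on later adjustments so
// the prediction model can be optimized.
type Advisor interface {
	Suggest(d DataFeatures, q QueryFeatures) (levelList []string, initialLevel int)
	Feedback(d DataFeatures, q QueryFeatures, adjustedLevel int, positive bool)
}
```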
The concrete modeling selects a suitable compression algorithm level list and an initial compression level depending on the data features and query features, i.e., (data features, query) -> (compression algorithm level list, initial compression algorithm level). The data features include the data types supported by common databases, such as integers, floating point numbers, character strings, and dates, as well as maximum/minimum values, whether the data is ordered, and the data distribution (number of distinct values, repetition degree, and order). The query features include the query statement, the accessed base table, intermediate results, and similar information. These characteristics determine which logical compression to employ and the different compression levels. Level numbering starts from logical compression and then considers whether physical compression is applied on top of it (e.g., further bit- and byte-level compression), whether other optimizations such as SIMD are used, and whether heavyweight compression is added. For example, for an algorithm family like RLE, five levels 0-4 can be set (a data-structure sketch follows the list below):
0 means no compression algorithm is used;
1 denotes RLE algorithm;
2, further adopting Delta compression on the basis of 1;
3, further adopting a null compression algorithm of a bit hierarchy on the basis of 2, and adding SIMD optimization;
4 represents that a byte-level heavyweight compression algorithm such as Gzip, LZ, snappy, etc. is further adopted on the basis of 3.
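Written down as a level list in the sense used above; the string labels are only hypothetical identifiers, and each level adds work (and compression) on top of the previous one exactly as listed:

```go
package compression

// rleLevelList encodes the example 0-4 level list for an RLE-style algorithm family.
var rleLevelList = []string{
	0: "none",                        // 0: no compression
	1: "rle",                         // 1: run-length encoding
	2: "rle+delta",                   // 2: Delta compression on top of RLE
	3: "rle+delta+bitnull+simd",      // 3: bit-level null compression with SIMD optimization
	4: "rle+delta+bitnull+simd+gzip", // 4: byte-level heavyweight pass (Gzip/LZ/Snappy)
}
```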
The model provided by the embodiment of the application can be reconstructed according to different databases or applications, and the accuracy of the model is verified and updated by periodically acquiring data.
On the basis of the foregoing embodiments, as an optional embodiment, inputting the data features and the query features into a pre-trained neural network model, further includes:
obtaining an initial compression level of a current task;
correspondingly, the training label further comprises an initial compression level of the sample task;
and the compression algorithm corresponding to the initial compression level in the compression level list is at least used as the compression algorithm of the first block of data generated by the current task.
The thrashing phenomenon refers to a situation in which adjacent adjustments of the compression level keep going in opposite directions, up to a preset number of times (e.g., 5 times): for example, the level is increased one time and decreased the next, or decreased one time and increased the next.
Based on the above embodiments, in order to reduce the possibility that the sending buffer becomes empty because of thrashing, the following measures may be adopted so that the sending buffer always remains in a highly saturated state:
(1) expanding the sending buffer area
That is to say, if it is determined at a certain time that the compression level is increased and it is determined at the next time that the compression level is decreased, or it is determined at a certain time that the compression level is decreased and it is determined at the next time that the compression level is increased, both the two adjacent compression level adjustment modes are opposite, the size of the transmission buffer can be expanded, and the transmission buffer can store more data by increasing the capacity of the transmission buffer, thereby decreasing the probability of thrashing.
(2) Adjusting the proportion of different compression levels in each data processing thread;
because the node can process a plurality of data through a plurality of data processing threads at the same time, each data can determine a corresponding compression level according to the embodiment, when the modes of adjusting the compression levels twice are opposite, the proportion of the compression algorithm of the higher compression level can be increased or reduced according to the actual situation.
(3) Re-determining compression level lists and compression levels through a pre-trained neural network model
During the running of the task, the node continuously monitors the saturation of the sending buffer; once the saturation is too high or too low, the compression algorithm/level is adjusted, so thrashing is responded to in real time. The specific adjustment is also fed back to the machine learning module as a sample (which may be a positive or a negative sample) to optimize the compression algorithm prediction model. The current data characteristics and query characteristics of the task may also be collected and input to the pre-trained neural network model to obtain an updated compression level list and an updated compression level for the current task; a reference compression algorithm corresponding to the updated compression level is then determined from the updated compression level list, and the data is compressed according to that reference compression algorithm.
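A compact sketch of how the thrashing check and the three responses above might be wired together; the window size and the policy of applying all three responses in sequence are assumptions made for illustration:

```go
package compression

// adjustment records one change of compression level: +1 for an increase, -1 for a decrease.
type adjustment int

// IsThrashing reports whether, within the most recent window of adjustments,
// every two adjacent adjustments are opposite in direction (the thrashing
// pattern described above). The window size (e.g. 5) is a tunable assumption.
func IsThrashing(history []adjustment, window int) bool {
	if len(history) < window {
		return false
	}
	recent := history[len(history)-window:]
	for i := 1; i < len(recent); i++ {
		if recent[i]*recent[i-1] >= 0 { // same direction or no change: not thrashing
			return false
		}
	}
	return true
}

// OnThrash applies the three responses in the order the description lists them;
// a real system might choose only one of them depending on the situation.
func OnThrash(growSendBuffer, rebalanceLevels, repredict func()) {
	growSendBuffer()  // (1) enlarge the send buffer
	rebalanceLevels() // (2) shift the mix of compression levels across processing threads
	repredict()       // (3) ask the model for a fresh level list and compression level
}
```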
Referring to fig. 4, a schematic flow chart of determining a target compression algorithm according to an embodiment of the present application is exemplarily shown, as shown in the figure:
selecting a suitable compression level list and an initial compression level;
if the task is determined to be abnormal in operation, the process is exited, and the fault reason is checked;
if the task is determined to run normally, further judging whether the saturation degree of the sending buffer is higher than the upper limit value (for example, 95%) of a preset range;
if the saturation degree of the sending buffer is higher than the upper limit value of the preset range, raising the compression level and judging whether data thrashing occurs;
if the saturation degree of the sending buffer is not higher than the upper limit value of the preset range, further judging whether the saturation degree of the sending buffer is lower than the lower limit value (for example, 10%) of the preset range;
if the saturation degree of the sending buffer is lower than the lower limit value of the preset range, lowering the compression level and judging whether data thrashing occurs;
if the saturation degree of the sending buffer is not lower than the lower limit value of the preset range, judging whether data thrashing occurs;
if data thrashing is determined to occur, adjusting the size of the sending buffer, adjusting the proportion of the compression algorithms or updating the compression algorithm level, and then updating the data characteristics;
if data thrashing is determined not to occur, directly updating the data characteristics;
and feeding back the adjustment result to the neural network model so as to optimize the parameters of the neural network model.
Referring to fig. 5, a schematic flow chart of a data processing method according to another embodiment of the present application is exemplarily shown, and as shown, the method includes:
s201, training a neural network model by taking data characteristics and query characteristics of a sample task as training samples and at least taking a compression level list and an initial compression level of the sample task as training labels;
s202, determining data characteristics and query characteristics of the current task, inputting the data characteristics and the query characteristics to a pre-trained neural network model, and obtaining a compression level list and an initial compression level of the current task;
s203, creating asynchronously executed data processing threads and data transmission threads, wherein the number of data processing threads and the number of data transmission threads are each not less than one;
s204, compressed data generated by an upstream task is obtained through a data transmission thread and is placed in a receiving buffer area;
s205, acquiring compressed data of the upstream task from the receiving buffer area through the data processing thread, decompressing and executing the current task;
s206, if the saturation degree of the sending buffer area does not exceed the preset range, taking a compression algorithm used by the data processing thread to compress the previous block of data as a current target compression algorithm; otherwise, determining the current target compression algorithm according to a pre-established compression level list and the compression algorithm used by the data processing thread to compress the previous block of data;
s207, compressing data generated by the current task according to a current target compression algorithm, and storing the compressed data of the current task to a sending buffer area;
and S208, sending the data in the sending buffer to the corresponding downstream task through the data transmission thread.
An embodiment of the present application provides a data processing apparatus for a distributed pipeline. As shown in fig. 6, the apparatus may include a thread creating module 601, a first transmission module 602, a data processing module 603, and a second transmission module 604. Specifically:
a thread creating module 601, configured to create an asynchronously executed data processing thread and data transmission thread, where the number of data processing threads and the number of data transmission threads are each not less than one;
a first transmission module 602, configured to obtain compressed data generated by an upstream task through the data transmission thread;
the data processing module 603 is configured to decompress the compressed data of the upstream task and execute the current task through the data processing thread, determine a current target compression algorithm, compress the data generated by the current task according to it, and store the compressed data of the current task in the sending buffer;
a second transmission module 604, configured to send the data in the sending buffer to the corresponding downstream task through the data transmission thread.
Alternatively, the first transmission module 602 and the second transmission module 604 may be implemented by different modules included in the same transmission device or different operation modes of the same transmission device.
The data processing apparatus provided in the embodiment of the present application specifically executes the processes of the foregoing method embodiments; for details, reference is made to the contents of the foregoing data processing method embodiments, which are not repeated here. The data processing apparatus provided by the embodiment of the application obtains compressed data generated by an upstream task through the data transmission thread; the data processing thread decompresses the compressed data of the upstream task, executes the current task, determines a current target compression algorithm to compress the data generated by the current task, and stores the compressed data in the sending buffer; and the data transmission thread sends the data in the sending buffer to the corresponding downstream task, thereby improving the throughput and resource utilization of the pipeline and reducing the time consumed by query execution.
An embodiment of the present application provides an electronic device, including: a memory and a processor; at least one program stored in the memory for execution by the processor, which when executed by the processor, implements: the method comprises the steps of creating at least one data processing thread and at least one data transmission thread which are processed asynchronously, executing a current task on first data generated by an upstream task through the data processing thread, generating second data, compressing the second data, storing the compressed data into a sending buffer area, receiving the first data generated by the upstream task through the data transmission thread and sending the data in the sending buffer area to a corresponding downstream task, and therefore the throughput and the resource utilization rate of nodes can be improved, and the time consumed by query execution is reduced.
In an alternative embodiment, an electronic device is provided, as shown in fig. 7, the electronic device 4000 shown in fig. 7 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
The embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the corresponding content of the foregoing method embodiments. Compressed data generated by an upstream task is obtained through the data transmission thread; the data processing thread decompresses the compressed data of the upstream task, executes the current task, further compresses the data generated by the current task with the current target compression algorithm, and stores the compressed data in the sending buffer; and the data transmission thread sends the data in the sending buffer to the corresponding downstream task, so that, compared with the prior art, the throughput and resource utilization of the pipeline are improved and the time consumed by query execution is reduced.
The embodiment of the present application provides a computer program product, which includes computer instructions stored in a computer-readable storage medium; when a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, the computer device executes the content shown in the foregoing method embodiments. Compressed data generated by an upstream task is obtained through the data transmission thread; the data processing thread decompresses the compressed data of the upstream task, executes the current task, further compresses the data generated by the current task with the current target compression algorithm, and stores the compressed data in the sending buffer; and the data transmission thread sends the data in the sending buffer to the corresponding downstream task, so that, compared with the prior art, the throughput and resource utilization of the pipeline are improved and the time consumed by query execution is reduced.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principles of the present application, and such modifications and refinements shall also fall within the protection scope of the present application.

Claims (14)

1. A method for data processing in a distributed pipeline, comprising:
creating asynchronously executed data processing threads and data transmission threads, wherein the number of the data processing threads and the number of the data transmission threads are each not less than one;
obtaining compressed data generated by an upstream task through the data transmission thread;
decompressing the compressed data of the upstream task and executing the current task through the data processing thread, determining a current target compression algorithm, compressing the data generated by the current task according to the current target compression algorithm, and storing the compressed data generated by the current task to a sending buffer area;
and sending the data in the sending buffer area to a corresponding downstream task through the data transmission thread.
2. The data processing method of claim 1, wherein the determining a current target compression algorithm comprises:
if the saturation degree of the sending buffer area does not exceed a preset range, taking a compression algorithm used by the data processing thread to compress the previous block of data as the current target compression algorithm;
if the saturation degree of the sending buffer area exceeds the preset range, determining the current target compression algorithm according to a pre-established compression level list and a compression algorithm used by the data processing thread for compressing the previous block of data;
wherein the compression level list includes compression algorithms for at least two compression levels of the current task.
3. The data processing method of claim 2, wherein determining the current target compression algorithm according to the pre-established compression level list and the compression algorithm used by the data processing thread to compress the previous block of data comprises:
determining, in the compression level list, a reference compression level corresponding to the compression algorithm used by the data processing thread to compress the previous block of data;
and determining the adjusted compression level from the compression level list as a current target compression level according to the saturation degree of the sending buffer area and the reference compression level, and determining the current target compression algorithm according to the current target compression level.
4. The data processing method of claim 3, wherein determining the adjusted compression level from the compression level list according to the saturation level of the transmission buffer and the reference compression level comprises:
if the saturation degree of the sending buffer area is lower than the lower limit value of the preset range, determining, from the compression level list, a compression level whose degree of data compression is lower than that of the reference compression level as the adjusted compression level;
and if the saturation degree of the sending buffer area is higher than the upper limit value of the preset range, determining, from the compression level list, a compression level whose degree of data compression is higher than that of the reference compression level as the adjusted compression level.
5. The data processing method according to any one of claims 2 to 4, further comprising the following step of establishing the compression level list:
determining the data characteristics and the query characteristics of the current task, inputting the data characteristics and the query characteristics into a pre-trained neural network model, and obtaining a compression level list of the current task;
the neural network model is trained by taking data characteristics and query characteristics of a sample task as training samples and at least taking a compression level list of the sample task as a training label.
6. The data processing method of claim 5, wherein inputting the data characteristics and the query characteristics into the pre-trained neural network model further comprises:
obtaining an initial compression level of a current task;
correspondingly, the training labels further comprise initial compression levels of the sample tasks;
and the compression algorithm corresponding to the initial compression level in the compression level list is used at least as the compression algorithm for the first block of data generated by the current task.
7. The data processing method of claim 4, further comprising:
if it is determined that a specific adjustment mode has occurred a preset number of times, expanding the capacity of the sending buffer area;
wherein the specific adjustment mode is that the compression level is adjusted in opposite directions in two adjacent adjustments.
8. The data processing method of claim 4, further comprising:
if it is determined that a specific adjustment mode has occurred a preset number of times, adjusting the proportion of different compression levels among the data processing threads;
wherein the specific adjustment mode is that the compression level is adjusted in opposite directions in two adjacent adjustments.
9. The data processing method of claim 4, further comprising:
if it is determined that a specific adjustment mode has occurred a preset number of times, obtaining current data characteristics and query characteristics of the current task;
inputting the current data characteristics and the query characteristics into the pre-trained neural network model to obtain an updated compression level list and an updated compression level for the current task;
and determining a reference compression algorithm corresponding to the updated compression level according to the updated compression level list, and compressing data according to the reference compression algorithm.
10. The data processing method of claim 5, wherein the data characteristics include at least one of data type, maximum, minimum, order, and data distribution.
11. The data processing method of claim 5, wherein the query characteristics include a query statement and a data source corresponding to the current task.
12. A data processing apparatus for a distributed pipeline, comprising:
the thread creating module is used for creating asynchronously executed data processing threads and data transmission threads, wherein the number of the data processing threads and the number of the data transmission threads are each more than one;
the first transmission module is used for acquiring compressed data generated by an upstream task through the data transmission thread;
the data processing module is used for decompressing the compressed data of the upstream task and executing the current task through the data processing thread, determining a current target compression algorithm, compressing the data generated by the current task according to the current target compression algorithm, and storing the compressed data generated by the current task to a sending buffer area;
and the second transmission module is used for sending the data in the sending buffer area to the corresponding downstream task through the data transmission thread.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the data processing method according to any of claims 1 to 11 are implemented when the computer program is executed by the processor.
14. A computer-readable storage medium storing computer instructions for causing a computer to perform the steps of the data processing method according to any one of claims 1 to 11.
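As an illustration of the saturation-driven adjustment recited in claims 2 to 4 and 7, the following Go sketch shows one way such logic could look; the thresholds, field names, and the decision to model compression algorithms as integer levels are assumptions made for illustration, not details taken from the claims.

```go
package pipeline

// levelList is a pre-established compression level list for the current
// task, ordered from lightest to heaviest compression.
type levelList []int

// adjuster tracks the reference compression level and recent adjustment
// directions for one data processing thread.
type adjuster struct {
	levels    levelList
	idx       int     // index of the reference (previously used) level
	lastDir   int     // -1 = lowered, +1 = raised, 0 = no adjustment yet
	flips     int     // consecutive adjustments in opposite directions
	flipLimit int     // preset count that triggers buffer expansion
	low, high float64 // preset saturation range, e.g. 0.25 and 0.75
}

// next returns the compression level for the next block given the current
// saturation of the sending buffer (used/capacity, in [0,1]); the second
// result reports whether the sending buffer should be expanded.
func (a *adjuster) next(saturation float64) (level int, expand bool) {
	dir := 0
	switch {
	case saturation < a.low && a.idx > 0:
		a.idx-- // buffer draining too fast: compress less, produce faster
		dir = -1
	case saturation > a.high && a.idx < len(a.levels)-1:
		a.idx++ // buffer filling up: compress harder, shrink queued bytes
		dir = +1
	default:
		// Within the preset range: keep the level used for the previous block.
	}

	// Detect the "specific adjustment mode": two adjacent adjustments in
	// opposite directions.
	switch {
	case dir != 0 && dir == -a.lastDir:
		a.flips++
	case dir != 0:
		a.flips = 0
	}
	if dir != 0 {
		a.lastDir = dir
	}

	// After the preset number of opposite adjustments, ask the caller to
	// enlarge the sending buffer instead of continuing to oscillate.
	if a.flips >= a.flipLimit {
		a.flips = 0
		return a.levels[a.idx], true
	}
	return a.levels[a.idx], false
}
```

In such a sketch, a data processing thread would consult the adjuster before compressing each block, passing in the current saturation of the sending buffer it feeds; when oscillation is reported, the pipeline would enlarge the buffer rather than keep flipping between adjacent levels.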
CN202111537765.5A 2021-12-15 2021-12-15 Data processing method and device for distributed pipeline and storage medium Pending CN114428786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111537765.5A CN114428786A (en) 2021-12-15 2021-12-15 Data processing method and device for distributed pipeline and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111537765.5A CN114428786A (en) 2021-12-15 2021-12-15 Data processing method and device for distributed pipeline and storage medium

Publications (1)

Publication Number Publication Date
CN114428786A true CN114428786A (en) 2022-05-03

Family

ID=81312010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111537765.5A Pending CN114428786A (en) 2021-12-15 2021-12-15 Data processing method and device for distributed pipeline and storage medium

Country Status (1)

Country Link
CN (1) CN114428786A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117909268A (en) * 2024-03-19 2024-04-19 麒麟软件有限公司 GPU driving optimization method
CN117909268B (en) * 2024-03-19 2024-05-24 麒麟软件有限公司 GPU driving optimization method

Similar Documents

Publication Publication Date Title
US20180357111A1 (en) Data center operation
US9053067B2 (en) Distributed data scalable adaptive map-reduce framework
Wu et al. Efficient pagerank and spmv computation on amd gpus
CN107122490B (en) Data processing method and system for aggregation function in packet query
Plale et al. dQCOB: managing large data flows using dynamic embedded queries
CN102479217B (en) Method and device for realizing computation balance in distributed data warehouse
US10268741B2 (en) Multi-nodal compression techniques for an in-memory database
WO2023179415A1 (en) Machine learning computation optimization method and platform
CN103701635A (en) Method and device for configuring Hadoop parameters on line
Meyerson et al. Online multidimensional load balancing
CN110659278A (en) Graph data distributed processing system based on CPU-GPU heterogeneous architecture
CN116263701A (en) Computing power network task scheduling method and device, computer equipment and storage medium
CN114428786A (en) Data processing method and device for distributed pipeline and storage medium
EP4091051A1 (en) Distributed computing pipeline processing
Zhang et al. Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster
Li et al. Performance analysis of cambricon mlu100
US20230325149A1 (en) Data processing method and apparatus, computer device, and computer-readable storage medium
Chen et al. MRSIM: mitigating reducer skew In MapReduce
Astsatryan et al. Performance-efficient Recommendation and Prediction Service for Big Data frameworks focusing on Data Compression and In-memory Data Storage Indicators
CN110728118A (en) Cross-data-platform data processing method, device, equipment and storage medium
Faber et al. Platform agnostic streaming data application performance models
CN112100446B (en) Search method, readable storage medium, and electronic device
CN113641674A (en) Adaptive global sequence number generation method and device
Lv et al. A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives
Shastry et al. Compression acceleration using GPGPU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination