CN111737347A

CN111737347A - Method and device for sequentially segmenting data on Spark platform

Info

Publication number: CN111737347A
Application number: CN202010540731.0A
Authority: CN
Inventors: 饶彭彦
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2020-10-02
Anticipated expiration: 2040-06-15
Also published as: CN111737347B

Abstract

The invention discloses a method and a device for sequentially segmenting data on a Spark platform. The method comprises the following steps: acquiring distribution information of a data set of time series data on the Spark platform; acquiring a first data dividing point and a second data dividing point which are determined in advance according to a preset target data segment in the time series data; respectively determining two-dimensional data set coordinates of the first data dividing point and the second data dividing point according to the distribution information; and determining the data between the first data dividing point and the second data dividing point in the data set according to the two-dimensional data set coordinates, and generating a data set corresponding to the target data segment. The invention provides a data set segmentation method with low memory and network consumption on a Spark platform.

Description

Method and device for sequentially segmenting data on Spark platform

Technical Field

The invention relates to a Spark platform, in particular to a method and a device for sequentially segmenting data on the Spark platform.

Background

Financial time series data are the values of a financial random variable recorded in time order. Financial time series data have distinctive statistical characteristics, such as volatility clustering and the leverage effect. To characterize these statistics well, it is important to build a sound statistical model of the data. Before statistical modeling, a series of processing steps must be applied to the financial time series data. One such step is segmentation in time order, which extracts one or more consecutive segments from the financial time series data.

Apache Spark is an analysis engine for large-scale data processing, used to build large-scale, low-latency data analysis applications. It efficiently supports multiple computing modes, including interactive queries and stream processing. The Spark platform abstracts data as a resilient distributed dataset (RDD) and distributes the RDD across multiple machines of the cluster, so that the machines can perform the same operation on the data simultaneously, realizing parallel processing of the data.

A resilient distributed dataset (RDD) is a read-only data set that can be distributed across multiple nodes in a cluster; all or part of its contents can be cached in memory and reused between computations. Each RDD is divided into a plurality of partitions that reside on different nodes in the cluster, which makes parallel computation on different nodes convenient. The RDD provides a highly restricted shared-memory model and cannot be modified. When memory is insufficient, the RDD is stored on disk.

In the prior art, the Spark platform, which uses parallel computing, only provides a function for randomly splitting data and cannot split data according to the order of the data, so it is difficult to segment time series data in sequence on the Spark platform.

Disclosure of Invention

The present invention provides a method and an apparatus for sequentially segmenting data on a Spark platform, in order to solve at least one technical problem in the above background art.

In order to achieve the above object, according to an aspect of the present invention, there is provided a method for sequentially slicing data on a Spark platform, the method including:

acquiring distribution information of a data set of time series data on a Spark platform, wherein the distribution information comprises: the partition number of each partition of the data set and the data quantity corresponding to each partition;

acquiring a first data dividing point and a second data dividing point which are determined in advance according to a target data segment in the preset time sequence data;

respectively determining two-dimensional data set coordinates of the first data dividing point and the second data dividing point according to the distribution information, wherein a first coordinate in the two-dimensional data set coordinates is used for representing a partition number of a partition where the data dividing point is located, and a second coordinate in the two-dimensional data set coordinates is used for representing a data serial number of the data dividing point in the partition;

and determining data between the first data segmentation point and the second data segmentation point in the data set according to the two-dimensional data set coordinates, and generating a data set corresponding to the target data segment.

Optionally, the determining data between the first data segmentation point and the second data segmentation point in the data set according to the two-dimensional data set coordinate includes:

comparing the data of the time sequence data stored in each partition with the two-dimensional data set coordinates to obtain the data between the first data segmentation point and the second data segmentation point in the data set.

Optionally, the determining data between the first data segmentation point and the second data segmentation point in the data set according to the two-dimensional data set coordinate includes:

determining a two-dimensional coordinate of each datum in a data set of time sequence data, wherein two coordinates in the two-dimensional coordinates are used for representing a partition number of a partition where the datum is located and a data sequence number of the datum in the partition;

and comparing the two-dimensional coordinates of each datum with the coordinates of the two-dimensional data set, and determining the datum between the first datum segmentation point and the second datum segmentation point in the data set.

Optionally, the first data segmentation point is the first data of the target data segment, and the second data segmentation point is the last data of the target data segment.

In order to achieve the above object, according to another aspect of the present invention, there is provided an apparatus for sequentially slicing data on a Spark platform, the apparatus including:

a data set distribution information obtaining unit, configured to obtain distribution information of a data set of time-series data on a Spark platform, where the distribution information includes: the partition number of each partition of the data set and the data quantity corresponding to each partition;

the segmentation point information acquisition unit is used for acquiring a first data segmentation point and a second data segmentation point which are determined in advance according to a target data segment in the preset time sequence data;

the two-dimensional data set coordinate determination unit is used for respectively determining two-dimensional data set coordinates of the first data dividing point and the second data dividing point according to the distribution information, wherein the first coordinate in the two-dimensional data set coordinates is used for representing a partition number of a partition where the data dividing point is located, and the second coordinate in the two-dimensional data set coordinates is used for representing a data serial number of the data dividing point in the partition;

and the data set generating unit is used for determining data between the first data segmentation point and the second data segmentation point in the data set according to the two-dimensional data set coordinates and generating a data set corresponding to the target data segment.

Optionally, the data set generating unit includes:

and the data screening module is used for comparing the data of the time sequence data set stored in each partition with the coordinates of the two-dimensional data set to obtain the data between the first data segmentation point and the second data segmentation point in the data set.

Optionally, the data set generating unit includes:

the two-dimensional coordinate determination module is used for determining a two-dimensional coordinate of each datum in a data set of the time sequence data, wherein two coordinates in the two-dimensional coordinates are used for representing a partition number of a partition where the datum is located and a data serial number of the datum in the partition;

and the comparison module is used for comparing the two-dimensional coordinates of each datum with the coordinates of the two-dimensional data set and determining the data between the first data segmentation point and the second data segmentation point in the data set.

To achieve the above object, according to another aspect of the present invention, there is also provided a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for sequentially slicing data in a Spark platform when executing the computer program.

To achieve the above object, according to another aspect of the present invention, there is also provided a computer readable storage medium storing a computer program which, when executed in a computer processor, implements the steps in the method for sequentially slicing data in a Spark platform as described above.

The invention has the following beneficial effects: the embodiment of the invention determines the two-dimensional data set coordinates of the data dividing points according to the distribution information of the data set of the time series data on the Spark platform, and then screens the data according to the two-dimensional data set coordinates to generate the data set corresponding to the preset target data segment in the time series data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts. In the drawings:

FIG. 1 is a flow chart of a method for sequentially slicing data on a Spark platform according to an embodiment of the present invention;

FIG. 2 is a flow chart of data screening according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of partitioned storage of a data set according to an embodiment of the present invention;

FIG. 4 is a block diagram of an apparatus for sequentially slicing data on a Spark platform according to an embodiment of the present invention;

FIG. 5 is a block diagram of a data set generating unit according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 is a flowchart of a method for sequentially slicing data on a Spark platform according to an embodiment of the present invention, and as shown in fig. 1, the method for sequentially slicing data on the Spark platform according to the embodiment includes steps S101 to S104.

Step S101, obtaining distribution information of a data set of the time series data on a Spark platform, where the distribution information includes: the partition number of each partition of the data set and the data quantity corresponding to each partition.

In the embodiment of the present invention, the time series data is data with a time order, and it is stored in a distributed manner on the Spark platform as a data set (resilient distributed dataset, RDD). The data set can be stored in a distributed manner on a plurality of nodes in the Spark cluster, that is, the data set is divided into a plurality of partitions in the Spark cluster, each node corresponds to one partition, and parallel computation on the data in the data set is facilitated on different nodes.

Fig. 3 is a schematic diagram of the partitioned storage of a data set according to an embodiment of the present invention. The data set shown in the embodiment of fig. 3 consists of 4 partitions, and each partition contains two pieces of data. The data of the data set of the time series data is stored across the nodes of the Spark cluster in segments that follow the original order of the time series data, which is: "1: A, 2: B, 3: C, …, 8: H". For ease of understanding, take "1: A" in fig. 3 as an example: it is the 1st piece of data in the data set of the time series data, its serial number is 1, and its data content is "1: A"; the same applies to the remaining data. The data within a single partition is stored in the corresponding memory in its relative order.

In this embodiment of the present invention, in this step, distribution information of a data set of the time series data on a Spark platform is obtained, where the distribution information includes: a partition number of each partition of the data set storage, a total number of partitions of the data set storage, and a number of data in each partition of the data set storage.
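The distribution information of step S101 can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the patented implementation: plain Python lists stand in for the partitions of an RDD, the function name `distribution_info` is hypothetical, and partition numbers are 1-based to match fig. 3 (Spark itself numbers partitions from 0). On a real cluster the per-partition counts would have to be gathered by a per-partition operation instead.

```python
from typing import Dict, List, Sequence


def distribution_info(partitions: Sequence[Sequence[str]]) -> Dict[int, int]:
    """Map each 1-based partition number to the number of records it holds."""
    return {number: len(records)
            for number, records in enumerate(partitions, start=1)}


# The data set of fig. 3: 4 partitions with 2 records each, in original order.
partitions: List[List[str]] = [
    ["1:A", "2:B"],
    ["3:C", "4:D"],
    ["5:E", "6:F"],
    ["7:G", "8:H"],
]

info = distribution_info(partitions)
print(info)       # {1: 2, 2: 2, 3: 2, 4: 2}
print(len(info))  # total number of partitions: 4
```

Only these small counts need to reach the driver; the records themselves stay where they are.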

Step S102, obtaining a first data dividing point and a second data dividing point which are determined in advance according to a target data segment in the preset time sequence data.

In the embodiment of the invention, a user first sets the target data segment to be extracted from the time series data, and determines a first data dividing point and a second data dividing point according to the target data segment. In the embodiment of the present invention, a plurality of target data segments may exist simultaneously, and each target data segment corresponds to one first data dividing point and one second data dividing point. In an embodiment, the first data dividing point is the first piece of data of the target data segment, and the second data dividing point is the last piece of data of the target data segment.

In an alternative embodiment of the present invention, the first data dividing point and the second data dividing point are position information in the time series data, such as the 2nd piece of data in the time series data, the 40th piece of data in the time series data, and the like.

In an alternative embodiment of the invention, the first data dividing point and the second data dividing point may be represented by a data serial number in the time series data. For example, suppose the time series data includes 100 pieces of data in total, with serial numbers 1 to 100. If the 20th to the 50th pieces of data in the time series data are to be extracted, the first data dividing point is the 20th piece of data and is represented by the data serial number 20, and the second data dividing point is the 50th piece of data and is represented by the data serial number 50.

And step S103, respectively determining two-dimensional data set coordinates of the first data dividing point and the second data dividing point according to the distribution information, wherein a first coordinate in the two-dimensional data set coordinates is used for representing a partition number of a partition where the data dividing point is located, and a second coordinate in the two-dimensional data set coordinates is used for representing a data serial number of the data dividing point in the partition.

In the embodiment of the present invention, the first data dividing point and the second data dividing point are position information or data serial numbers in the time series data (or the data set), and this step converts the one-dimensional first data dividing point and second data dividing point into two-dimensional data set coordinates according to the distribution information of the data set. In an embodiment, the abscissa of the two-dimensional data set coordinate is the partition number of the partition in which the data corresponding to the data dividing point is located, and the ordinate is the serial number of that data within the partition (i.e., its ordinal position among the data in the partition).

In this embodiment of the present invention, in this step, the two-dimensional dataset coordinates of the first data dividing point and the second data dividing point may be determined according to the position information or the data serial number of the first data dividing point and the second data dividing point in the time series data (or the data set), the partition serial number of each partition in the data set, and the data number in each partition.

In the embodiment shown in fig. 3, the total number of partitions is 4 and the number of data in each partition is 2. Assume that the data set needs to be split into two parts and that the dividing points correspond to the 1st, 5th, 6th and 8th pieces of data of the time series data (or the data set): the 1st and 5th pieces of data are the dividing points of the first target data segment (sub data set), and the 6th and 8th pieces of data are the dividing points of the second target data segment (sub data set). Taking the 5th piece of data, namely data "5: E", as an example, the one-dimensional data set serial number "5" is converted into the two-dimensional coordinate required by the scheme. Since the first two partitions hold the first 4 pieces of data in total, the 5th piece of data is the 1st piece of data in the 3rd partition, and the dividing point "5" is therefore converted into (3, 1), i.e., "the 1st piece of data of partition No. 3".
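The conversion of step S103 can be sketched as a cumulative-count lookup. The sketch below is one possible realization under stated assumptions: the function name `to_coordinate` is hypothetical, the per-partition counts are passed as a list in partition order, and both coordinates are 1-based as in fig. 3.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Tuple


def to_coordinate(serial: int, counts: List[int]) -> Tuple[int, int]:
    """Convert a 1-based global serial number into
    (partition number, serial number within that partition), both 1-based."""
    ends = list(accumulate(counts))  # global serial of the last record per partition
    if not ends or not 1 <= serial <= ends[-1]:
        raise ValueError("serial number lies outside the data set")
    idx = bisect_left(ends, serial)  # first partition whose last serial >= serial
    before = ends[idx] - counts[idx]  # records held by all earlier partitions
    return idx + 1, serial - before


counts = [2, 2, 2, 2]  # distribution information of fig. 3
print(to_coordinate(5, counts))  # (3, 1)
print(to_coordinate(1, counts))  # (1, 1)
print(to_coordinate(6, counts))  # (3, 2)
print(to_coordinate(8, counts))  # (4, 2)
```

Because only the counts are consulted, the conversion runs on the driver without touching the partition data; empty partitions are skipped naturally by the lookup.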

Step S104, determining data between the first data segmentation point and the second data segmentation point in the data set according to the two-dimensional data set coordinates, and generating a data set corresponding to the target data segment.

In an embodiment, in this step, the data of the time series data set stored in each partition may be compared with the coordinates of the two-dimensional data set to obtain data between the first data segmentation point and the second data segmentation point in the data set.

In an embodiment, the data between the first data cut point and the second data cut point includes data corresponding to the first data cut point and the second data cut point.

In an embodiment, in this step, a mapPartitionsWithIndex operation is performed to compare the data of each partition of the data set with the two-dimensional data set coordinates obtained in step S103, and the data from the first data dividing point to the second data dividing point is determined. Then, a new data set, namely the data set corresponding to the target data segment, is generated by a flatMap operation of Spark without moving the data in memory and without any shuffle operation; the generated new data set is a subset of the data set of the time series data.

As shown in fig. 3, if the 1st to 5th pieces of data in the data set need to be obtained, the dividing points are converted into the coordinates (1, 1) (i.e., the 1st piece of data of partition No. 1) and (3, 1) (i.e., the 1st piece of data of partition No. 3). After the data of each partition is compared with the coordinates of the dividing points, all the data of partition 1 and partition 2 and the 1st piece of data of partition 3 are obtained, and these data are converted into a new data set (RDD), namely the data set corresponding to the target data segment. The generated new data set is not stored separately; it can be understood as adding an index through which the data are distinguished. For example, the original RDD covers data 1 to 8 and the new RDD covers data 1 to 5, while the data is not changed in the storage media of the nodes of the cluster and is not moved across nodes.
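The per-partition screening described above can be sketched as follows. This is an illustration only, not the patented code: plain Python lists stand in for RDD partitions, the names `slice_partition` and `slice_dataset` are hypothetical, and coordinates are 1-based as in fig. 3. In PySpark, a function of a similar shape could be passed to `RDD.mapPartitionsWithIndex`, which supplies a 0-based partition index and an iterator over the partition's records.

```python
from typing import List, Sequence, Tuple

Coordinate = Tuple[int, int]  # (partition number, serial number in partition)


def slice_partition(number: int, records: Sequence[str],
                    first: Coordinate, last: Coordinate) -> List[str]:
    """Return the records of one partition lying between the two dividing
    points, both dividing points included."""
    (first_part, first_serial), (last_part, last_serial) = first, last
    if number < first_part or number > last_part:
        return []  # partition lies entirely outside the target segment
    start = first_serial - 1 if number == first_part else 0
    stop = last_serial if number == last_part else len(records)
    return list(records[start:stop])


def slice_dataset(partitions: Sequence[Sequence[str]],
                  first: Coordinate, last: Coordinate) -> List[List[str]]:
    """Apply the per-partition screening to every partition independently."""
    return [slice_partition(number, records, first, last)
            for number, records in enumerate(partitions, start=1)]


partitions = [["1:A", "2:B"], ["3:C", "4:D"], ["5:E", "6:F"], ["7:G", "8:H"]]

first_segment = slice_dataset(partitions, (1, 1), (3, 1))
second_segment = slice_dataset(partitions, (3, 2), (4, 2))
print(first_segment)   # [['1:A', '2:B'], ['3:C', '4:D'], ['5:E'], []]
print(second_segment)  # [[], [], ['6:F'], ['7:G', '8:H']]
```

Each partition decides locally which of its records to keep, so no record has to cross a partition boundary, which is the property the embodiment relies on to avoid a shuffle.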

Due to the characteristics of the data set (RDD), reducing memory and network consumption in the Spark cluster is an important way to improve data processing. The method directly selects, as required, the data held in the memory of each node in the Spark cluster (namely the data of each partition of the data set) and involves no extra RDD conversion of the data set, thereby avoiding network transmission within the Spark cluster, effectively reducing memory consumption and lowering the hard requirement on memory size. If, instead, the data of an existing RDD were numbered first and then segmented according to those numbers, multiple operations on the RDD and a cross-node global numbering would be needed, and the hardware resource consumption of the Spark cluster would be larger.

Fig. 2 is a flowchart of data screening according to an embodiment of the present invention, and as shown in fig. 2, the step S104 of determining data between the first data segmentation point and the second data segmentation point in the data set may specifically include a step S201 and a step S202.

Step S201, determining a two-dimensional coordinate of each data in a data set of the time series data, where two coordinates in the two-dimensional coordinates are used to indicate a partition number of a partition where the data is located and a data sequence number of the data in the partition.

In an alternative embodiment of the present invention, the two-dimensional coordinates of each piece of data in each partition of the data set may be determined according to the method of step S103. In an optional embodiment, the abscissa in the two-dimensional coordinates of a piece of data is the partition number of the partition where the data is located, and the ordinate is the data sequence number of the data in the partition.

Step S202, comparing the two-dimensional coordinates of each datum with the coordinates of the two-dimensional data set, and determining the data between the first data segmentation point and the second data segmentation point in the data set.

In the embodiment of the invention, the two-dimensional coordinates of the data are directly compared with the two-dimensional dataset coordinates of the first data dividing point and the second data dividing point, and the data from the first data dividing point to the second data dividing point in the dataset are determined.
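Steps S201 and S202 can be sketched by tagging every record with its own coordinate and comparing coordinates directly. The sketch assumes 1-based coordinates as in fig. 3 and uses plain Python lists in place of RDD partitions; the names `tag_with_coordinates` and `between` are hypothetical. It relies on the fact that Python compares tuples lexicographically, so ordering (partition number, serial number) pairs matches the original order of the data.

```python
from typing import Iterator, List, Sequence, Tuple

Coordinate = Tuple[int, int]  # (partition number, serial number in partition)


def tag_with_coordinates(
        partitions: Sequence[Sequence[str]]) -> Iterator[Tuple[Coordinate, str]]:
    """Step S201: pair every record with its two-dimensional coordinate."""
    for number, records in enumerate(partitions, start=1):
        for serial, record in enumerate(records, start=1):
            yield (number, serial), record


def between(partitions: Sequence[Sequence[str]],
            first: Coordinate, last: Coordinate) -> List[str]:
    """Step S202: keep the records whose coordinate lies between the two
    dividing points, both included (lexicographic tuple comparison)."""
    return [record
            for coordinate, record in tag_with_coordinates(partitions)
            if first <= coordinate <= last]


partitions = [["1:A", "2:B"], ["3:C", "4:D"], ["5:E", "6:F"], ["7:G", "8:H"]]

print(between(partitions, (1, 1), (3, 1)))  # ['1:A', '2:B', '3:C', '4:D', '5:E']
print(between(partitions, (3, 2), (4, 2)))  # ['6:F', '7:G', '8:H']
```

The coordinate of a record depends only on its own partition, so the tagging needs no global numbering across nodes.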

The above embodiment shows that the method can make up for the deficiency of the Apache Spark data splitting function, realize sequential splitting of data, and reduce the consumption of cluster memory and network. For example, when a machine learning model is constructed, the time series data can be split in time order and in a given proportion by adopting the method of the invention.
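As an illustration of the proportional, time-ordered split mentioned above, the following sketch derives the dividing points of two consecutive segments from a ratio. The function name `proportional_split_points`, the rounding rule (rounding down) and the requirement that both segments be non-empty are assumptions made for this example and are not specified in the embodiment.

```python
from typing import Tuple

Segment = Tuple[int, int]  # (first serial number, last serial number), 1-based


def proportional_split_points(total: int, ratio: float) -> Tuple[Segment, Segment]:
    """Split serial numbers 1..total into two consecutive segments, the first
    holding roughly `ratio` of the data (rounded down)."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    first_end = int(total * ratio)
    if first_end < 1 or first_end >= total:
        raise ValueError("both segments must contain at least one record")
    return (1, first_end), (first_end + 1, total)


# 8 records split 5 : 3 in time order, as in the example of fig. 3.
earlier, later = proportional_split_points(8, 0.625)
print(earlier)  # (1, 5)
print(later)    # (6, 8)
```

The resulting serial numbers are the one-dimensional dividing points that step S103 then converts into two-dimensional data set coordinates.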

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.

Based on the same inventive concept, an embodiment of the present invention further provides a device for sequentially partitioning data on a Spark platform, which can be used to implement the method for sequentially partitioning data on the Spark platform described in the foregoing embodiment, as described in the following embodiment. Because the principle of solving the problem of the device for sequentially segmenting data on the Spark platform is similar to that of the method for sequentially segmenting data on the Spark platform, the embodiment of the device for sequentially segmenting data on the Spark platform can refer to the embodiment of the method for sequentially segmenting data on the Spark platform, and repeated parts are not described again. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Fig. 4 is a block diagram of a structure of a device for sequentially slicing data on a Spark platform according to an embodiment of the present invention, and as shown in fig. 4, the device for sequentially slicing data on the Spark platform according to the embodiment of the present invention includes: a data set distribution information acquisition unit 1, a segmentation point information acquisition unit 2, a two-dimensional data set coordinate determination unit 3, and a data set generation unit 4.

A data set distribution information obtaining unit 1, configured to obtain distribution information of a data set of time-series data on a Spark platform, where the distribution information includes: the partition number of each partition of the data set and the data quantity corresponding to each partition.

And the segmentation point information acquisition unit 2 is configured to acquire a first data segmentation point and a second data segmentation point which are determined in advance according to a preset target data segment in the time sequence data.

And the two-dimensional data set coordinate determining unit 3 is configured to determine two-dimensional data set coordinates of the first data dividing point and the second data dividing point according to the distribution information, where a first coordinate in the two-dimensional data set coordinates is used to represent a partition number of a partition where the data dividing point is located, and a second coordinate in the two-dimensional data set coordinates is used to represent a data sequence number of the data dividing point in the partition.

And the data set generating unit 4 is configured to determine, according to the two-dimensional data set coordinates, data between the first data segmentation point and the second data segmentation point in the data set, and generate a data set corresponding to the target data segment.

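In an optional embodiment of the present invention, the data set generating unit 4 includes a data screening module, configured to compare the data of the time series data set stored in each partition with the two-dimensional data set coordinates to obtain the data between the first data segmentation point and the second data segmentation point in the data set.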

Fig. 5 is a block diagram of a data set generating unit according to an embodiment of the present invention, and as shown in fig. 5, in an alternative embodiment of the present invention, the data set generating unit 4 includes a two-dimensional coordinate determining module 401 and a comparing module 402.

The two-dimensional coordinate determining module 401 is configured to determine a two-dimensional coordinate of each piece of data in the data set of the time series data, where the two coordinates in the two-dimensional coordinate are used to indicate a partition number of a partition where the piece of data is located and a data sequence number of the piece of data in the partition.

A comparing module 402, configured to compare the two-dimensional coordinates of each data with the two-dimensional dataset coordinates, and determine data between the first data segmentation point and the second data segmentation point in the dataset.

To achieve the above object, according to another aspect of the present application, there is also provided a computer device. As shown in fig. 6, the computer device comprises a memory, a processor, a communication interface and a communication bus, wherein a computer program that can be run on the processor is stored in the memory, and the steps of the method of the above embodiment are realized when the processor executes the computer program.

The processor may be a Central Processing Unit (CPU). The Processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or a combination thereof.

The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and units, such as the corresponding program units in the above-described method embodiments of the present invention. By executing the non-transitory software programs, instructions and modules stored in the memory, the processor performs its various functional applications and data processing, that is, the method in the above method embodiment is realized.

The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor, and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be coupled to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more units are stored in the memory and when executed by the processor perform the method of the above embodiments.

The specific details of the computer device may be understood by referring to the corresponding related descriptions and effects in the above embodiments, and are not described herein again.
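For illustration only, the method that the processor executes can be sketched in plain Python without a Spark dependency. The sketch rests on assumptions that are not stated in this section: the distribution information is taken to be the number of records in each partition, a two-dimensional dataset coordinate is taken to be a (partition index, offset within partition) pair, and the segmentation points are given as global positions in the time-ordered data set. The names `locate` and `slice_partitions` are hypothetical and do not come from the original disclosure.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(partition_sizes, global_index):
    """Map a global position to an assumed 2-D coordinate (partition, offset)."""
    ends = list(accumulate(partition_sizes))  # cumulative end position of each partition
    total = ends[-1] if ends else 0
    if not 0 <= global_index < total:
        raise IndexError("global_index lies outside the data set")
    # First partition whose cumulative end is strictly greater than the index;
    # empty partitions are skipped automatically.
    partition = bisect_right(ends, global_index)
    start = ends[partition] - partition_sizes[partition]
    return partition, global_index - start


def slice_partitions(partitions, first, second):
    """Return, partition by partition, the records from `first` to `second` inclusive."""
    if first > second:
        raise ValueError("first segmentation point must not follow the second")
    sizes = [len(p) for p in partitions]  # assumed form of the distribution information
    (p1, o1), (p2, o2) = locate(sizes, first), locate(sizes, second)
    result = []
    for idx in range(p1, p2 + 1):
        lo = o1 if idx == p1 else 0                          # trim the head of the first partition
        hi = o2 + 1 if idx == p2 else len(partitions[idx])   # trim the tail of the last partition
        result.append(partitions[idx][lo:hi])
    return result


# Three partitions holding nine time-ordered records in total.
partitions = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
segment = slice_partitions(partitions, first=2, second=6)
```

In this sketch only the partitions between the two coordinates are touched, and each keeps its own records, so no data has to be gathered in one place. On an actual Spark platform the per-partition trimming would presumably run inside an operation such as `RDD.mapPartitionsWithIndex`; that mapping is likewise an assumption rather than something this section confirms.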

To achieve the above object, according to another aspect of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps of the method for sequentially segmenting data on a Spark platform as described above. It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. The storage medium may be a Random Access Memory (RAM), a flash memory, a Hard Disk Drive (HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device, and that they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; they may also be fabricated separately as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above describes only preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and changes to it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for sequentially segmenting data on a Spark platform is characterized by comprising the following steps:

2. The method for sequentially slicing data on a Spark platform according to claim 1, wherein said determining data between said first data slicing point and said second data slicing point in said dataset according to said two-dimensional dataset coordinates comprises:

3. The method for sequentially slicing data on a Spark platform according to claim 1, wherein said determining data between said first data slicing point and said second data slicing point in said dataset according to said two-dimensional dataset coordinates comprises:

4. The method of claim 1, wherein the first data segmentation point is the first data item of the target data segment, and the second data segmentation point is the last data item of the target data segment.

5. An apparatus for sequentially slicing data on a Spark platform, comprising:

6. The apparatus for sequentially slicing data on a Spark platform as claimed in claim 5, wherein said data set generating unit comprises:

7. The apparatus for sequentially slicing data on a Spark platform as claimed in claim 5, wherein said data set generating unit comprises:

8. The apparatus of claim 5, wherein the first data slicing point is the first data item of the target data segment, and the second data slicing point is the last data item of the target data segment.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when executed in a computer processor, implements the method of any one of claims 1 to 4.