CN111737347B - Method and device for sequentially segmenting data on Spark platform - Google Patents

Info

Publication number
CN111737347B
CN111737347B CN202010540731.0A CN202010540731A CN111737347B CN 111737347 B CN111737347 B CN 111737347B CN 202010540731 A CN202010540731 A CN 202010540731A CN 111737347 B CN111737347 B CN 111737347B
Authority
CN
China
Prior art keywords
data
data set
coordinates
dimensional
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010540731.0A
Other languages
Chinese (zh)
Other versions
CN111737347A (en
Inventor
饶彭彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202010540731.0A
Publication of CN111737347A
Application granted
Publication of CN111737347B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a device for sequentially segmenting data on a Spark platform. The method comprises: acquiring distribution information of a data set of time series data on the Spark platform; acquiring a first data dividing point and a second data dividing point determined in advance according to a preset target data segment in the time series data; determining two-dimensional data set coordinates of the first data dividing point and the second data dividing point respectively according to the distribution information; and determining, according to the two-dimensional data set coordinates, the data in the data set between the first data dividing point and the second data dividing point, and generating a data set corresponding to the target data segment. The invention provides a data set segmentation method with lower memory and network consumption on the Spark platform.

Description

Method and device for sequentially segmenting data on Spark platform
Technical Field
The invention relates to a Spark platform, in particular to a method and a device for sequentially segmenting data on the Spark platform.
Background
Financial time series data are the values taken by financial random variables, arranged in chronological order. Financial time series data have distinctive statistical characteristics, such as volatility clustering and the leverage effect. To characterize these statistics well, the data must be modeled appropriately. Before statistical modeling, the financial time series data must undergo a series of processing steps. One such step is splitting the data in time order, that is, cutting one or more required segments out of the financial time series data according to their chronological order.
Apache Spark is an analysis engine for large-scale data processing, used to build large, low-latency data analysis applications. It efficiently supports multiple computing modes, including interactive queries and stream processing. The Spark platform abstracts data into a Resilient Distributed Dataset (RDD) and distributes the data set across multiple machines of a cluster, so that the machines can perform the same operation on the data simultaneously, realizing parallel processing of the data.
A Resilient Distributed Dataset (RDD) is a read-only data set that can be distributed across multiple nodes in a cluster, and all or part of its contents can be cached in memory for reuse across multiple computations. Each RDD is divided into a plurality of partitions that reside on different nodes in the cluster, which makes parallel computation on those nodes convenient. The RDD thus provides a highly restricted shared-memory model whose data cannot be modified. When memory is insufficient, the RDD is saved to disk.
In the prior art, the Spark platform, which uses parallel computing, only provides a function for randomly splitting data and cannot split data according to data order, so it is difficult to split time series data in sequence on the Spark platform.
Disclosure of Invention
The invention provides a method and a device for sequentially segmenting data on a Spark platform in order to solve at least one technical problem in the background art.
To achieve the above object, according to one aspect of the present invention, there is provided a method of sequentially slicing data in a Spark platform, the method comprising:
acquiring distribution information of a data set of time sequence data on a Spark platform, wherein the distribution information comprises: partition numbers of each partition of the data set and data amounts corresponding to each partition;
acquiring a first data dividing point and a second data dividing point which are determined in advance according to a target data segment in the preset time sequence data;
respectively determining two-dimensional data set coordinates of the first data dividing point and the second data dividing point according to the distribution information, wherein the first coordinates in the two-dimensional data set coordinates are used for representing partition numbers of partitions where the data dividing points are located, and the second coordinates in the two-dimensional data set coordinates are used for representing data sequence numbers of the data dividing points in the partitions;
and determining data between the first data segmentation point and the second data segmentation point in the data set according to the two-dimensional data set coordinates, and generating a data set corresponding to the target data segment.
Optionally, the determining the data between the first data slicing point and the second data slicing point in the data set according to the two-dimensional data set coordinates includes:
and comparing the data of the data set of the time series data stored in each partition with the two-dimensional data set coordinates to obtain the data between the first data dividing point and the second data dividing point in the data set.
Optionally, the determining the data between the first data slicing point and the second data slicing point in the data set according to the two-dimensional data set coordinates includes:
determining two-dimensional coordinates of each datum in a data set of time sequence data, wherein two coordinates in the two-dimensional coordinates are used for representing a partition number of a partition where the data is located and a data sequence number of the data in the partition;
comparing the two-dimensional coordinates of each data with the two-dimensional data set coordinates, and determining the data between the first data dividing point and the second data dividing point in the data set.
Optionally, the first data slicing point is the first data of the target data segment, and the second data slicing point is the last data of the target data segment.
To achieve the above object, according to another aspect of the present invention, there is provided an apparatus for sequentially slicing data in a Spark platform, the apparatus comprising:
the data set distribution information acquisition unit is used for acquiring distribution information of a data set of time sequence data on a Spark platform, wherein the distribution information comprises: partition numbers of each partition of the data set and data amounts corresponding to each partition;
the dividing point information acquisition unit is used for acquiring a first data dividing point and a second data dividing point which are determined in advance according to a target data segment in the preset time sequence data;
the two-dimensional data set coordinate determining unit is used for respectively determining two-dimensional data set coordinates of the first data dividing point and the second data dividing point according to the distribution information, wherein a first coordinate in the two-dimensional data set coordinates is used for representing a partition number of a partition where the data dividing point is located, and a second coordinate in the two-dimensional data set coordinates is used for representing a data sequence number of the data dividing point in the partition;
and the data set generating unit is used for determining data between the first data dividing point and the second data dividing point in the data set according to the two-dimensional data set coordinates and generating a data set corresponding to the target data segment.
Optionally, the data set generating unit includes:
and the data screening module is used for comparing the data stored in each partition of the data set of the time sequence data with the coordinates of the two-dimensional data set to obtain the data between the first data dividing point and the second data dividing point in the data set.
Optionally, the data set generating unit includes:
a two-dimensional coordinate determining module, used for determining the two-dimensional coordinates of each datum in the data set of the time series data, wherein the two coordinates of the two-dimensional coordinates represent the partition number of the partition where the datum is located and the data sequence number of the datum in the partition;
and the comparison module is used for comparing the two-dimensional coordinates of each data with the two-dimensional data set coordinates and determining the data between the first data dividing point and the second data dividing point in the data set.
To achieve the above object, according to another aspect of the present invention, there is also provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the method for sequentially slicing data in a Spark platform when the processor executes the computer program.
To achieve the above object, according to another aspect of the present invention, there is also provided a computer readable storage medium storing a computer program which, when executed in a computer processor, implements the steps of the method for sequentially slicing data in a Spark platform described above.
The beneficial effects of the invention are as follows: according to the method, the two-dimensional data set coordinates of the data dividing points are determined according to the distribution information of the data sets of the time sequence data on the Spark platform, and then data screening is carried out according to the two-dimensional data set coordinates, so that the data sets corresponding to the target data segments in the preset time sequence data are generated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
FIG. 1 is a flow chart of a method for sequentially slicing data at a Spark platform according to an embodiment of the present invention;
FIG. 2 is a flow chart of data screening according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the storage of partitions of a dataset according to an embodiment of the invention;
FIG. 4 is a block diagram of an apparatus for sequentially slicing data on a Spark platform according to an embodiment of the present invention;
FIG. 5 is a block diagram showing the structure of a data set generating unit according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It is noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
Fig. 1 is a flowchart of a method for sequentially splitting data on a Spark platform according to an embodiment of the present invention. As shown in Fig. 1, the method includes steps S101 to S104.
Step S101, acquiring distribution information of a data set of time series data on a Spark platform, where the distribution information includes the partition number of each partition of the data set and the data amount corresponding to each partition.
In the embodiment of the invention, the time series data are data with a time order, and they are stored in a distributed manner in a data set (a Resilient Distributed Dataset, RDD) on the Spark platform. The data set can be distributed across a plurality of nodes in the Spark cluster; that is, the data set is divided into a plurality of partitions in the Spark cluster, with each node corresponding to one partition, which facilitates parallel computation on the data of the data set in the different nodes.
Fig. 3 is a schematic diagram of how each partition of a data set is stored according to an embodiment of the present invention. The data set shown in Fig. 3 consists of 4 partitions, each containing two pieces of data. The data of the time series data set are stored on the nodes of the Spark cluster in segments that follow the original order of the time series data, which is "1:A, 2:B, 3:C ... 8:H". For ease of understanding, take "1:A" in Fig. 3 as an example: it denotes the 1st piece of data in the data set of the time series data, referred to as serial number 1, and so on for the remaining data. Within a single partition, the data are stored in the corresponding memory in their relative order.
In the embodiment of the present invention, this step acquires distribution information of the data set of time series data on the Spark platform, where the distribution information includes: the partition number of each partition in which the data set is stored, the total number of partitions in which the data set is stored, and the amount of data in each of those partitions.
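As a rough illustration of step S101, the sketch below models the partitions of the Fig. 3 data set as plain Python lists and collects the partition numbers, the per-partition data amounts and the total number of partitions. The function name `distribution_info` and the list-based model are assumptions made for illustration only; the text does not specify how the counts are gathered from the RDD on a real cluster.

```python
from typing import List, Sequence, Tuple


def distribution_info(partitions: Sequence[Sequence[str]]) -> List[Tuple[int, int]]:
    """Return (partition number, data amount) pairs, numbering partitions from 1."""
    return [(number, len(records))
            for number, records in enumerate(partitions, start=1)]


# Hypothetical stand-in for the Fig. 3 layout: 4 partitions, 2 records each,
# stored in the original order of the time series data.
fig3 = [["1:A", "2:B"], ["3:C", "4:D"], ["5:E", "6:F"], ["7:G", "8:H"]]

info = distribution_info(fig3)
total_partitions = len(info)
total_records = sum(amount for _, amount in info)
```

Only these small counts, not the records themselves, are needed for the coordinate conversion in the later steps.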
Step S102, a first data dividing point and a second data dividing point which are determined in advance according to a target data segment in the preset time sequence data are obtained.
In the embodiment of the invention, a user first sets the target data segment to be extracted from the time series data, and determines a first data dividing point and a second data dividing point according to the target data segment. A plurality of target data segments can exist at the same time, and each target data segment corresponds to its own first data dividing point and second data dividing point. In an embodiment, the first data dividing point is the first data of the target data segment, and the second data dividing point is the last data of the target data segment.
In an alternative embodiment of the present invention, the first data dividing point and the second data dividing point are position information in the time series data, such as the 2nd data in the time series data, the 40th data in the time series data, and so on.
In an alternative embodiment of the present invention, the first data dividing point and the second data dividing point may be represented by data sequence numbers in the time series data. For example, suppose the time series data contain 100 pieces of data with sequence numbers 1 to 100. If the 20th to 50th pieces of data are to be extracted, the first data dividing point is the 20th piece of data, represented by data sequence number 20, and the second data dividing point is the 50th piece of data, represented by data sequence number 50.
Step S103, two-dimensional data set coordinates of the first data dividing point and the second data dividing point are respectively determined according to the distribution information, wherein a first coordinate in the two-dimensional data set coordinates is used for representing a partition number of a partition where the data dividing point is located, and a second coordinate in the two-dimensional data set coordinates is used for representing a data sequence number of the data dividing point in the partition.
In the embodiment of the invention, the first data dividing point and the second data dividing point are position information or data sequence numbers in the time series data (or data set), and these one-dimensional dividing points are converted into two-dimensional data set coordinates according to the distribution information of the data set. In an embodiment, the abscissa of the two-dimensional data set coordinates is the partition number of the partition where the data corresponding to the dividing point is located, and the ordinate is the data sequence number of that data within the partition (i.e. its position among the data of the partition).
In a specific embodiment of the present invention, this step may determine the two-dimensional data set coordinates of the first data dividing point and the second data dividing point according to the position information or data sequence numbers of the two dividing points in the time series data (or data set), the partition number of each partition in the data set, and the data amount in each partition.
In the embodiment shown in Fig. 3, the total number of partitions is 4 and each partition holds 2 pieces of data. Assume that the data set needs to be split into two parts and the dividing points are the 1st, 5th, 6th and 8th data of the time series data (or data set): the 1st and 5th data are the dividing points of a first target data segment (sub data set), and the 6th and 8th data are the dividing points of a second target data segment (sub data set). Taking the 5th data, namely "5:E", as an example, its one-dimensional data set sequence number "5" is converted into the two-dimensional coordinates required by the scheme. Since the first two partitions hold the first 4 pieces of data in total, the 5th piece of data is the 1st piece of data in the 3rd partition, so the dividing point "5" is converted into (3, 1), that is, "the 1st piece of data of the 3rd partition".
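The conversion described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the helper name `to_coordinate` is hypothetical, partitions and in-partition sequence numbers are numbered from 1 as in the text, and the per-partition data amounts are assumed to be listed in partition order.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import Sequence, Tuple


def to_coordinate(serial: int, sizes: Sequence[int]) -> Tuple[int, int]:
    """Map a 1-based global sequence number to (partition number, in-partition number)."""
    # ends[i] is the global sequence number of the last record in partition i + 1.
    ends = list(accumulate(sizes))
    if not ends or not 1 <= serial <= ends[-1]:
        raise ValueError("sequence number lies outside the data set")
    # First partition whose last record has a sequence number >= serial.
    index = bisect_left(ends, serial)
    records_before = ends[index] - sizes[index]
    return index + 1, serial - records_before


# Fig. 3 layout: 4 partitions with 2 records each.
sizes = [2, 2, 2, 2]
first_segment = (to_coordinate(1, sizes), to_coordinate(5, sizes))
second_segment = (to_coordinate(6, sizes), to_coordinate(8, sizes))
```

With the Fig. 3 sizes, sequence number 5 maps to (3, 1), matching the worked example in the text. Partitions of unequal size, including empty ones, are handled by the same cumulative-count lookup.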
Step S104, determining data between the first data segmentation point and the second data segmentation point in the data set according to the two-dimensional data set coordinates, and generating a data set corresponding to the target data segment.
In an embodiment, this step may specifically compare the data of the time series data set stored in each partition with the two-dimensional data set coordinates to obtain the data between the first data dividing point and the second data dividing point in the data set.
In an embodiment, the data between the first data dividing point and the second data dividing point include the data corresponding to the first data dividing point and the second data dividing point themselves.
In one embodiment, this step uses the mapPartitionsWithIndex operation to compare and filter the data of each partition of the data set against the two-dimensional data set coordinates obtained in step S103, and determines the data from the first data dividing point to the second data dividing point. Spark's flatMap operation then generates a new data set, namely the data set corresponding to the target data segment, without moving the data in memory and without a shuffle operation. The generated data set is a subset of the data set of the time series data.
As shown in Fig. 3, if the 1st to 5th pieces of data in the data set are to be obtained, the dividing points are converted into the coordinates (1, 1) (i.e., the 1st piece of data of partition 1) and (3, 1) (i.e., the 1st piece of data of partition 3). Comparing the data of each partition with the coordinates of the dividing points yields all the data of partitions 1 and 2 and the 1st piece of data of partition 3, and these data are converted into a new data set (RDD), namely the data set corresponding to the target data segment. The new data set is not stored separately; it can be understood as an index added over the existing data, with the index distinguishing which data belong to it. For example, the original RDD covers data 1 to 8 and the new RDD covers data 1 to 5, while the data in the storage medium of each cluster node are neither changed nor moved across nodes.
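The per-partition comparison can be sketched in plain Python, with lists standing in for the partitions of Fig. 3. The function `keep_in_range` has the shape of a per-partition callback (a partition number plus an iterator over that partition's records), similar in spirit to what mapPartitionsWithIndex applies on each node. The names, the 1-based numbering and the list-based model are illustrative assumptions, not Spark API calls.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Coordinate = Tuple[int, int]  # (partition number, in-partition sequence number)


def keep_in_range(partition_no: int, records: Iterable[str],
                  first: Coordinate, last: Coordinate) -> Iterator[str]:
    """Yield the records of one partition whose coordinate lies in [first, last]."""
    if partition_no < first[0] or partition_no > last[0]:
        return  # the whole partition lies outside the target data segment
    for serial, record in enumerate(records, start=1):
        # Tuples compare lexicographically: partition first, then position.
        if first <= (partition_no, serial) <= last:
            yield record


def slice_partitions(partitions: Sequence[Sequence[str]],
                     first: Coordinate, last: Coordinate) -> List[List[str]]:
    """Filter every partition in place; no record changes partition."""
    return [list(keep_in_range(no, recs, first, last))
            for no, recs in enumerate(partitions, start=1)]


fig3 = [["1:A", "2:B"], ["3:C", "4:D"], ["5:E", "6:F"], ["7:G", "8:H"]]
first_part = slice_partitions(fig3, (1, 1), (3, 1))
second_part = slice_partitions(fig3, (3, 2), (4, 2))
```

Each partition is filtered using only its own number and its own records, which is why this style of selection needs no data movement between partitions.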
Given the nature of the RDD data set, reducing memory and network consumption in the Spark cluster is an important way to improve data processing. The method directly selects, as needed, the data held in each node's own memory (namely the data of each partition of the data set), requires no additional RDD conversion operation on the data set, avoids network transmission between the nodes of the Spark cluster, effectively reduces memory consumption, and lowers the hard requirement on memory size. If instead the records of the existing RDD were first numbered and then split by number, multiple operations on the RDD and a global numbering across nodes would be required, consuming more Spark cluster hardware resources.
Fig. 2 is a flowchart of data screening according to an embodiment of the present invention. As shown in Fig. 2, determining the data between the first data dividing point and the second data dividing point in the data set in step S104 may specifically include steps S201 and S202.
In step S201, two-dimensional coordinates of each data in the data set of the time-series data are determined, wherein two coordinates in the two-dimensional coordinates are used for representing a partition number of a partition where the data is located and a data sequence number of the data in the partition.
In an alternative embodiment of the present invention, the present invention may determine the two-dimensional coordinates of each data in each partition in the data set according to the method of step S103. In an alternative embodiment, the abscissa in the two-dimensional coordinates of the data is the partition number of the partition in which the data is located, and the ordinate is the data sequence number of the data in the partition.
Step S202, comparing the two-dimensional coordinates of each data with the two-dimensional data set coordinates, and determining the data between the first data slicing point and the second data slicing point in the data set.
In the embodiment of the invention, the two-dimensional coordinates of the data are directly compared with the two-dimensional data set coordinates of the first data dividing point and the second data dividing point, and the data from the first data dividing point to the second data dividing point in the data set are determined.
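Steps S201 and S202 can be illustrated with a short sketch: every datum is first tagged with its own two-dimensional coordinates, and the coordinates are then compared with those of the two dividing points. The helper names `with_coordinates` and `is_between` are hypothetical, and the list-of-lists model of the Fig. 3 partitions is an assumption for illustration.

```python
from typing import Iterator, Sequence, Tuple

Coordinate = Tuple[int, int]  # (partition number, in-partition sequence number)


def with_coordinates(partitions: Sequence[Sequence[str]]) -> Iterator[Tuple[Coordinate, str]]:
    """Step S201: pair every datum with its (partition, position) coordinates."""
    for partition_no, records in enumerate(partitions, start=1):
        for serial, record in enumerate(records, start=1):
            yield (partition_no, serial), record


def is_between(coord: Coordinate, first: Coordinate, last: Coordinate) -> bool:
    """Step S202: inclusive comparison, ordered by partition and then by position."""
    return first <= coord <= last


fig3 = [["1:A", "2:B"], ["3:C", "4:D"], ["5:E", "6:F"], ["7:G", "8:H"]]
tagged = list(with_coordinates(fig3))
segment = [record for coord, record in tagged if is_between(coord, (3, 2), (4, 2))]
```

Because partitions hold the data in their original order, ordering coordinates by partition number and then by in-partition position reproduces the original time order, so a simple inclusive range test is enough.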
The above embodiments show that the method of the invention makes up for the lack of an ordered splitting function in Apache Spark, realizes sequential splitting of data, and reduces the consumption of cluster memory and network resources. For example, the time series data used when building a machine learning model can be divided proportionally in time order by using the method of the invention.
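As one possible reading of the proportional division mentioned above, the sketch below turns a split ratio into the data sequence numbers of the dividing points for two consecutive target data segments. The function name, the `leading_fraction` parameter and the rounding rule (truncate, then keep at least one datum in each segment) are assumptions for illustration and are not specified in the text.

```python
from typing import Tuple

Segment = Tuple[int, int]  # (first sequence number, last sequence number), inclusive


def proportional_cut_points(total: int, leading_fraction: float) -> Tuple[Segment, Segment]:
    """Split sequence numbers 1..total into an earlier and a later segment."""
    if total < 2 or not 0.0 < leading_fraction < 1.0:
        raise ValueError("need at least two data and a fraction strictly between 0 and 1")
    boundary = int(total * leading_fraction)      # last datum of the earlier segment
    boundary = min(max(boundary, 1), total - 1)   # keep both segments non-empty
    return (1, boundary), (boundary + 1, total)


earlier, later = proportional_cut_points(8, 0.75)
```

The resulting sequence numbers play the role of the first and second data dividing points of each target data segment, and would then be converted into two-dimensional data set coordinates as in step S103.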
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Based on the same inventive concept, the embodiment of the present invention further provides a device for sequentially slicing data on a Spark platform, which may be used to implement the method for sequentially slicing data on a Spark platform described in the foregoing embodiment, as described in the following embodiments. Because the principle of the device for sequentially slicing data at the Spark platform to solve the problem is similar to that of the method for sequentially slicing data at the Spark platform, the embodiment of the device for sequentially slicing data at the Spark platform can be referred to the embodiment of the method for sequentially slicing data at the Spark platform, and the repetition is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a structural block diagram of a device for sequentially slicing data on a Spark platform according to an embodiment of the present invention. As shown in Fig. 4, the device includes: a data set distribution information acquisition unit 1, a slicing point information acquisition unit 2, a two-dimensional data set coordinate determining unit 3, and a data set generating unit 4.
The data set distribution information acquisition unit 1 is configured to acquire distribution information of a data set of time-series data on a Spark platform, where the distribution information includes the partition number of each partition of the data set and the amount of data in each partition.
The slicing point information acquisition unit 2 is configured to acquire a first data slicing point and a second data slicing point that are determined in advance according to a target data segment in the preset time-series data.
The two-dimensional data set coordinate determining unit 3 is configured to determine, according to the distribution information, the two-dimensional data set coordinates of the first data slicing point and of the second data slicing point respectively, where the first coordinate of the two-dimensional data set coordinates represents the partition number of the partition in which the data slicing point is located, and the second coordinate represents the data sequence number of the data slicing point within that partition.
The data set generating unit 4 is configured to determine, according to the two-dimensional data set coordinates, the data between the first data slicing point and the second data slicing point in the data set, and to generate a data set corresponding to the target data segment.
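As an illustration of how the distribution information can yield the two-dimensional data set coordinates, the following is a minimal sketch in plain Python rather than Spark code. It assumes that each slicing point is supplied as a 0-based global position in the ordered data set and that partitions are modelled as a list of lists; the function names `distribution_info` and `to_dataset_coordinates` are hypothetical and do not come from the patent.

```python
from bisect import bisect_right
from itertools import accumulate


def distribution_info(partitions):
    """Return [(partition_number, record_count), ...] for a partitioned data set."""
    return [(number, len(records)) for number, records in enumerate(partitions)]


def to_dataset_coordinates(global_index, info):
    """Map a 0-based global position to (partition_number, sequence_number_in_partition)."""
    counts = [count for _, count in info]
    ends = list(accumulate(counts))  # exclusive end position of each partition
    total = ends[-1] if ends else 0
    if not 0 <= global_index < total:
        raise IndexError("slicing point lies outside the data set")
    slot = bisect_right(ends, global_index)  # first partition whose end exceeds the index
    start = ends[slot] - counts[slot]        # global position of that partition's first record
    return info[slot][0], global_index - start


# Hypothetical data set: three partitions holding nine ordered records.
partitions = [["t0", "t1", "t2"], ["t3", "t4"], ["t5", "t6", "t7", "t8"]]
info = distribution_info(partitions)
first_cut = to_dataset_coordinates(2, info)   # position of "t2"
second_cut = to_dataset_coordinates(6, info)  # position of "t6"
```

Only the per-partition counts are needed to locate a slicing point, so no record has to be moved between partitions to compute the coordinates. On a real cluster the counts might be collected with an operation such as PySpark's `RDD.mapPartitionsWithIndex`; that mapping is an assumption, since the text does not name a specific API.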
In an alternative embodiment of the present invention, the data set generating unit 4 includes:
a data screening module, configured to compare the data stored in each partition of the data set of the time-series data with the two-dimensional data set coordinates, to obtain the data between the first data slicing point and the second data slicing point in the data set.
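The per-partition screening can be sketched as follows, again in plain Python with partitions modelled as lists. The sketch assumes that both slicing points are included in the result, consistent with the embodiment in which they are the first and last data of the target data segment; the function name `screen_partition` and the sample data are hypothetical.

```python
def screen_partition(number, records, first_cut, second_cut):
    """Keep the records of one partition that lie between the two slicing points (inclusive)."""
    (first_part, first_seq), (second_part, second_seq) = first_cut, second_cut
    if number < first_part or number > second_part:
        return []  # the whole partition lies outside the target segment
    lo = first_seq if number == first_part else 0
    hi = second_seq + 1 if number == second_part else len(records)
    return records[lo:hi]


# Hypothetical data set and slicing point coordinates (partition number, sequence number).
partitions = [["t0", "t1", "t2"], ["t3", "t4"], ["t5", "t6", "t7", "t8"]]
first_cut, second_cut = (0, 2), (2, 1)

target_segment = [
    record
    for number, records in enumerate(partitions)
    for record in screen_partition(number, records, first_cut, second_cut)
]
```

Each partition is screened using only its own number and contents, so in this sketch the work for different partitions is independent and could run in parallel.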
Fig. 5 is a block diagram of the data set generating unit according to an embodiment of the present invention. As shown in Fig. 5, in an alternative embodiment of the present invention, the data set generating unit 4 includes a two-dimensional coordinate determining module 401 and a comparison module 402.
The two-dimensional coordinate determining module 401 is configured to determine the two-dimensional coordinates of each datum in the data set of the time-series data, where the two coordinates of the two-dimensional coordinates represent the partition number of the partition in which the datum is located and the data sequence number of the datum within that partition.
The comparison module 402 is configured to compare the two-dimensional coordinates of each datum with the two-dimensional data set coordinates, and to determine the data between the first data slicing point and the second data slicing point in the data set.
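A minimal sketch of the two modules, assuming plain Python lists stand in for partitions and that the comparison is lexicographic on (partition number, sequence number) with both slicing points included; the names `with_coordinates` and `slice_between` are hypothetical.

```python
def with_coordinates(partitions):
    """Yield ((partition_number, sequence_number), record) for every record."""
    for number, records in enumerate(partitions):
        for sequence, record in enumerate(records):
            yield (number, sequence), record


def slice_between(partitions, first_cut, second_cut):
    """Return the records whose coordinates lie between the two slicing points (inclusive)."""
    # Python compares tuples lexicographically: partition number first,
    # then sequence number within the partition.
    return [
        record
        for coordinate, record in with_coordinates(partitions)
        if first_cut <= coordinate <= second_cut
    ]


# Hypothetical data set: three partitions holding nine ordered records.
partitions = [["t0", "t1", "t2"], ["t3", "t4"], ["t5", "t6", "t7", "t8"]]
segment = slice_between(partitions, (0, 2), (2, 1))
```

Because each record carries its own coordinates, the decision to keep or drop it depends only on that record and the two slicing point coordinates.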
To achieve the above object, according to another aspect of the present application, there is also provided a computer device. As shown in Fig. 6, the computer device includes a memory, a processor, a communication interface, and a communication bus, where a computer program executable on the processor is stored in the memory, and the processor implements the steps of the method of the above embodiments when executing the computer program.
The processor may be a central processing unit (Central Processing Unit, CPU). The processor may also be any other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and units, such as the program units corresponding to the above method embodiments of the invention. By running the non-transitory software programs, instructions, and modules stored in the memory, the processor executes its various functional applications and performs data processing, that is, implements the methods of the above method embodiments.
The memory may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created by the processor, and the like. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, and the remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more units are stored in the memory and, when executed by the processor, perform the method in the above embodiments.
The details of the computer device may be correspondingly understood by referring to the corresponding relevant descriptions and effects in the above embodiments, and will not be repeated here.
To achieve the above object, according to another aspect of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps of the method for sequentially slicing data on a Spark platform described above. It will be appreciated by those skilled in the art that all or part of the processes of the above method embodiments may be implemented by a computer program instructing related hardware, where the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a random access memory (Random Access Memory, RAM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), a solid-state drive (Solid-State Drive, SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; they may also be fabricated as separate integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A method for sequentially slicing data on a Spark platform, comprising:
acquiring distribution information of a data set of time-series data on a Spark platform, wherein the distribution information comprises the partition number of each partition of the data set and the amount of data in each partition, the data set is stored in a distributed manner on a plurality of nodes in a Spark cluster, and each node corresponds to one partition;
acquiring a first data slicing point and a second data slicing point that are determined in advance according to a target data segment in the preset time-series data;
determining, according to the distribution information, two-dimensional data set coordinates of the first data slicing point and of the second data slicing point respectively, wherein a first coordinate of the two-dimensional data set coordinates represents the partition number of the partition in which the data slicing point is located, and a second coordinate of the two-dimensional data set coordinates represents the data sequence number of the data slicing point within the partition;
determining data between the first data slicing point and the second data slicing point in the data set according to the two-dimensional data set coordinates, and generating a data set corresponding to the target data segment;
wherein the determining data between the first data slicing point and the second data slicing point in the data set according to the two-dimensional data set coordinates comprises:
determining two-dimensional coordinates of each datum in the data set of the time-series data, wherein the two coordinates of the two-dimensional coordinates represent the partition number of the partition in which the datum is located and the data sequence number of the datum within the partition;
comparing the two-dimensional coordinates of each datum with the two-dimensional data set coordinates, and determining the data between the first data slicing point and the second data slicing point in the data set.
2. The method of claim 1, wherein determining data between the first data slicing point and the second data slicing point in the data set according to the two-dimensional data set coordinates comprises:
comparing the data stored in each partition of the data set of the time-series data with the two-dimensional data set coordinates, to obtain the data between the first data slicing point and the second data slicing point in the data set.
3. The method of claim 1, wherein the first data slicing point is the first datum of the target data segment and the second data slicing point is the last datum of the target data segment.
4. A device for sequentially slicing data on a Spark platform, comprising:
a data set distribution information acquisition unit, configured to acquire distribution information of a data set of time-series data on a Spark platform, wherein the distribution information comprises the partition number of each partition of the data set and the amount of data in each partition, the data set is stored in a distributed manner on a plurality of nodes in a Spark cluster, and each node corresponds to one partition;
a slicing point information acquisition unit, configured to acquire a first data slicing point and a second data slicing point that are determined in advance according to a target data segment in the preset time-series data;
a two-dimensional data set coordinate determining unit, configured to determine, according to the distribution information, two-dimensional data set coordinates of the first data slicing point and of the second data slicing point respectively, wherein a first coordinate of the two-dimensional data set coordinates represents the partition number of the partition in which the data slicing point is located, and a second coordinate of the two-dimensional data set coordinates represents the data sequence number of the data slicing point within the partition;
a data set generating unit, configured to determine data between the first data slicing point and the second data slicing point in the data set according to the two-dimensional data set coordinates, and to generate a data set corresponding to the target data segment;
wherein the data set generating unit comprises:
a two-dimensional coordinate determining module, configured to determine two-dimensional coordinates of each datum in the data set of the time-series data, wherein the two coordinates of the two-dimensional coordinates represent the partition number of the partition in which the datum is located and the data sequence number of the datum within the partition;
a comparison module, configured to compare the two-dimensional coordinates of each datum with the two-dimensional data set coordinates, and to determine the data between the first data slicing point and the second data slicing point in the data set.
5. The device for sequentially slicing data on a Spark platform of claim 4, wherein the data set generating unit comprises:
a data screening module, configured to compare the data stored in each partition of the data set of the time-series data with the two-dimensional data set coordinates, to obtain the data between the first data slicing point and the second data slicing point in the data set.
6. The device for sequentially slicing data on a Spark platform of claim 4, wherein the first data slicing point is the first datum of the target data segment and the second data slicing point is the last datum of the target data segment.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 3 when executing the computer program.
8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed in a computer processor implements the method of any one of claims 1 to 3.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010540731.0A CN111737347B (en) 2020-06-15 2020-06-15 Method and device for sequentially segmenting data on Spark platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010540731.0A CN111737347B (en) 2020-06-15 2020-06-15 Method and device for sequentially segmenting data on Spark platform

Publications (2)

Publication Number Publication Date
CN111737347A CN111737347A (en) 2020-10-02
CN111737347B true CN111737347B (en) 2024-02-13

Family

ID=72649141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010540731.0A Active CN111737347B (en) 2020-06-15 2020-06-15 Method and device for sequentially segmenting data on Spark platform

Country Status (1)

Country Link
CN (1) CN111737347B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112817976B (en) * 2021-01-26 2024-04-05 广州欢网科技有限责任公司 ID generation method, system, computer and readable instruction storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007036839A (en) * 2005-07-28 2007-02-08 Nippon Telegr & Teleph Corp <Ntt> Apparatus, system, and method for dividing quality deterioration in packet exchange network
CN106682116A (en) * 2016-12-08 2017-05-17 重庆邮电大学 OPTICS point sorting clustering method based on Spark memory computing big data platform
CA2981555A1 (en) * 2016-10-07 2018-04-07 1Qb Information Technologies Inc. System and method for displaying data representative of a large dataset
CN108256088A (en) * 2018-01-23 2018-07-06 清华大学 A kind of storage method and system of the time series data based on key value database
CN109325026A (en) * 2018-08-14 2019-02-12 中国平安人寿保险股份有限公司 Data processing method, device, equipment and medium based on big data platform
CN109656980A (en) * 2018-12-27 2019-04-19 Oppo(重庆)智能科技有限公司 Data processing method, electronic equipment, device and readable storage medium storing program for executing
CN110765154A (en) * 2019-10-16 2020-02-07 华电莱州发电有限公司 Method and device for processing mass real-time generated data of thermal power plant
CN111159235A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Data pre-partition method and device, electronic equipment and readable storage medium
CN111259117A (en) * 2020-01-16 2020-06-09 广州拉卡拉信息技术有限公司 Short text batch matching method and device

Also Published As

Publication number Publication date
CN111737347A (en) 2020-10-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant