WO2019169619A1 - Method and device for partitioning big data into random sample data sub-blocks - Google Patents

Method and device for partitioning big data into random sample data sub-blocks

Info

Publication number
WO2019169619A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sub
block
blocks
pieces
Prior art date
Application number
PCT/CN2018/078509
Other languages
English (en)
French (fr)
Inventor
黄哲学
何玉林
张晓亮
魏承昊
朱胡飞
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Priority to PCT/CN2018/078509
Publication of WO2019169619A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5017 Task decomposition

Definitions

  • the invention belongs to the technical field of big data processing, and in particular relates to a method and device for partitioning big data into random sample data sub-blocks.
  • big data processing systems such as Hadoop and Spark take a divide-and-conquer (i.e., block-wise) approach: the original big data block is cut into several small data blocks for storage, which are then processed by the corresponding computing clusters, so that the analysis task over the big data as a whole becomes multiple subtasks that can be processed in parallel.
  • the existing block-wise processing method does not consider the probability distribution of these data blocks and usually cuts a big data block sequentially into multiple data sub-blocks; a sub-block obtained by sequential cutting is not guaranteed to be a random sample of the entire big data block, so estimating the statistical properties of the entire big data, or performing data analysis, directly on such a sub-block gives biased results.
  • the traditional random sampling method scans the entire big data block each time to obtain one random sample data sub-block; as big data blocks grow larger, the efficiency of this strategy drops sharply.
  • efficiently partitioning a big data block into multiple data sub-blocks, each of which is a random sample of the entire big data block, has become a fundamental problem in big data analysis.
  • with random sample data sub-blocks we can perform statistical sampling, that is, approximate information about the big data as a whole by processing only some of the small sub-blocks.
  • the invention provides a method and device for partitioning big data into random sample data sub-blocks, and aims to propose an efficient random sample data partitioning technique that represents a big data block as a series of non-overlapping data sub-blocks, each of which is itself a random sample of the entire big data block.
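  • the present invention provides a method for partitioning big data into random sample data sub-blocks, the method comprising:
  • Step S1: cutting a big data block to obtain P original data sub-blocks;
  • Step S2: randomly taking several pieces of data from each of the P original data sub-blocks, and combining the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; repeating this step K times in total to obtain K random sample data sub-blocks.
  • step S1 is specifically: sequentially and uniformly cutting a big data block to obtain P original data sub-blocks, wherein each original data sub-block contains n pieces of data.
  • in step S2, the several pieces of data are b pieces of data and K = n/b, where n is the number of pieces of data contained in each original data sub-block and b is the number of pieces of data taken from each original data sub-block each time.
  • in step S2, the several pieces of data are b pieces of data with b = n/P, where n is the number of pieces of data contained in each original data sub-block and P is the number of original data sub-blocks; K = P.
  • the big data block is stored in a Hadoop Distributed File System (HDFS).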
  • the present invention also provides a device for partitioning big data into random sample data sub-blocks, the device comprising:
  • a cutting module, configured to cut a big data block to obtain P original data sub-blocks;
  • a random sampling module, configured to randomly take several pieces of data from each of the P original data sub-blocks, and to combine the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks.
  • the cutting module is specifically configured to: sequentially and uniformly cut a big data block to obtain P original data sub-blocks, wherein each original data sub-block contains n pieces of data.
  • in the random sampling module, the several pieces of data are b pieces of data and either K = n/b, or b = n/P and K = P, where n is the number of pieces of data contained in each original data sub-block, b is the number of pieces of data taken from each original data sub-block each time, and P is the number of original data sub-blocks.
  • the big data block is stored in a Hadoop Distributed File System (HDFS).
  • compared with the prior art, the invention has the following beneficial effects: the method and device first cut a big data block to obtain P original data sub-blocks; then randomly take several pieces of data from each of the P original data sub-blocks and combine the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; the extract-and-combine operation is repeated K times, yielding K random sample data sub-blocks in total.
  • the invention cuts the data into blocks first and randomizes the data afterwards, which ensures that each resulting random sample data sub-block is a random sample of the entire big data block; moreover, obtaining each random sample data sub-block does not require traversing the entire big data block, which greatly improves efficiency.
  • FIG. 1 is a schematic flowchart of a method for partitioning big data into random sample data sub-blocks according to an embodiment of the present invention;
  • FIG. 2 is a schematic flowchart of another method for partitioning big data into random sample data sub-blocks according to an embodiment of the present invention;
  • FIG. 3 is a schematic flowchart of another method for partitioning big data into random sample data sub-blocks according to an embodiment of the present invention;
  • FIG. 4 is a schematic block diagram of a device for partitioning big data into random sample data sub-blocks according to an embodiment of the present invention.
  • the present invention provides a method and device for partitioning big data into random sample data sub-blocks.
  • the method first cuts the big data into blocks and then randomizes the data to obtain new random sample data sub-blocks.
  • each new random sample data sub-block is itself a random sample of the entire big data block; the method has a large advantage in execution efficiency because it does not need to traverse the entire big data block.
  • a specific embodiment of a method for partitioning big data into random sample data sub-blocks is described below; as shown in FIG. 1, it includes:
  • Step S101: cutting a big data block to obtain P original data sub-blocks;
  • in practice the cutting is usually sequential, which is a very common operation in existing big data processing systems; however, embodiments of the present invention are not limited to sequential cutting, and other cutting methods may be used.
  • for example, suppose a big data block has 100 pieces of data, numbered 1 to 100, and must be cut into 10 original data sub-blocks. With sequential cutting, 1 to 10 form one block, 11 to 20 form another, and so on. The block can also be cut into 10 original data sub-blocks in other ways, for example with 1, 11, 21, 31, ..., 91 as one block, 2, 12, 22, 32, ..., 92 as another, and so on.
  • the big data block D is usually a distributed data set, for example one that is split into P data sub-blocks by a distributed file system such as the Hadoop Distributed File System (HDFS) and distributed across a computing cluster.
  • Step S102: randomly taking several pieces of data from each of the P original data sub-blocks, and combining the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks.
  • the data in the original data sub-blocks in this embodiment are records; the number of records taken from each original data sub-block is not necessarily the same, and records that have been taken are not replaced with other records (that is, sampling is without replacement).
  • each of the K random sample data sub-blocks obtained is a random sample of the entire big data block D.
  • the method for partitioning big data into random sample data sub-blocks provided by this embodiment generates a random sample data partition (RSDP: Random Sample Data Partition) on a computing cluster, representing the big data block D as a series of non-overlapping random sample data sub-blocks, each of which is itself a random sample of the entire big data block D; in particular, obtaining each random sample data sub-block does not require traversing the entire big data block, which greatly improves efficiency.
  • another specific embodiment of a method for partitioning big data into random sample data sub-blocks is described below; as shown in FIG. 2, it includes:
  • Step S201: sequentially and uniformly cutting a big data block to obtain P original data sub-blocks, wherein each original data sub-block contains n pieces of data;
  • the data in the original data sub-blocks in this embodiment are records; because the cutting is uniform, every original data sub-block contains the same number of records, namely n.
  • Step S202: randomly taking b pieces of data from each of the P original data sub-blocks, and combining the b pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks, where K = n/b.
  • the same number b of records is randomly taken from each original data sub-block, and the b records taken from each original data sub-block are combined to obtain a new random sample data sub-block.
  • each new random sample data sub-block obtained in this way is a random sample of the entire big data block D.
  • K = n/b, where n is the number of records contained in each original data sub-block and b is the number of records taken from each original data sub-block each time; extracting and combining n/b times in total yields n/b random sample data sub-blocks.
  • by cutting uniformly and then extracting and combining uniformly, the method of this embodiment obtains a series of non-overlapping random sample data sub-blocks; on the one hand, each resulting data sub-block is guaranteed to be a random sample of the entire big data block; on the other hand, the method has a large advantage in execution efficiency because it does not need to traverse the entire big data block.
  • another specific embodiment of a method for partitioning big data into random sample data sub-blocks is described below; as shown in FIG. 3, it includes:
  • Step S301: sequentially and uniformly cutting a big data block to obtain P original data sub-blocks, wherein each original data sub-block contains n pieces of data;
  • the data in the original data sub-blocks in this embodiment are records; because the cutting is uniform, every original data sub-block contains the same number of records, namely n.
  • Step S302: randomly taking b pieces of data from each of the P original data sub-blocks, and combining the b pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks, where b = n/P.
  • here n is the number of pieces of data contained in each original data sub-block and P is the number of original data sub-blocks.
  • the same number n/P of records is randomly taken from each original data sub-block, and the n/P records taken from each original data sub-block are combined to obtain a new random sample data sub-block; each new random sample data sub-block obtained in this way is a random sample of the entire big data block D.
  • K = P; extracting and combining P times in total yields P random sample data sub-blocks.
  • the number of random sample data sub-blocks obtained in this way equals the number of original data sub-blocks obtained by cutting, and each random sample data sub-block contains the same number of records as an original data sub-block.
  • the method of this embodiment is an efficient random sample data partitioning technique that divides the big data block D into K random sample data sub-blocks, each containing n records; each resulting random sample data sub-block is guaranteed to be a random sample of the entire big data block, and the method has a large advantage in execution efficiency because it does not need to traverse the entire big data block.
  • a specific embodiment of a device for partitioning big data into random sample data sub-blocks is described below; as shown in FIG. 4, it includes:
  • the cutting module 401, configured to cut a big data block to obtain P original data sub-blocks;
  • the random sampling module 402, configured to randomly take several pieces of data from each of the P original data sub-blocks, and to combine the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks.
  • for details of the device, refer to the method for partitioning big data into random sample data sub-blocks described in the embodiments shown in FIGS. 1 to 3; they are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

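A method for partitioning big data into random sample data sub-blocks, applicable to the technical field of big data processing, comprising: cutting a big data block to obtain P original data sub-blocks (S101); randomly taking several pieces of data from each of the P original data sub-blocks and combining the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; and repeating the extract-and-combine operation K times in total to obtain K random sample data sub-blocks (S102). The partitioning method ensures that each resulting random sample data sub-block is a random sample of the entire big data block; moreover, obtaining each random sample data sub-block does not require traversing the entire big data block, which greatly improves efficiency.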

Description

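Method and device for partitioning big data into random sample data sub-blocks
The invention belongs to the technical field of big data processing, and in particular relates to a method and device for partitioning big data into random sample data sub-blocks.
For data processing tasks, conventional data analysis processes all of the data directly; however, as the data volume grows, processing all of the data directly becomes technically infeasible.
Therefore, on the one hand, big data processing systems such as Hadoop and Spark take a divide-and-conquer (that is, block-wise) approach: the original big data block is cut into several small data blocks for storage, which are then processed by the corresponding computing clusters, so that the analysis task over the big data as a whole becomes multiple subtasks that can be processed in parallel. However, this existing block-wise processing method does not take the probability distribution of the data blocks into account. It usually cuts a big data block sequentially into multiple data sub-blocks, and a sub-block obtained by sequential cutting is not guaranteed to be a random sample of the entire big data block. Accordingly, estimating the statistical properties of the entire big data, or performing data analysis, directly on such a sub-block yields biased results.
On the other hand, the traditional random sampling method scans the entire big data block each time to obtain one random sample data sub-block. With this approach, every random sample data sub-block requires a full scan of the big data block, and as big data blocks grow larger the efficiency of this strategy drops sharply.
Efficiently partitioning a big data block into multiple data sub-blocks, each of which is a random sample of the entire big data block, has therefore become a fundamental problem in big data analysis. With random sample data sub-blocks, we can perform statistical sampling, that is, approximate information about the big data as a whole by processing only some of the small sub-blocks.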
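The bias described above can be illustrated with a small sketch. It assumes toy data (the integers 0 to 999 stored in sorted order) and hypothetical helper names (`sequential_blocks`, `one_random_sample_block`) that do not come from the patent; it only compares the mean estimated from one sequentially cut sub-block with the mean estimated from one sub-block assembled by drawing from every original sub-block.

```python
import random
from statistics import mean


def sequential_blocks(records, num_blocks):
    """Cut records into num_blocks contiguous, equally sized blocks."""
    size = len(records) // num_blocks
    return [records[i * size:(i + 1) * size] for i in range(num_blocks)]


def one_random_sample_block(blocks, per_block, rng):
    """Draw per_block records without replacement from every block and combine them."""
    sample = []
    for block in blocks:
        sample.extend(rng.sample(block, per_block))
    return sample


# Toy data stored in sorted order, so position and value are strongly correlated.
records = list(range(1000))
blocks = sequential_blocks(records, 10)

rng = random.Random(0)
sample_block = one_random_sample_block(blocks, 10, rng)

true_mean = mean(records)            # mean of the whole data set
seq_estimate = mean(blocks[0])       # estimate from the first sequential sub-block
rsp_estimate = mean(sample_block)    # estimate from one random sample sub-block

print(true_mean, seq_estimate, rsp_estimate)
```

On this toy data the first sequential sub-block only ever sees the smallest values, whereas the combined sub-block takes values from every part of the data, so its mean stays close to the mean of the whole data set.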
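Summary of the Invention
The invention provides a method and device for partitioning big data into random sample data sub-blocks, and aims to propose an efficient random sample data partitioning technique that represents a big data block as a series of non-overlapping data sub-blocks, each of which is itself a random sample of the entire big data block.
The invention provides a method for partitioning big data into random sample data sub-blocks, the method comprising:
Step S1: cutting a big data block to obtain P original data sub-blocks;
Step S2: randomly taking several pieces of data from each of the P original data sub-blocks, and combining the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; repeating this step K times in total to obtain K random sample data sub-blocks.
Further, step S1 is specifically: sequentially and uniformly cutting a big data block to obtain P original data sub-blocks, where each original data sub-block contains n pieces of data.
Further, in step S2, the several pieces of data are b pieces of data and K = n/b, where n is the number of pieces of data contained in each original data sub-block and b is the number of pieces of data taken from each original data sub-block each time.
Further, in step S2, the several pieces of data are b pieces of data; b = n/P, where n is the number of pieces of data contained in each original data sub-block and P is the number of original data sub-blocks; K = P.
Further, the big data block is stored in a Hadoop Distributed File System (HDFS).
The invention also provides a device for partitioning big data into random sample data sub-blocks, the device comprising:
a cutting module, configured to cut a big data block to obtain P original data sub-blocks;
a random sampling module, configured to randomly take several pieces of data from each of the P original data sub-blocks, and to combine the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks.
Further, the cutting module is specifically configured to: sequentially and uniformly cut a big data block to obtain P original data sub-blocks, where each original data sub-block contains n pieces of data.
Further, in the random sampling module, the several pieces of data are b pieces of data and K = n/b, where n is the number of pieces of data contained in each original data sub-block and b is the number of pieces of data taken from each original data sub-block each time.
Further, in the random sampling module, the several pieces of data are b pieces of data; b = n/P, where n is the number of pieces of data contained in each original data sub-block and P is the number of original data sub-blocks; K = P.
Further, the big data block is stored in a Hadoop Distributed File System (HDFS).
Compared with the prior art, the invention has the following beneficial effects. The method and device first cut a big data block to obtain P original data sub-blocks; then randomly take several pieces of data from each of the P original data sub-blocks and combine the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; the extract-and-combine operation is repeated K times, yielding K random sample data sub-blocks in total. Because the invention cuts the data into blocks first and randomizes the data afterwards, each resulting random sample data sub-block is guaranteed to be a random sample of the entire big data block; moreover, obtaining each random sample data sub-block does not require traversing the entire big data block, which greatly improves efficiency.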
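Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for partitioning big data into random sample data sub-blocks according to an embodiment of the invention;
FIG. 2 is a schematic flowchart of another method for partitioning big data into random sample data sub-blocks according to an embodiment of the invention;
FIG. 3 is a schematic flowchart of another method for partitioning big data into random sample data sub-blocks according to an embodiment of the invention;
FIG. 4 is a schematic block diagram of a device for partitioning big data into random sample data sub-blocks according to an embodiment of the invention.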
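Detailed Description
To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
In the prior art, on the one hand, a data sub-block is not guaranteed to be a random sample of the entire big data block; on the other hand, obtaining each random sample data sub-block requires scanning the entire big data block, which leads to low efficiency.
To solve these technical problems, the invention proposes a method and device for partitioning big data into random sample data sub-blocks. The method first cuts the big data into blocks and then randomizes the data to obtain new random sample data sub-blocks. Each new random sample data sub-block is itself a random sample of the entire big data block, and the method has a large advantage in execution efficiency because it does not need to traverse the entire big data block.
A specific embodiment of a method for partitioning big data into random sample data sub-blocks is described below. As shown in FIG. 1, it includes:
Step S101: cutting a big data block to obtain P original data sub-blocks;
Specifically, in practice the cutting is usually sequential, which is a very common operation in existing big data processing systems; however, embodiments of the invention are not limited to sequential cutting, and other cutting methods may be used. For example, suppose a big data block has 100 pieces of data, numbered 1 to 100, and must be cut into 10 original data sub-blocks. With sequential cutting, 1 to 10 form one block, 11 to 20 form another, and so on. The block can also be cut into 10 original data sub-blocks in other ways, for example with 1, 11, 21, 31, ..., 91 as one block, 2, 12, 22, 32, ..., 92 as another, and so on.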
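The two cutting schemes in the 100-record example can be sketched as follows. The function names `cut_sequential` and `cut_strided` are hypothetical and chosen only for illustration, and the sketch assumes the number of records is an exact multiple of the number of sub-blocks.

```python
def cut_sequential(records, num_blocks):
    """Contiguous cut: records 1..10 form one block, 11..20 the next, and so on."""
    size = len(records) // num_blocks
    return [records[i * size:(i + 1) * size] for i in range(num_blocks)]


def cut_strided(records, num_blocks):
    """Strided cut: every num_blocks-th record goes to the same block."""
    return [records[i::num_blocks] for i in range(num_blocks)]


records = list(range(1, 101))        # 100 records numbered 1 to 100
sequential = cut_sequential(records, 10)
strided = cut_strided(records, 10)

print(sequential[0])   # records 1 to 10
print(strided[0])      # records 1, 11, 21, ..., 91
```

Either cut produces 10 original data sub-blocks of 10 records each; the later random extraction step does not depend on which cut was used.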
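The big data block D is usually a distributed data set, for example one that is split into P data sub-blocks by a distributed file system such as the Hadoop Distributed File System (HDFS) and distributed across a computing cluster.
Step S102: randomly taking several pieces of data from each of the P original data sub-blocks, and combining the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks.
Specifically, the data in the original data sub-blocks in this embodiment are records. The number of records taken from each original data sub-block is not necessarily the same, and records that have been taken are not replaced with other records (that is, sampling is without replacement). Combining the records taken from each original data sub-block yields a new random sample data sub-block, and this new random sample data sub-block is a random sample of the entire big data block D.
Specifically, each of the K random sample data sub-blocks obtained is a random sample of the entire big data block D.
The method for partitioning big data into random sample data sub-blocks provided by this embodiment generates a random sample data partition (RSDP: Random Sample Data Partition) on a computing cluster, representing the big data block D as a series of non-overlapping random sample data sub-blocks, each of which is itself a random sample of the entire big data block D. In particular, obtaining each random sample data sub-block in this embodiment does not require traversing the entire big data block, which greatly improves efficiency.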
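A minimal single-machine sketch of steps S101 and S102 is given below, assuming the original sub-blocks are already available as in-memory lists. The function name `random_sample_data_partition` and the `draws_per_round` table (how many records each round takes from each original sub-block) are hypothetical; the source only states that the counts need not be equal and that sampling is without replacement. Shuffling each sub-block once and then popping records is one possible way to realize that, not necessarily the authors' implementation.

```python
import random


def random_sample_data_partition(blocks, draws_per_round, rng):
    """Build K random sample sub-blocks from P original sub-blocks.

    blocks: list of P original sub-blocks (each a list of records).
    draws_per_round: K rows of P counts; row k, column p is the number of
        records that round k takes from original sub-block p.
    """
    # Shuffle a copy of each original sub-block once. Popping from the end of a
    # shuffled list is then a uniform draw without replacement.
    pools = []
    for block in blocks:
        pool = list(block)
        rng.shuffle(pool)
        pools.append(pool)

    sample_blocks = []
    for counts in draws_per_round:
        new_block = []
        for pool, count in zip(pools, counts):
            if count > len(pool):
                raise ValueError("not enough records left in an original sub-block")
            for _ in range(count):
                new_block.append(pool.pop())
        sample_blocks.append(new_block)
    return sample_blocks


# P = 3 original sub-blocks of 6 records each, K = 3 rounds with unequal counts.
blocks = [list(range(0, 6)), list(range(6, 12)), list(range(12, 18))]
draws = [[1, 2, 3], [2, 2, 2], [3, 2, 1]]
sample_blocks = random_sample_data_partition(blocks, draws, random.Random(42))
```

Because records are removed from each pool as they are drawn, the resulting sub-blocks do not overlap, and each round touches only the records it takes rather than the whole data set.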
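Another specific embodiment of a method for partitioning big data into random sample data sub-blocks is described below. As shown in FIG. 2, it includes:
Step S201: sequentially and uniformly cutting a big data block to obtain P original data sub-blocks, where each original data sub-block contains n pieces of data;
Specifically, the data in the original data sub-blocks in this embodiment are records. Because the cutting is uniform, every original data sub-block contains the same number of records, namely n.
Step S202: randomly taking b pieces of data from each of the P original data sub-blocks, and combining the b pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks, where K = n/b.
Specifically, the same number b of records is randomly taken from each original data sub-block, and the b records taken from each original data sub-block are combined to obtain a new random sample data sub-block. This new random sample data sub-block is a random sample of the entire big data block D.
Specifically, K = n/b, where n is the number of records contained in each original data sub-block and b is the number of records taken from each original data sub-block each time; extracting and combining n/b times in total yields n/b random sample data sub-blocks.
By cutting uniformly and then extracting and combining uniformly, the method of this embodiment obtains a series of non-overlapping random sample data sub-blocks. On the one hand, each resulting data sub-block is guaranteed to be a random sample of the entire big data block; on the other hand, the method has a large advantage in execution efficiency because it does not need to traverse the entire big data block.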
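The uniform variant (steps S201 and S202 with K = n/b) can be sketched as below. The function name `uniform_rsdp` is hypothetical, and permuting each original sub-block once and slicing it into chunks of b records is an implementation assumption: it yields the same result as K successive draws of b records without replacement, but the source does not prescribe it.

```python
import random


def uniform_rsdp(records, num_blocks, b, rng):
    """Cut records into num_blocks sub-blocks of n records, then build K = n / b
    random sample sub-blocks, each holding b records from every original sub-block."""
    n, remainder = divmod(len(records), num_blocks)
    if remainder != 0 or n % b != 0:
        raise ValueError("need len(records) divisible by num_blocks and n divisible by b")

    # Step S201: sequential, uniform cut into P = num_blocks original sub-blocks.
    originals = [records[p * n:(p + 1) * n] for p in range(num_blocks)]

    # Randomly permute each original sub-block once (rng.sample with k = n).
    shuffled = [rng.sample(block, n) for block in originals]

    # Step S202: round k takes the k-th chunk of b records from every sub-block.
    num_rounds = n // b
    return [
        [record for block in shuffled for record in block[k * b:(k + 1) * b]]
        for k in range(num_rounds)
    ]


records = list(range(24))            # P = 4, n = 6
sample_blocks = uniform_rsdp(records, num_blocks=4, b=2, rng=random.Random(7))
print(len(sample_blocks))            # K = n / b = 3
```

Each of the K sub-blocks holds P * b records, and together they cover every input record exactly once.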
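Another specific embodiment of a method for partitioning big data into random sample data sub-blocks is described below. As shown in FIG. 3, it includes:
Step S301: sequentially and uniformly cutting a big data block to obtain P original data sub-blocks, where each original data sub-block contains n pieces of data;
Specifically, the data in the original data sub-blocks in this embodiment are records. Because the cutting is uniform, every original data sub-block contains the same number of records, namely n.
Step S302: randomly taking b pieces of data from each of the P original data sub-blocks, and combining the b pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks, where b = n/P.
Specifically, b = n/P, where n is the number of pieces of data contained in each original data sub-block and P is the number of original data sub-blocks.
Specifically, the same number n/P of records is randomly taken from each original data sub-block, and the n/P records taken from each original data sub-block are combined to obtain a new random sample data sub-block. This new random sample data sub-block is a random sample of the entire big data block D.
Specifically, K = P; extracting and combining P times in total yields P random sample data sub-blocks. The number of random sample data sub-blocks obtained in this way equals the number of original data sub-blocks obtained by cutting, and each random sample data sub-block contains the same number of records as an original data sub-block.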
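The counts in this embodiment can be checked with a short calculation. The symbol $N$ for the total number of records in $D$ is notation introduced here for illustration, and the calculation assumes that $P$ divides $n$ exactly; it is consistent with $K = n/b$ from the previous embodiment.

```latex
\begin{align*}
N &= P \cdot n && \text{(total records in the big data block } D\text{)} \\
b &= \frac{n}{P} && \text{(records taken from each original sub-block per round)} \\
P \cdot b &= P \cdot \frac{n}{P} = n && \text{(records in each random sample data sub-block)} \\
K &= \frac{n}{b} = \frac{n}{n/P} = P && \text{(number of random sample data sub-blocks)} \\
K \cdot P \cdot b &= P \cdot n = N && \text{(every record is used exactly once)}
\end{align*}
```

For example, with $P = 10$ and $n = 100$, each round takes $b = 10$ records from every original sub-block, each random sample data sub-block holds $100$ records, and $K = 10$ rounds use all $1000$ records.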
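The method of this embodiment is an efficient random sample data partitioning technique that divides the big data block D into K random sample data sub-blocks, each containing n records. Each resulting random sample data sub-block is guaranteed to be a random sample of the entire big data block, and the method has a large advantage in execution efficiency because it does not need to traverse the entire big data block.
A specific embodiment of a device for partitioning big data into random sample data sub-blocks is described below. As shown in FIG. 4, it includes:
a cutting module 401, configured to cut a big data block to obtain P original data sub-blocks;
a random sampling module 402, configured to randomly take several pieces of data from each of the P original data sub-blocks, and to combine the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks.
It should be noted that, for details of the device for partitioning big data into random sample data sub-blocks, refer to the method described in the embodiments shown in FIGS. 1 to 3; they are not repeated here.
The above are merely preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.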

Claims (10)

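  1. A method for partitioning big data into random sample data sub-blocks, characterized in that the method comprises:
    Step S1: cutting a big data block to obtain P original data sub-blocks;
    Step S2: randomly taking several pieces of data from each of the P original data sub-blocks, and combining the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; repeating this step K times in total to obtain K random sample data sub-blocks.
  2. The method for partitioning big data into random sample data sub-blocks according to claim 1, characterized in that step S1 is specifically: sequentially and uniformly cutting a big data block to obtain P original data sub-blocks, wherein each original data sub-block contains n pieces of data.
  3. The method for partitioning big data into random sample data sub-blocks according to claim 2, characterized in that, in step S2, the several pieces of data are b pieces of data and K = n/b, where n is the number of pieces of data contained in each original data sub-block and b is the number of pieces of data taken from each original data sub-block each time.
  4. The method for partitioning big data into random sample data sub-blocks according to claim 2, characterized in that, in step S2, the several pieces of data are b pieces of data; b = n/P, where n is the number of pieces of data contained in each original data sub-block and P is the number of original data sub-blocks; K = P.
  5. The method for partitioning big data into random sample data sub-blocks according to any one of claims 1 to 4, characterized in that the big data block is stored in a Hadoop Distributed File System (HDFS).
  6. A device for partitioning big data into random sample data sub-blocks, characterized in that the device comprises:
    a cutting module, configured to cut a big data block to obtain P original data sub-blocks;
    a random sampling module, configured to randomly take several pieces of data from each of the P original data sub-blocks, and to combine the pieces of data taken from each original data sub-block to generate a new random sample data sub-block; this operation is repeated K times in total to obtain K random sample data sub-blocks.
  7. The device for partitioning big data into random sample data sub-blocks according to claim 6, characterized in that the cutting module is specifically configured to: sequentially and uniformly cut a big data block to obtain P original data sub-blocks, wherein each original data sub-block contains n pieces of data.
  8. The device for partitioning big data into random sample data sub-blocks according to claim 7, characterized in that, in the random sampling module, the several pieces of data are b pieces of data and K = n/b, where n is the number of pieces of data contained in each original data sub-block and b is the number of pieces of data taken from each original data sub-block each time.
  9. The device for partitioning big data into random sample data sub-blocks according to claim 7, characterized in that, in the random sampling module, the several pieces of data are b pieces of data; b = n/P, where n is the number of pieces of data contained in each original data sub-block and P is the number of original data sub-blocks; K = P.
  10. The device for partitioning big data into random sample data sub-blocks according to any one of claims 6 to 9, characterized in that the big data block is stored in a Hadoop Distributed File System (HDFS).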
PCT/CN2018/078509 2018-03-09 2018-03-09 Method and device for partitioning big data into random sample data sub-blocks WO2019169619A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/078509 WO2019169619A1 (zh) 2018-03-09 2018-03-09 Method and device for partitioning big data into random sample data sub-blocks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/078509 WO2019169619A1 (zh) 2018-03-09 2018-03-09 Method and device for partitioning big data into random sample data sub-blocks

Publications (1)

Publication Number Publication Date
WO2019169619A1 true WO2019169619A1 (zh) 2019-09-12

Family

ID=67846877

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/078509 WO2019169619A1 (zh) 2018-03-09 2018-03-09 Method and device for partitioning big data into random sample data sub-blocks

Country Status (1)

Country Link
WO (1) WO2019169619A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421354A (zh) * 2023-12-19 2024-01-19 国家卫星海洋应用中心 Statistical method, apparatus, and device for satellite remote sensing big data sets

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5613091A (en) * 1992-12-22 1997-03-18 Sony Corporation Data compression
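CN102750309A (zh) * 2012-03-19 2012-10-24 南京大学 Hadoop-based parallelized SVM solving method
CN103336844A (zh) * 2013-07-22 2013-10-02 广西师范大学 Big data RD segmentation method
CN103473255A (zh) * 2013-06-06 2013-12-25 中国科学院深圳先进技术研究院 Data clustering method, system, and data processing device
CN105303456A (zh) * 2015-10-16 2016-02-03 国家电网公司 Method for processing monitoring data of power transmission equipment
CN106021567A (zh) * 2016-05-31 2016-10-12 中国农业大学 Hadoop-based method and system for partitioning massive vector data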

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
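CN117421354A (zh) * 2023-12-19 2024-01-19 国家卫星海洋应用中心 Statistical method, apparatus, and device for satellite remote sensing big data sets
CN117421354B (zh) * 2023-12-19 2024-03-19 国家卫星海洋应用中心 Statistical method, apparatus, and device for satellite remote sensing big data sets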

Similar Documents

Publication Publication Date Title
US11200258B2 (en) Systems and methods for fast and effective grouping of stream of information into cloud storage files
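WO2020073687A1 (zh) Streaming data column storage method, apparatus, device, and storage medium
US20180248934A1 (en) Method and System for a Scheduled Map Executor
CN113360554B (zh) Method and device for data extraction, transformation, and loading (ETL)
CN104615736A (zh) Database-based method for fast parsing and storage of big data
CN106339484A (zh) System and method for intelligent video retrieval processing
CN105763886A (zh) Distributed transcoding method and apparatus
CN108132986B (zh) Method for fast processing of massive aircraft sensor test data
US20140108323A1 (en) Compressively-accelerated read mapping
JP6313864B2 (ja) Stream data processing method and stream data processing apparatus
CN106990914B (zh) Data deletion method and apparatus
CN107004022B (zh) Data segmentation and transformation method and apparatus
WO2019169619A1 (zh) Method and device for partitioning big data into random sample data sub-blocks
CN108491476A (zh) Method and device for partitioning big data into random sample data sub-blocks
Barbuzzi et al. Parallel bulk Insertion for large-scale analytics applications
WO2023124135A1 (zh) Feature retrieval method and apparatus, electronic device, computer storage medium, and program
CN104484174B (zh) Method and apparatus for processing compressed files in RAR format
CN111104384A (zh) Data preprocessing method, apparatus, device, and storage medium
CN110704407A (zh) Data deduplication method and system
CN108121745B (zh) Data loading method and apparatus
WO2015143708A1 (zh) Method and apparatus for constructing a suffix array
CN107544090B (zh) MapReduce-based method for parsing and storing seismic data
CN109597795B (zh) Efficient processing system for roadbed compaction construction data
US9275168B2 (en) Hardware projection of fixed and variable length columns of database tables
CN106777262B (zh) Quality filtering method and filtering apparatus for high-throughput sequencing data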

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.01.2021)

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18908428

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 18908428

Country of ref document: EP

Kind code of ref document: A1