CN110308998B - Mass data sampling method and device - Google Patents

Mass data sampling method and device

Info

Publication number
CN110308998B
CN110308998B · CN201910625106.3A
Authority
CN
China
Prior art keywords
sampling
data
window
data block
proportion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910625106.3A
Other languages
Chinese (zh)
Other versions
CN110308998A (en)
Inventor
杨宇
方佩
冯仁伟
宋立锵
郑宏雄
岳靖雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Comservice Enrising Information Technology Co Ltd
Original Assignee
China Comservice Enrising Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Comservice Enrising Information Technology Co Ltd filed Critical China Comservice Enrising Information Technology Co Ltd
Priority to CN201910625106.3A priority Critical patent/CN110308998B/en
Publication of CN110308998A publication Critical patent/CN110308998A/en
Application granted granted Critical
Publication of CN110308998B publication Critical patent/CN110308998B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The application discloses a mass data sampling method. During data sampling, the method determines the sampling proportion of the current HDFS data block, divides the block into the data of a plurality of sampling windows according to that proportion, and finally calls a sampling process to sample the data in a target sampling window, obtaining a sampling result and setting a read-complete flag for the current HDFS data block. In this way, addressing the problem that MapReduce wastes a large amount of time reading irrelevant data, the method sets a sampling proportion for the data, divides the data block into several windows according to that proportion during sampling, randomly collects the data in one window and skips the irrelevant data lines, which markedly improves the efficiency of data sampling and processing. The application also provides a mass data sampling device, a load server, and a distributed processing system whose functions correspond to the method.

Description

Mass data sampling method and device
Technical Field
The present application relates to the field of distributed big data processing, and in particular, to a method and an apparatus for sampling mass data, a load server, and a distributed processing system.
Background
The existing distributed big data processing scheme can be roughly divided into two types, namely Hadoop MapReduce and Apache Spark.
MapReduce is a parallel, scalable computing model with good fault tolerance that mainly targets batch processing of massive offline data. Its two main phases are Map and Reduce: Map reads one or more lines of data at a time, maps and converts the data, and outputs an intermediate result; Reduce merges the intermediate results of all Map tasks.
Apache Spark is an open-source cluster computing environment similar to Hadoop, but Spark performs better on certain workloads. Spark uses a specially designed data structure, the Resilient Distributed Dataset (RDD), which supports interactive queries and optimizes iterative workloads and which can be reused across a parallel environment. Spark implements an efficient Directed Acyclic Graph (DAG) execution engine that, with the help of in-memory computation and advanced DAG scheduling, processes the same data set faster and handles data streams efficiently.
Limitation of the MapReduce algorithm: its data processing model handles only one or a few lines of data at a time until every data line of the whole file has been processed, so the algorithm must read all of the data. A large amount of resources is therefore wasted on irrelevant data, and data processing efficiency is low.
Limitation of the Spark algorithm: it must run on a large cluster configured with a large amount of memory. If it is deployed on a shared cluster, it may run into insufficient resources; compared with Hadoop MapReduce, Spark consumes more resources, costs more, and may affect other tasks using the cluster at the same time.
In summary, both the MapReduce and Spark algorithms are capable of processing data files, but MapReduce has low data processing efficiency and Spark occupies too many memory resources. How to improve the processing efficiency of mass data is therefore a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The application aims to provide a mass data sampling method and device, a load server, and a distributed processing system, so as to solve the problem that conventional schemes process mass data with low efficiency. The specific scheme is as follows:
in a first aspect, the present application provides a method for sampling mass data, which is applied to a load server in a distributed processing system, and includes:
determining the sampling proportion of the current HDFS data block;
dividing the current HDFS data block into data of a plurality of sampling windows according to the sampling proportion;
and calling a sampling process to sample data in a target sampling window to obtain a sampling result, and setting a reading completion mark for the current HDFS data block, wherein the target sampling window is any one of the plurality of sampling windows.
Optionally, the calling of a sampling process to sample data in a target sampling window to obtain a sampling result includes:
controlling a pre-created sampling pointer to move from the initial position of the current HDFS data block to the first row data position of the target sampling window;
and moving the sampling pointer line by line, and calling a sampling process to sample the data line pointed by the sampling pointer until the sampling pointer moves to the last line data position of the target sampling window to obtain a sampling result.
Optionally, the calling of a sampling process to sample the data line pointed to by the sampling pointer includes:
and calling a sampling process to sample the target byte of the data line pointed by the sampling pointer.
Optionally, the setting of a read-complete flag for the current HDFS data block includes:
and after the sampling result is obtained, moving the sampling pointer to the tail part of the current HDFS data block to be used as a reading completion mark.
Optionally, before the sampling process is invoked to sample data in the target sampling window and obtain a sampling result, the method further includes:
obtaining a random number using a uniformly distributed random function;
and determining a sampling window corresponding to the random number in the plurality of sampling windows as a target sampling window.
Optionally, the dividing the current HDFS data block into data of a plurality of sampling windows according to the sampling ratio includes:
determining the window number and the window size of a sampling window according to the sampling proportion and the size of the current HDFS data block;
and dividing the current HDFS data block into data of sampling windows of the number of the windows according to the size of the window.
Optionally, after the sampling process is invoked to sample data in the target sampling window and obtain a sampling result, the method further includes:
judging whether the first row data and the last row data of the sampling result are complete or not;
and if the data is not complete, performing integrity processing on the first line data and/or the last line data.
In a second aspect, the present application provides a device for sampling mass data, including:
a sampling ratio determination module, configured to determine the sampling proportion of the current HDFS data block;
a dividing module, configured to divide the current HDFS data block into the data of a plurality of sampling windows according to the sampling proportion;
a sampling module, configured to call a sampling process to sample data in a target sampling window to obtain a sampling result, and to set a read-complete flag for the current HDFS data block, wherein the target sampling window is any one of the plurality of sampling windows.
In a third aspect, the present application provides a load server, including:
a memory: for storing a computer program;
a processor: for executing said computer program for implementing the steps of a method for sampling mass data as described above.
In a fourth aspect, the present application provides a distributed processing system, comprising:
a client, configured to send a processing request to the central server;
a central server, configured to decompose the processing request into a plurality of sub-requests and send each sub-request to the load server corresponding to it;
a load server, configured to determine, in response to its sub-request, the sampling proportion of the current HDFS data block, divide the current HDFS data block into the data of a plurality of sampling windows according to the sampling proportion, sample the data in a target sampling window, set a read-complete flag for the current HDFS data block when sampling is completed, and send the sampling result to the central server for processing, wherein the target sampling window is any one of the plurality of sampling windows.
The mass data sampling method provided by the present application is applied to a load server in a distributed processing system. During data sampling, it determines the sampling proportion of the current HDFS data block, divides the block into the data of a plurality of sampling windows according to that proportion, and finally calls a sampling process to sample the data in a target sampling window, obtaining a sampling result and setting a read-complete flag for the current HDFS data block, wherein the target sampling window is any one of the plurality of sampling windows. In this way, addressing the problem that MapReduce wastes a large amount of time reading irrelevant data, the method sets a sampling proportion for the data, divides the data block into several windows according to that proportion during sampling, randomly collects the data in one window and skips the irrelevant data lines, which markedly improves the efficiency of data sampling and processing.
In addition, the application also provides a mass data sampling device, a load server, and a distributed processing system, whose functions correspond to the method and are not described again here.
Drawings
To explain the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a first implementation of a method for sampling mass data according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating an implementation of a second embodiment of a method for sampling mass data provided in the present application;
fig. 3 is a schematic diagram of modeling analysis of a third embodiment of a method for sampling mass data provided by the present application;
fig. 4 is a distribution diagram of analysis results of random samples of full data and different sampling ratios in a third embodiment of a method for sampling mass data provided by the present application;
fig. 5 to 8 are distribution diagrams of support degrees of a full-scale sample, a 10% sampling proportion sample, a 20% sampling proportion sample, and a 50% sampling proportion sample in a third embodiment of the method for sampling mass data provided in the present application, respectively;
FIG. 9 is a functional block diagram of an embodiment of a mass data sampling apparatus provided in the present application;
fig. 10 is a schematic structural diagram of a load server provided in the present application;
fig. 11 is a schematic diagram of an architecture of a distributed processing system according to the present application.
Detailed Description
In order that those skilled in the art may better understand the disclosure, a detailed description is given below with reference to the accompanying drawings. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present application.
At present, although both the MapReduce and Spark algorithms are capable of processing data files, MapReduce has low data processing efficiency and Spark occupies too many memory resources. To address this problem, the present application provides a mass data sampling method and device, a load server, and a distributed processing system that markedly improve the efficiency of data sampling and processing.
Referring to fig. 1, a first embodiment of a method for sampling mass data provided by the present application is described below, where the first embodiment is applied to a load server in a distributed processing system, and the first embodiment includes:
step S101: determining the sampling proportion of the current HDFS data block;
the current HDFS data block refers to an HDFS data block to be currently acquired in a data sampling process of a load server, and an HDFS is a Hadoop Distributed File System, and it is worth mentioning that in an actual application scenario, the load server may need to acquire a plurality of HDFS data blocks.
The sampling proportion may be set manually and determined according to the current task and scenario requirements. For example, a sampling proportion may be preset for each of several application scenarios; before sampling, the proportion corresponding to the current application scenario is looked up and used as the basis for the subsequent sampling process.
Step S102: dividing the current HDFS data block into data of a plurality of sampling windows according to the sampling proportion;
specifically, the window size and the number of windows are determined according to the sampling proportion, and the current HDFS data block is divided into that number of sampling windows. The number of windows is the reciprocal of the sampling proportion, and the window size is the product of the size of the current HDFS data block and the sampling proportion; reading the data in one sampling window therefore realizes sampling of the current HDFS data block at the given proportion.
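As a concrete illustration (a minimal sketch, not code from the patent — the class and field names are assumptions), the division described above can be expressed in Java as follows:

// Hypothetical sketch of the window division in step S102.
// blockSizeBytes: size of the current HDFS data block; samplingRatio: sampling proportion in (0, 1].
public final class WindowPlan {
    final int windowCount;       // number of windows = INT(1 / sampling proportion)
    final long windowSizeBytes;  // window size = block size x sampling proportion

    WindowPlan(long blockSizeBytes, double samplingRatio) {
        this.windowCount = (int) (1.0 / samplingRatio);                  // e.g. 10 for a 10% ratio
        this.windowSizeBytes = (long) (blockSizeBytes * samplingRatio);  // e.g. ~12.8M for a 128M block
    }

    // Byte offset at which window w (0-based) starts inside the block.
    long windowStart(int w) {
        return w * windowSizeBytes;
    }

    public static void main(String[] args) {
        WindowPlan plan = new WindowPlan(128L * 1024 * 1024, 0.10);
        System.out.println(plan.windowCount + " windows of " + plan.windowSizeBytes + " bytes each");
    }
}

Reading the data of exactly one such window then yields a sample of the block at the configured proportion.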
Step S103: and calling a sampling process to sample data in a target sampling window to obtain a sampling result, and setting a reading completion mark for the current HDFS data block, wherein the target sampling window is any one of the plurality of sampling windows.
In this embodiment, sampling processes correspond one-to-one with HDFS data blocks, so the sampling process corresponding to the current HDFS data block is called to sample the data falling into the target sampling window. The concrete sampling manner may be line-by-line or multi-line sampling, and the sampling object may be all data in the target sampling window or only specific data within it, for example the target bytes of each data line. Since the target sampling window is any one of the plurality of sampling windows, it can be selected at random.
As described above, in this embodiment, after the data in the target sampling window has been sampled, a read-complete flag is set for the current HDFS data block. This prevents the sampling process from reading the HDFS data block line by line again later and thereby skips the large amount of irrelevant data outside the target sampling window.
This embodiment provides a mass data sampling method applied to a load server in a distributed processing system. During data sampling, the method determines the sampling proportion of the current HDFS data block, divides the block into the data of a plurality of sampling windows according to that proportion, and finally calls a sampling process to sample the data in a target sampling window, obtaining a sampling result and setting a read-complete flag for the current HDFS data block, wherein the target sampling window is any one of the plurality of sampling windows. In this way, addressing the problem that MapReduce wastes a large amount of time reading irrelevant data, the method sets a sampling proportion for the data, divides the data block into several windows according to that proportion during sampling, randomly collects the data in one window and skips the irrelevant data lines, which markedly improves the efficiency of data sampling and processing.
A second embodiment of the mass data sampling method provided by the present application is described in detail below. The second embodiment is implemented on the basis of the first embodiment and extends it to a certain degree.
The second embodiment is applied to a load server in a distributed processing system, and referring to fig. 2, the second embodiment specifically includes:
step S201: determining the sampling proportion of the current HDFS data block;
the method is mainly used for obtaining initialization parameters in the sampling process, and the initialization parameters can also comprise the size of a buffer area, an operating memory, an input file directory, an output file directory, the size of an HDFS data block, the number of continuous samples and the like besides the sampling proportion, wherein the buffer area mainly refers to a storage area for storing sampling results.
In this embodiment, one sampling process processes one HDFS data block. Before sampling, the sampling process may obtain a sampling ratio of the current HDFS data block, and may also obtain sampled position information of the current HDFS data block. In the subsequent process, whether the current HDFS data block can be continuously sampled or not can be judged according to the sampled position information, if yes, the sampling process is executed, and if not, the sampling process is quitted.
Step S202: determining the window number and the window size of a sampling window according to the sampling proportion and the size of the current HDFS data block;
in this embodiment, an integer number N of sampling windows is calculated from the sampling proportion as N = INT(1/sampling proportion), where INT() takes the integer part of the number in parentheses; apart from any window that has already been sampled, each of these N windows is a selectable sampling window. A sampling window is a block of consecutive data lines within one HDFS block of the data file, and its size is the product of the HDFS data block size and the sampling proportion.
Step S203: dividing the current HDFS data block into data of sampling windows of the number of the windows according to the size of the window;
step S204: obtaining a random number according to the evenly distributed random function; determining a sampling window corresponding to the random number in the plurality of sampling windows as a target sampling window;
step S205: controlling a pre-created sampling pointer to move from the initial position of the current HDFS data block to the first row data position of the target sampling window;
to avoid wasting resources on irrelevant data, this embodiment skips the data lines in front of the target sampling window. Specifically, a conversion algorithm is called to convert the position of the target sampling window into a data line number within the HDFS data block, and the sampling pointer is moved from the start position of the HDFS data block to the head of the first data line of the target sampling window.
Step S206: and moving the sampling pointer line by line, and calling a sampling process to sample the data line pointed by the sampling pointer until the sampling pointer moves to the last line data position of the target sampling window to obtain a sampling result.
During sampling, this embodiment reads one data line at a time, or only the target bytes of each line, to improve reading performance, and the reading loops until the data of the target sampling window has been read completely. The target byte may be a preset byte of greatest interest to the user; which byte it is can be set according to actual requirements.
It should be noted that in a practical application scenario the termination condition of sampling may be, as described above, that the last data line of the target sampling window has been sampled, or that the buffer has been filled with data. Specifically, each time one line of data is read during sampling, the check of whether the buffer is full is executed once. The buffer is the storage area requested for the data collected during sampling.
Step S207: after the sampling result is obtained, moving the sampling pointer to the tail of the current HDFS data block to be used as a reading completion mark;
as described above, in this embodiment a read-complete flag is set for the current HDFS data block once its sampling is finished. Conventional data-line processing decides whether to keep reading an HDFS data block by comparing the position of its sampling pointer with the end position of the block; in this embodiment, therefore, the sampling file pointer is placed at the tail of the HDFS data block after sampling, indicating that the data in this block will not be read any further.
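To tie steps S204 to S207 together, the following is a hedged, self-contained Java sketch that samples one window of a local file standing in for an HDFS data block: it picks a window with a uniformly distributed random number, moves the pointer to the window start, reads line by line until the window end or until the buffer limit is reached, and finally places the pointer at the block tail as the read-complete mark. It deliberately uses only the standard library; a real implementation would live inside the Hadoop input-reading layer, whose API is not reproduced here, and every name and parameter below is illustrative.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of steps S204-S207 against a local file that stands in for one HDFS data block.
public final class WindowSampler {

    public static List<String> sampleOneWindow(String path, long blockStart, long blockSize,
                                               double samplingRatio, long bufferLimitBytes) throws IOException {
        long windowSize = (long) (blockSize * samplingRatio);  // window size = block size x ratio
        int windowCount = (int) (1.0 / samplingRatio);         // window count = INT(1 / ratio)
        int target = new Random().nextInt(windowCount);        // S204: uniform random target window

        List<String> sample = new ArrayList<>();
        long buffered = 0;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long windowStart = blockStart + target * windowSize;  // first byte of the target window
            long windowEnd = windowStart + windowSize;

            file.seek(windowStart);                    // S205: move the sampling pointer to the window start
            if (target > 0) file.readLine();           // discard a possibly partial first line

            String line;
            while (file.getFilePointer() < windowEnd && (line = file.readLine()) != null) {
                sample.add(line);                      // S206: sample the line the pointer points to
                buffered += line.length();
                if (buffered >= bufferLimitBytes) break;  // stop early once the buffer is full
            }
            file.seek(blockStart + blockSize);         // S207: pointer at the block tail = read-complete mark
        }
        return sample;
    }
}

For example, sampleOneWindow("/tmp/block.txt", 0, 128L * 1024 * 1024, 0.10, 16L * 1024 * 1024) would return at most one roughly 12.8M window of lines from a 128M region of the file.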
Step S208: judging whether the first line data and the last line data of the sampling result are complete, if so, entering a step S210, otherwise, entering a step S209;
step S209: carrying out integrity processing on the first line data and/or the last line data;
specifically, the integrity processing may complete an incomplete data line or directly discard it, as actual requirements dictate.
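A minimal sketch of this integrity handling, under the assumption that incomplete boundary lines are simply discarded (completing them is the other option named above); the field-count test is only an illustrative completeness criterion:

import java.util.List;

// Hypothetical helper: drop the first/last line of a sampling result when it does not look complete.
final class BoundaryCleaner {
    static void dropIncompleteEdges(List<String> lines, int expectedFields, String delimiter) {
        if (!lines.isEmpty() && lines.get(0).split(delimiter, -1).length != expectedFields) {
            lines.remove(0);                           // first line was cut off at the window start
        }
        if (!lines.isEmpty() && lines.get(lines.size() - 1).split(delimiter, -1).length != expectedFields) {
            lines.remove(lines.size() - 1);            // last line was cut off at the window end
        }
    }
}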
Step S210: storing the sampling result of this sampling process, together with the sampling results of the other sampling processes, in a target directory different from the output file directory.
Because the MapReduce framework writes the sampled data file to a fixed location, later runs may overwrite earlier output. As a preferred embodiment, the sampled data file is therefore moved to a directory different from the output file directory, ensuring that it is not overwritten by the output of a subsequent sampling run.
In summary, the mass data sampling method provided by this embodiment calculates the sampling window size from a given sampling proportion, records the selectable sampling windows of each HDFS data block under that proportion, and selects a previously unused sampling window each time a sampling process starts, realizing random sampling without replacement. Specifically, the sampling position is converted into a sampling pointer, which is moved from the start of its HDFS data block to the target sampling window, achieving fast positioning of the sampled data lines; the data lines are then read in a loop until the data of the target sampling window has been read completely, and the file pointer is moved to the tail of the HDFS block as a read-complete flag.
This embodiment therefore overcomes the inefficiency of random sampling of mass data in existing schemes and provides a new random-sampling scheme for big data. Compared with the original MapReduce data-line processing algorithm, under the same hardware and data volume, the random sampling efficiency of this embodiment is T times that of MapReduce (T ≥ 1); taking randomly sampling 1G of data from a 100G file as an example, the random sampling efficiency of this embodiment is 100 times that of MapReduce.
A third embodiment of the method for sampling mass data provided by the present application is described in detail below. In the third embodiment, data of different proportions are randomly sampled from a chain store sales data set, full data and sample data obtained by sampling are respectively modeled and analyzed, the similarity of modeling analysis results based on different data volumes is calculated, and the availability of the data obtained by random sampling is evaluated; and evaluating the execution efficiency of the random sampling algorithm according to the time consumption of the sampling algorithm on the large data set.
The main process of modeling analysis is shown in fig. 3, the data used in this example is a sales data detail of about 22G after desensitization, and each row of data records information of a certain customer purchasing a commodity, including information of ID, date, store number, department, commodity catalog, specification and model, brand, manufacturer, metering unit, quantity, amount, etc.
The data sampling process in this embodiment is described in detail below:
First, during parameter initialization, the parameter values chosen according to the size of the data file are as follows:
mapred.child.java.opts=-Xmx16384m
mapred.map.child.java.opts=-Xmx16384m
mapred.reduce.child.java.opts=-Xmx16384m
mapreduce.task.io.sort.mb=160
Second, the HDFS data block size in this example is 128M. With a sampling ratio of 10%, the data window size is WS = 128M × 10% = 12.8M, and the number of data windows in each HDFS block is Wn = 1 / 10% = 10. The 22G data file is divided into 22 × 1024 / 128 = 176 logical blocks, so sampling 10% of the 22G data file has 10^176 possible combinations; each sampling run produces a 2.2G sample file that is one of these 10^176 combinations.
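Restated as formulas (B: HDFS block size, p: sampling ratio, F: file size, with the values assumed in this example):

WS = B × p = 128M × 10% = 12.8M
Wn = 1 / p = 1 / 10% = 10
number of logical blocks = F / B = (22 × 1024M) / 128M = 176
number of possible 10% samples = Wn^176 = 10^176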
Third, according to the calculated number of data windows, 10, a random number R ranging from 0 to 9 is generated with a uniformly distributed random function, and the first R × 12.8M of irrelevant data at the front of the current HDFS block is skipped. Specifically, the sampling program calls the same algorithm to skip the irrelevant data when processing each HDFS block.
Fourth, data is read starting from position R × 12.8M + 1 until the end of the data window or the end of the current HDFS data block is reached. After one data window has been read, a block read-complete flag is set to mark that the data of the current HDFS block has been read.
Fifth, the data read from each HDFS block is merged into one file, completing the data sampling. For ease of management, the data sampling program is allowed to delete the output file directory automatically; that is, when the same sampling program is run repeatedly against the same output directory, only the last output is kept. The output data files are then transferred to different directories, and a new file name is generated on transfer in the form: source file name-sampling ratio-sequence number.dsp, where the source file is the file from which the data was extracted and the sequence number is a self-incrementing natural number. The resulting file names look like: data-0.1-1.dsp, data-0.1-2.dsp, ..., data-0.1-n.dsp.
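As a small illustration of the naming rule "source file name-sampling ratio-sequence number.dsp" and the move to a separate directory (the helper and its parameters are assumptions, not part of the patent):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for the renaming/transfer step described above.
final class SampleFileMover {
    static Path moveSample(Path sampledFile, Path archiveDir, String sourceName,
                           double samplingRatio, int sequenceNumber) throws IOException {
        String newName = sourceName + "-" + samplingRatio + "-" + sequenceNumber + ".dsp";  // e.g. data-0.1-1.dsp
        Files.createDirectories(archiveDir);
        return Files.move(sampledFile, archiveDir.resolve(newName), StandardCopyOption.REPLACE_EXISTING);
    }
}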
The following describes in detail the analysis process of the data sampling result in this embodiment:
in this embodiment, the FP-growth association-rule algorithm is used to mine frequent item sets from the full-scale sample and from the sampled data, and the similarity of the frequent item sets and of the recommended item sets under different data volumes is then compared. Association rules are a simple, practical analytical technique for discovering associations or correlations in large data sets; they describe the laws and patterns by which certain attributes of an object appear together. The concepts used in the analysis are introduced below:
transaction: each data record (for example, one purchase) is called a transaction;
item: each entry in a transaction is called an item, e.g., Cola, Egg;
item set: a set containing zero or more items is called an item set, e.g., { Cola, Egg, Ham };
k-item set: the term set containing k terms is called a k-term set, e.g., { Cola } is called a 1-term set, and { Cola, Egg } is called a 2-term set;
support: the number of transactions in which an item set occurs (the support count) divided by the total number of transactions. For example, if the total number of transactions is 4 and {Diaper, Beer} occurs in 3 of them, its support is 3 ÷ 4 = 75%, meaning 75% of the customers bought Diaper and Beer together;
frequent item set: the item set with the support degree larger than or equal to a certain threshold value is called a frequent item set;
confidence: the confidence P(B|A) = P(AB)/P(A) is the probability that event B occurs given that event A has occurred.
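In formula form, for an item set X and a rule A ⇒ B:

support(X) = (number of transactions containing X) ÷ (total number of transactions)
confidence(A ⇒ B) = P(B | A) = support(A ∪ B) ÷ support(A)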
the data preprocessing is to merge the data, and the preprocessed data storage model is as follows: ID. Date, item list (list of all items purchased by the customer on a certain day without duplication). This list of items is the set of items in the association rule algorithm.
The association-rule parameters are set to support = 0.2 and confidence = 0.6. The similarity between the analysis results of the 22G full-scale data and those of random samples at different sampling ratios is shown in Table 1:
TABLE 1
Sampling ratio Similarity (support) Similarity (confidence)
0.01% 69.64% 59.98%
0.02% 77.77% 71.58%
0.03% 83.51% 74.65%
0.04% 84.82% 78.13%
0.05% 87.39% 84.63%
0.06% 89.31% 85.60%
0.07% 90.77% 89.29%
0.08% 89.86% 86.32%
0.09% 85.30% 77.63%
0.10% 92.03% 89.98%
The distribution of the analysis results for the 22G full-scale data and for random samples at different sampling ratios is shown in fig. 4; as the distribution shows, once the sampling ratio p exceeds four parts per ten thousand (0.04%), the similarity (both support and confidence) between the random samples and the full-scale sample levels off. Ten commodities were selected at random and their support in the full-scale sample and in samples of different proportions was calculated; the distribution in the full-scale sample is shown in fig. 5, and the results for sampling ratios of 10%, 20% and 50% are shown in figs. 6, 7 and 8. It can be seen that the distribution of the commodities in the random samples is consistent with that in the full data, so an analysis result similar to the full data can be reached with far fewer randomly sampled records; random sampling greatly reduces the amount of data involved in the computation and effectively speeds up data processing and analysis.
In the following, a mass data sampling apparatus provided in an embodiment of the present application is introduced, and a mass data sampling apparatus described below and a mass data sampling method described above may be referred to correspondingly.
As shown in fig. 9, the apparatus includes:
the sampling ratio determination module 901, configured to determine the sampling proportion of the current HDFS data block;
the dividing module 902, configured to divide the current HDFS data block into the data of a plurality of sampling windows according to the sampling proportion;
the sampling module 903, configured to call a sampling process to sample data in a target sampling window to obtain a sampling result, and to set a read-complete flag for the current HDFS data block, wherein the target sampling window is any one of the plurality of sampling windows.
The mass data sampling apparatus of this embodiment is used to implement the foregoing mass data sampling method, and therefore a specific implementation of the apparatus may refer to the foregoing embodiments of the mass data sampling method, for example, the sampling ratio determining module 901, the dividing module 902, and the sampling module 903 are respectively used to implement steps S101, S102, and S103 in the foregoing mass data sampling method. Therefore, specific embodiments thereof may be referred to in the description of the corresponding respective partial embodiments, and will not be described herein.
In addition, since the sampling apparatus for mass data of this embodiment is used to implement the aforementioned method for sampling mass data, its role corresponds to that of the aforementioned method, and is not described herein again.
In addition, the present application also provides a load server, as shown in fig. 10, including:
the memory 100: for storing a computer program;
the processor 200: for executing said computer program for implementing the steps of a method for sampling mass data as described above.
The load server of this embodiment is used to implement the foregoing method for sampling mass data, so that a specific implementation of the load server may be found in the foregoing embodiment section of the method for sampling mass data, and its role corresponds to that of the foregoing method, and is not described here again.
Finally, the present application provides a distributed processing system, as shown in fig. 11, comprising:
the client 111, configured to send a processing request to the central server;
the central server 112, configured to decompose the processing request into a plurality of sub-requests and send each sub-request to the load server corresponding to it;
the load server 113, configured to determine, in response to its sub-request, the sampling proportion of the current HDFS data block, divide the current HDFS data block into the data of a plurality of sampling windows according to the sampling proportion, sample the data in a target sampling window, set a read-complete flag for the current HDFS data block when sampling is completed, and send the sampling result to the central server for processing, wherein the target sampling window is any one of the plurality of sampling windows.
The distributed processing system of this embodiment is used to implement the foregoing method for sampling mass data, so that a specific implementation of the distributed processing system may be found in the foregoing embodiment of the method for sampling mass data, and its role corresponds to that of the foregoing method, and details are not described here.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The solutions provided in the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, a person skilled in the art may, following the ideas of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method for sampling mass data is applied to a load server in a distributed processing system, and comprises the following steps:
determining the sampling proportion of the current HDFS data block;
dividing the current HDFS data block into data of a plurality of sampling windows according to the sampling proportion;
calling a sampling process to sample data in a target sampling window to obtain a sampling result, and setting a reading completion mark for the current HDFS data block, wherein the target sampling window is any one of the plurality of sampling windows;
the dividing the current HDFS data block into data of a plurality of sampling windows according to the sampling ratio includes: determining the window size and the window number of a sampling window according to the sampling proportion, and dividing the current HDFS data block into data of a plurality of sampling windows according to the window size and the window number, wherein the window number is the reciprocal of the sampling proportion, and the window size is the product of the size of the current HDFS data block and the sampling proportion;
the sampling of the data in the target sampling window comprises: reading the data within the target sampling window.
2. The method of claim 1, wherein said invoking a sampling process to sample data within a target sampling window to obtain a sampling result comprises:
controlling a pre-created sampling pointer to move from the initial position of the current HDFS data block to the first row data position of the target sampling window;
and moving the sampling pointer line by line, and calling a sampling process to sample the data line pointed by the sampling pointer until the sampling pointer moves to the last line data position of the target sampling window to obtain a sampling result.
3. The method of claim 2, wherein invoking the sampling process to sample the line of data pointed to by the sampling pointer comprises:
and calling a sampling process to sample the target byte of the data line pointed by the sampling pointer.
4. The method of claim 2, wherein setting a read complete flag for the current HDFS data block comprises:
and after the sampling result is obtained, moving the sampling pointer to the tail part of the current HDFS data block to be used as a reading completion mark.
5. The method of claim 1, wherein before the invoking sampling process samples data within the target sampling window to obtain a sampling result, further comprising:
obtaining a random number using a uniformly distributed random function;
and determining a sampling window corresponding to the random number in the plurality of sampling windows as a target sampling window.
6. The method of claim 1, wherein said dividing the current HDFS block of data into data for a plurality of sampling windows according to the sampling ratio comprises:
determining the window number and the window size of a sampling window according to the sampling proportion and the size of the current HDFS data block;
and dividing the current HDFS data block into data of sampling windows of the number of the windows according to the size of the window.
7. The method of any one of claims 1-6, wherein after the invoking sampling process samples data within the target sampling window to obtain a sampling result, further comprising:
judging whether the first row data and the last row data of the sampling result are complete or not;
and if the data is not complete, performing integrity processing on the first line data and/or the last line data.
8. A mass data sampling apparatus, comprising:
a sampling ratio determination module, configured to determine the sampling proportion of a current HDFS data block;
a dividing module, configured to divide the current HDFS data block into the data of a plurality of sampling windows according to the sampling proportion;
a sampling module, configured to call a sampling process to sample data in a target sampling window to obtain a sampling result and to set a reading completion mark for the current HDFS data block, wherein the target sampling window is any one of the plurality of sampling windows;
wherein the dividing module is configured to: determine the window size and the window number of a sampling window according to the sampling proportion, and divide the current HDFS data block into the data of a plurality of sampling windows according to the window size and the window number, wherein the window number is the reciprocal of the sampling proportion and the window size is the product of the size of the current HDFS data block and the sampling proportion;
and the sampling module is configured to: read the data within the target sampling window.
9. A load server, comprising:
a memory: for storing a computer program;
a processor: for executing said computer program for carrying out the steps of a method for sampling mass data according to any one of claims 1 to 7.
10. A distributed processing system, comprising:
a client, configured to send a processing request to the central server;
a central server, configured to decompose the processing request into a plurality of sub-requests and send each sub-request to the load server corresponding to it;
a load server, configured to determine, in response to the sub-request, the sampling proportion of a current HDFS data block, divide the current HDFS data block into the data of a plurality of sampling windows according to the sampling proportion, sample the data in a target sampling window, set a reading completion mark for the current HDFS data block when sampling is completed, and send the sampling result to the central server for processing, wherein the target sampling window is any one of the plurality of sampling windows;
wherein the load server is specifically configured to: determine the window size and the window number of a sampling window according to the sampling proportion, and divide the current HDFS data block into the data of a plurality of sampling windows according to the window size and the window number, wherein the window number is the reciprocal of the sampling proportion and the window size is the product of the size of the current HDFS data block and the sampling proportion; and read the data within the target sampling window.
CN201910625106.3A 2019-07-11 2019-07-11 Mass data sampling method and device Active CN110308998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910625106.3A CN110308998B (en) 2019-07-11 2019-07-11 Mass data sampling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910625106.3A CN110308998B (en) 2019-07-11 2019-07-11 Mass data sampling method and device

Publications (2)

Publication Number Publication Date
CN110308998A CN110308998A (en) 2019-10-08
CN110308998B true CN110308998B (en) 2021-09-07

Family

ID=68079861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910625106.3A Active CN110308998B (en) 2019-07-11 2019-07-11 Mass data sampling method and device

Country Status (1)

Country Link
CN (1) CN110308998B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826697B (en) * 2019-10-31 2023-06-06 深圳市商汤科技有限公司 Method and device for acquiring sample, electronic equipment and storage medium
CN111694802B (en) * 2020-06-12 2023-04-28 百度在线网络技术(北京)有限公司 Method and device for obtaining duplicate removal information and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102497209A (en) * 2011-12-06 2012-06-13 南京信息工程大学 Sliding window type data sampling method and device
CN105117171A (en) * 2015-08-28 2015-12-02 南京国电南自美卓控制系统有限公司 Energy SCADA massive data distributed processing system and method thereof
CN107666417A (en) * 2017-10-18 2018-02-06 盛科网络(苏州)有限公司 The method for realizing IPFIX stochastical samplings
CN108681489A (en) * 2018-05-25 2018-10-19 西安交通大学 It is a kind of it is super calculate environment under mass data in real time acquisition and processing method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102497209A (en) * 2011-12-06 2012-06-13 南京信息工程大学 Sliding window type data sampling method and device
CN105117171A (en) * 2015-08-28 2015-12-02 南京国电南自美卓控制系统有限公司 Energy SCADA massive data distributed processing system and method thereof
CN107666417A (en) * 2017-10-18 2018-02-06 盛科网络(苏州)有限公司 The method for realizing IPFIX stochastical samplings
CN108681489A (en) * 2018-05-25 2018-10-19 西安交通大学 It is a kind of it is super calculate environment under mass data in real time acquisition and processing method

Also Published As

Publication number Publication date
CN110308998A (en) 2019-10-08

Similar Documents

Publication Publication Date Title
US9712646B2 (en) Automated client/server operation partitioning
CN110908641B (en) Visualization-based stream computing platform, method, device and storage medium
CN106407207B (en) Real-time newly-added data updating method and device
US20170228422A1 (en) Flexible task scheduler for multiple parallel processing of database data
US10812322B2 (en) Systems and methods for real time streaming
CN110308998B (en) Mass data sampling method and device
CN110688382A (en) Data storage query method and device, computer equipment and storage medium
CN110955732A (en) Method and system for realizing partition load balance in Spark environment
US10248562B2 (en) Cost-based garbage collection scheduling in a distributed storage environment
CN109614270A (en) Data read-write method, device, equipment and storage medium based on Hbase
CN110019341B (en) Data query method and device
CN108153859A (en) A kind of effectiveness order based on Hadoop and Spark determines method parallel
CN109522273B (en) Method and device for realizing data writing
CN113407343A (en) Service processing method, device and equipment based on resource allocation
CN111126619B (en) Machine learning method and device
CN115129460A (en) Method and device for acquiring operator hardware time, computer equipment and storage medium
CN110442616B (en) Page access path analysis method and system for large data volume
CN111159106A (en) Data query method and device
CN116303246A (en) Storage increment statistical method, device, computer equipment and storage medium
CN113516506B (en) Data processing method and device and electronic equipment
CN114020946A (en) Target judgment processing method and system based on multi-graph retrieval data fusion
CN111767287A (en) Data import method, device, equipment and computer storage medium
CN109344091B (en) Buffer array regulation method, device, terminal and readable medium
CN114390107B (en) Request processing method, apparatus, computer device, storage medium, and program product
CN115544096B (en) Data query method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant