CN114357069B - Big data sampling method and system based on distributed storage - Google Patents
- Publication number
- CN114357069B (application number CN202111588216.0A)
- Authority
- CN
- China
- Prior art keywords
- sampling
- data
- index
- sampling rate
- time period
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the invention provides a big data sampling method and system based on distributed storage. The method comprises: for each index, acquiring index data for a preset time period from a distributed storage module, and randomly sampling that index data at an initial sampling rate to obtain corresponding index sampling data; correcting the previous sampling rate in a preset manner to obtain the next sampling rate, and randomly sampling the index data of the preset time period at the next sampling rate to obtain corresponding index sampling data; performing aggregate calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result for that sampling rate; repeating this until the calculation result obtained from index sampling data randomly sampled at a corrected sampling rate meets a preset requirement, and taking the sampling rate whose calculation result meets the preset requirement as the final sampling rate of the index; and sampling index data from the distributed storage module at the final sampling rate of the index. Because the final sampling rate is determined by trial calculation, server cost is reduced and calculation time is saved.
Description
Technical Field
The invention relates to the field of data analysis, in particular to a big data sampling method and system based on distributed storage.
Background
With the rapid spread of the internet, a large amount of data is generated every day. Internet enterprises need a big data platform to process this mass data, and the calculation consumes a great deal of computing power and takes a long time.
For example, a large internet enterprise generates hundreds of millions of user behavior logs every day. Calculating a user behavior index from them requires hundreds of servers, and the calculation takes 4-5 hours to complete. Mass data calculation of this kind is time-consuming and laborious, and imposes a very large cost on the enterprise.
Disclosure of Invention
An embodiment of the invention provides a big data sampling method and system based on distributed storage, in which the final sampling rate is determined by trial calculation, so that calculation accuracy meets the service requirement while server cost is reduced and calculation time is saved.
In order to achieve the above object, in one aspect, an embodiment of the present invention provides a big data sampling method based on distributed storage, including:
Storing big data comprising various index data by adopting a distributed storage module, and setting an initial sampling rate of the index for sampling index data from the distributed storage module;
For each index, acquiring index data in a preset time period from a distributed storage module, and randomly sampling the index data in the preset time period according to an initial sampling rate to obtain corresponding index sampling data; correcting the former sampling rate according to a preset mode to obtain the latter sampling rate, and randomly sampling the index data of the preset time period according to the latter sampling rate to obtain corresponding index sampling data; carrying out aggregation calculation on index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index until the calculation result obtained by carrying out aggregation calculation on index sampling data obtained by randomly sampling according to the corrected sampling rate meets the preset requirement; wherein the initial sample rate is a first sample rate;
And sampling index data from the distributed storage module by adopting the final sampling rate of the index.
In another aspect, an embodiment of the present invention provides a big data sampling system based on distributed storage, including:
The data storage unit is used for storing big data comprising various index data by adopting a distributed storage module;
The coordination manager is used for setting the initial sampling rate of the index for sampling index data from the distributed storage module;
The sampling rate calculation unit is used for acquiring index data in a preset time period from the distributed storage module according to each index, and acquiring corresponding index sampling data from the index data in the preset time period by random sampling according to an initial sampling rate set by the coordination manager; correcting the former sampling rate according to a preset mode to obtain the latter sampling rate, and randomly sampling the index data of the preset time period according to the latter sampling rate to obtain corresponding index sampling data; carrying out aggregation calculation on index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index until the calculation result obtained by carrying out aggregation calculation on index sampling data obtained by randomly sampling according to the corrected sampling rate meets the preset requirement; wherein the initial sample rate is a first sample rate;
And the sampling unit is used for sampling index data from the distributed storage module by adopting the final sampling rate of the index.
The technical scheme has the following beneficial effects: determining the final sampling rate by trial calculation ensures that calculation accuracy meets the service requirement, reduces server cost and saves calculation time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a big data sampling method based on distributed storage according to an embodiment of the present invention;
FIG. 2 is a block diagram of a big data sampling system based on distributed storage in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of a big data system in accordance with an embodiment of the present invention;
FIG. 4 is a diagram of the distributed data storage structure of an embodiment of the present invention;
FIG. 5 is a sampling calculation architecture diagram of an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIG. 1, in connection with an embodiment of the present invention, a big data sampling method based on distributed storage is provided, which includes:
S101: storing big data comprising various index data by adopting a distributed storage module;
Big data is generally understood in the industry as data whose volume exceeds the analysis and computing capacity of traditional databases, so that a multi-machine cluster is usually required to process it. Big data has the "5V" characteristics: Volume, Velocity, Variety, Value and Veracity. It is a data set so large that its acquisition, storage, management and analysis far exceed the capabilities of traditional database software tools, and it has four major characteristics: massive data scale, rapid data flow, diverse data types and low value density.
S102: setting an initial sampling rate of the index for sampling index data from the distributed storage module;
S103: for each index, acquiring index data in a preset time period from a distributed storage module, and randomly sampling the index data in the preset time period according to an initial sampling rate to obtain corresponding index sampling data; correcting the former sampling rate according to a preset mode to obtain the latter sampling rate, and randomly sampling the index data of the preset time period according to the latter sampling rate to obtain corresponding index sampling data; carrying out aggregation calculation on index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index until the calculation result obtained by carrying out aggregation calculation on index sampling data obtained by randomly sampling according to the corrected sampling rate meets the preset requirement; wherein the initial sample rate is a first sample rate;
S104: and sampling index data from the distributed storage module by adopting the final sampling rate of the index.
Preferably, step S101 comprises:
Sequentially writing the big data into different distributed storage modules of an encapsulated HDFS system.
Preferably, in step S103, correcting the previous sampling rate in a preset manner to obtain the next sampling rate includes: halving the previous sampling rate iteratively to obtain the next sampling rate.
Preferably, in step S103, for each index, performing aggregate calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result for that sampling rate, until the calculation result obtained from index sampling data randomly sampled at a corrected sampling rate meets the preset requirement, and taking the sampling rate whose calculation result meets the preset requirement as the final sampling rate of the index, comprises the following steps:
S1031: for the index sampling data of each sampling rate, calculating the proportion of abnormal index data in the index sampling data, and calculating the abnormal data proportion error corresponding to that sampling rate; the abnormal data proportion error is the difference between the proportion of abnormal index data when the index data of the preset time period is sampled and the proportion of abnormal index data when it is not sampled, the latter being the proportion of all abnormal index data within the index data of the preset time period;
S1032: when the abnormal data proportion error corresponding to the previous sampling rate is smaller than a preset error threshold and the abnormal data proportion error corresponding to the next sampling rate is larger than the preset error threshold, taking the previous sampling rate as the final sampling rate of the index.
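Steps S1031 and S1032 can be illustrated with a minimal Python sketch. The function names, the `is_abnormal` predicate and the use of an absolute difference for the error are assumptions made for illustration; the text above only speaks of a "difference" and does not prescribe an implementation.

```python
def abnormal_proportion(records, is_abnormal):
    """Share of records flagged abnormal; 0.0 for an empty collection."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if is_abnormal(r)) / len(records)


def proportion_error(sampled, full, is_abnormal):
    """S1031: gap between the sampled and unsampled abnormal proportions.

    Taking the absolute value is an assumption of this sketch.
    """
    return abs(abnormal_proportion(sampled, is_abnormal)
               - abnormal_proportion(full, is_abnormal))


def pick_final_rate(errors_by_rate, threshold):
    """S1032: errors_by_rate holds (rate, error) pairs in trial order,
    i.e. with decreasing sampling rate. Returns the previous rate at the
    first point where the error crosses the threshold, else None.
    """
    pairs = zip(errors_by_rate, errors_by_rate[1:])
    for (prev_rate, prev_err), (_next_rate, next_err) in pairs:
        if prev_err < threshold and next_err > threshold:
            return prev_rate
    return None
```

With hypothetical trial errors of 0.0001, 0.0002 and 0.0005 at rates of 50%, 25% and 12%, and a threshold of 0.0003 (0.03%), `pick_final_rate` returns 25.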
Preferably, the method further comprises:
S105: for all indexes, pushing, at a specified time interval, the index data of the preset time period acquired from the distributed storage module and the index sampling data obtained by randomly sampling that index data;
In step S103, performing aggregate calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result for that sampling rate specifically includes:
After receiving the index data of the preset time period acquired from the distributed storage module and the index sampling data randomly sampled from that index data, performing aggregate calculation on the index sampling data corresponding to the sampling rate to obtain the calculation result for that sampling rate.
As shown in fig. 2, in connection with an embodiment of the present invention, there is provided a big data sampling system based on distributed storage, including:
A data storage unit 21 for storing big data including various index data by using a distributed storage module;
A coordination manager 22 for setting an initial sampling rate of the index when the index data is sampled from the distributed storage module;
The sampling rate calculating unit 23 is configured to obtain, for each index, index data in a preset time period from the distributed storage module, and randomly sample the index data in the preset time period according to an initial sampling rate set by the coordination manager to obtain corresponding index sampling data; correcting the former sampling rate according to a preset mode to obtain the latter sampling rate, and randomly sampling the index data of the preset time period according to the latter sampling rate to obtain corresponding index sampling data; carrying out aggregation calculation on index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index until the calculation result obtained by carrying out aggregation calculation on index sampling data obtained by randomly sampling according to the corrected sampling rate meets the preset requirement; wherein the initial sample rate is a first sample rate;
And a sampling unit 24 for sampling the index data from the distributed storage module using the final sampling rate of the index.
Preferably, the data storage unit 21 includes:
An encapsulated HDFS system, configured to sequentially write the big data into different distributed storage modules of the encapsulated HDFS system.
Preferably, the sampling rate calculation unit 23 includes:
The sampling rate correction subunit 231, configured to correct the previous sampling rate in a preset manner to obtain the next sampling rate, which includes: halving the previous sampling rate iteratively to obtain the next sampling rate.
Preferably, the sampling rate calculation unit 23 includes:
A sampling rate verification subunit 232, configured to calculate, for the index sampling data of each sampling rate, the proportion of abnormal index data in the index sampling data and the abnormal data proportion error corresponding to that sampling rate; the abnormal data proportion error is the difference between the proportion of abnormal index data when the index data of the preset time period is sampled and the proportion of abnormal index data when it is not sampled, the latter being the proportion of all abnormal index data within the index data of the preset time period;
The sampling rate determining subunit 233, configured to take the previous sampling rate as the final sampling rate of the index when the abnormal data proportion error corresponding to the previous sampling rate is smaller than the preset error threshold and the abnormal data proportion error corresponding to the next sampling rate is larger than the preset error threshold.
Preferably, the system further includes a data pushing unit 25, and the sampling rate calculation unit 23 includes an aggregation calculation subunit 234, where:
The data pushing unit 25 is configured to push, for all the indexes, the index data in a preset time period acquired from the distributed storage module and the index sampling data obtained by randomly sampling the index data in the preset time period at specified time intervals;
The aggregation calculation subunit 234 is configured to aggregate the index sampling data corresponding to the sampling rate to obtain a calculation result corresponding to the sampling rate after receiving the index data in the preset time period acquired from the distributed storage module pushed by the data pushing unit and the index sampling data randomly sampled from the index data in the preset time period.
The beneficial effects obtained by the invention are as follows:
The method can quickly calculate mass data, so that the cost of the server is greatly saved, and the calculation time is saved.
The sampling rate is determined in a trial calculation mode, so that the method can be flexibly adapted to the calculation precision of various indexes.
The foregoing technical solutions of the embodiments of the present invention will be described in detail with reference to specific application examples, and reference may be made to the foregoing related description for details of the implementation process that are not described.
Abbreviations and key terms involved in the present invention are defined as follows:
Real-time sampling: the rapid big data computing approach introduced by this invention, which is much faster than traditional big data computing systems.
Distributed: as opposed to centralized; tasks are spread across many servers.
Data sampling: acquiring part of the data from a mass data set, according to a sampling algorithm, for data analysis.
Distributed data sampling: the sampling operation itself is distributed, so it runs faster.
The invention relates to a rapid big data computing system based on distributed real-time sampling, and belongs to the technical fields of big data and data analysis. In this system the data is sampled in a distributed manner while it is being calculated, which greatly reduces the number of servers and the time the operation requires; that is, a distributed real-time sampling mechanism makes rapid big data computation possible. The final sampling rate is determined by trial calculation, so that calculation accuracy meets the service requirement while server cost is reduced and calculation time is saved.
For example, the same hundreds of millions of logs can be processed in tens of minutes using only a few servers. This saves both cost and computing time, brings good benefits to enterprises, and makes the system highly practical.
The architecture diagram of the big data system in the technical scheme of the invention is shown in fig. 3, and the distributed real-time sampling big data computing system mainly comprises a data storage module (a data storage unit), a sampling computing module (a sampling rate computing unit) and a coordination manager. Wherein:
The data storage module, as shown in FIG. 4, is responsible for storing the mass data to be calculated. Big data comprising various index data is stored using distributed storage modules: an encapsulated HDFS system sequentially writes the big data into its different distributed storage modules. Specifically:
Because the data volume is large, a single machine cannot hold it, so the data storage module uses a distributed file storage system, implemented by encapsulating HDFS. To allow fast access to the data, files are written sequentially, with one file per 1 GB of data. Sequentially written files can be read quickly, and the 1 GB file size keeps the number of files from becoming excessive.
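The sequential, size-limited file layout can be sketched as follows. This is only an illustration under stated assumptions: it writes to a local directory rather than to HDFS, and the class name `RollingWriter`, the `part-NNNNN.log` naming and the record-level rollover rule are hypothetical, not taken from the embodiment.

```python
import os


class RollingWriter:
    """Appends records sequentially, starting a new file near a size limit."""

    def __init__(self, directory, max_bytes=1 << 30, prefix="part"):
        self.directory = directory
        self.max_bytes = max_bytes      # 1 GiB by default, as in the text
        self.prefix = prefix
        self._index = 0
        self._written = 0
        self._fh = None

    def _open_next(self):
        if self._fh is not None:
            self._fh.close()
        name = f"{self.prefix}-{self._index:05d}.log"
        self._fh = open(os.path.join(self.directory, name), "ab")
        self._index += 1
        self._written = 0

    def write(self, record: bytes):
        # Roll over before a write that would push a non-empty file past
        # the limit; a single oversized record still gets its own file.
        too_big = self._written > 0 and self._written + len(record) > self.max_bytes
        if self._fh is None or too_big:
            self._open_next()
        self._fh.write(record)
        self._written += len(record)

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self._fh = None
```

A reader can then scan the files in name order, which matches the sequential read pattern described above.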
And a sampling calculation module: for each index, acquiring index data in a preset time period from a distributed storage module, and randomly sampling the index data in the preset time period according to an initial sampling rate to obtain corresponding index sampling data; correcting the former sampling rate according to a preset mode to obtain the latter sampling rate, and randomly sampling the index data of the preset time period according to the latter sampling rate to obtain corresponding index sampling data; carrying out aggregation calculation on index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index until the calculation result obtained by carrying out aggregation calculation on index sampling data obtained by randomly sampling according to the corrected sampling rate meets the preset requirement; wherein the initial sample rate is a first sample rate.
Specifically, the architecture of an individual sampling calculation module is shown in FIG. 5. The module obtains data from the data storage module and, at the same time, communicates with the coordination manager; according to the sampling rate given by the coordination manager it determines how much of the data currently being processed is to be retained, and performs the corresponding aggregate calculation. Different data indexes and different sampling rates affect calculation accuracy differently, so a uniform sampling rate cannot be used.
Correcting the previous sampling rate in a preset manner to obtain the next sampling rate comprises: halving the previous sampling rate iteratively to obtain the next sampling rate (that is, the sampling rate is determined by halving-iteration trial calculation).
For the index sampling data of each sampling rate, the proportion of abnormal index data in the index sampling data is calculated, together with the abnormal data proportion error corresponding to that sampling rate. The abnormal data proportion error is the difference between the proportion of abnormal index data in the preset time period when sampling is used and the proportion when sampling is not used, the latter being the proportion of all abnormal index data in the preset time period.
When the abnormal data proportion error corresponding to the previous sampling rate is smaller than a preset error threshold and the abnormal data proportion error corresponding to the next sampling rate is larger than the preset error threshold, the previous sampling rate is taken as the final sampling rate of the index.
For example, with an initial sampling rate of 50%, the coordination manager takes a small batch of data and performs trial calculations at 50% and, after halving, at 25%. If the calculation error between the two is acceptable, 25% is halved again and rounded down (the fractional part is discarded), giving a sampling rate of 12% for the next trial calculation, whose result is compared with that of the small batch sampled at 50%. This continues until the error at some sampling rate exceeds the acceptable error set in advance, at which point the iteration stops. The sampling rate with the largest error that is still acceptable is taken as the final sampling rate.
The acceptable error is generally set in advance and differs from index to index, depending on what the business can tolerate. For example, for the "second rate" of video play, which is a proportion-type index, the error acceptable to the business is 0.03%, and the sampling rate for this index can be set to 6% (6 parts are taken out of every 100).
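The trial procedure in this example can be sketched end to end in Python. Several points are assumptions of the sketch rather than statements of the embodiment: each record is kept independently with the given probability, the error is measured against the unsampled batch (as in step S1031) rather than against the 50% sample, and the names `bernoulli_sample` and `trial_final_rate` are hypothetical.

```python
import random


def bernoulli_sample(records, rate_percent, rng):
    """Keep each record independently with probability rate_percent / 100."""
    return [r for r in records if rng.random() * 100 < rate_percent]


def trial_final_rate(batch, is_abnormal, max_error, initial_rate=50, seed=0):
    """Halve the rate until the abnormal-proportion error exceeds max_error.

    Returns the last rate whose error was still acceptable, or None if
    even the initial rate was not acceptable.
    """
    rng = random.Random(seed)

    def proportion(records):
        return sum(map(is_abnormal, records)) / len(records) if records else 0.0

    baseline = proportion(batch)          # proportion without sampling
    accepted = None
    rate = initial_rate
    while rate >= 1:
        sample = bernoulli_sample(batch, rate, rng)
        error = abs(proportion(sample) - baseline)
        if error > max_error:
            break                         # this rate is too coarse: stop
        accepted = rate
        rate //= 2                        # halve and discard the fraction
    return accepted
```

Because the sample is random, the rate returned for a given batch depends on the seed; a production system would presumably repeat the trial on several batches before fixing the rate.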
The sampling rate setting principle is as follows:
(1) The initial sampling rate is 50%.
(2) During trial calculation the sampling rate is halved repeatedly: it is divided by two and rounded down (the fractional part is discarded).
(3) The sampling rate for each index is calculated iteratively against the error rate preset by the business, and is stored in the coordination manager.
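Principles (1) and (2) fix the sequence of candidate sampling rates. A short Python sketch, with a hypothetical function name, makes the sequence explicit:

```python
def halving_schedule(initial_rate=50):
    """Yield candidate sampling rates in percent: halve, discard the fraction."""
    rate = initial_rate
    while rate >= 1:
        yield rate
        rate //= 2


print(list(halving_schedule()))  # [50, 25, 12, 6, 3, 1]
```

The 12%, 6% and 3% rates mentioned elsewhere in this description all appear in this sequence.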
The sampling rate is determined in a trial calculation mode, so that the method can be flexibly adapted to the calculation precision of various indexes.
The sampling rate is stored in the coordination manager, so that each sampling calculation module can be ensured to be coordinated and consistent.
For example, with a sampling rate of 3%, each sampling calculation module, while sequentially reading the data files, randomly retains 3 parts of data out of every 100 for calculation.
Data not selected by the sampling is discarded directly and is not calculated, which greatly saves computing resources.
The 3% sampling rate above is only an example; in practical use the sampling rate differs according to the accuracy each business can accept.
If the business requires very high accuracy, this system is not suitable. In practice, normal business requirements can be met with a sampling rate of 12% or even higher.
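One way to read "3 parts out of every 100" is block-wise sampling: the stream is cut into blocks of 100 records and exactly 3 randomly chosen records are kept from each block. The Python sketch below follows that reading; the block interpretation, the function name `sample_blocks` and the handling of a short final block are assumptions.

```python
import random
from itertools import islice


def sample_blocks(records, rate_percent, block_size=100, rng=None):
    """Yield rate_percent records from every block of block_size records.

    Records that are not chosen are dropped and never reach the
    downstream calculation.
    """
    rng = rng or random.Random()
    iterator = iter(records)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        # Scale the quota for a short final block.
        keep = min(len(block), rate_percent * len(block) // block_size)
        for position in sorted(rng.sample(range(len(block)), keep)):
            yield block[position]
```

Sorting the chosen positions preserves the original record order, which keeps the retained data spread across the block rather than clustered.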
For all indexes, the index data of the preset time period acquired from the distributed storage module, and the index sampling data obtained by randomly sampling that index data, are pushed at a specified time interval. Specifically, while sampling proceeds, a batch of data is sent to the computing module at the specified interval (for example every 10 seconds; the interval is configurable through the coordination manager according to business needs and defaults to 10 seconds). The computing module performs its calculations on sampled data only. The computing module encapsulates a distributed computing engine, implemented here with Spark Streaming.
Coordination manager: the overall control core of the whole big data system. It coordinates the sampling calculation modules, each of which acquires data from the data storage module, samples it at the set sampling rate and executes the corresponding big data operations.
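The interval-based pushing can be pictured as grouping sampled records into fixed time windows. The sketch below is plain Python and does not use the Spark Streaming API; the `(timestamp, payload)` record shape and the function name are assumptions for illustration.

```python
from collections import defaultdict


def batch_by_interval(records, interval_seconds=10):
    """Group (timestamp_seconds, payload) pairs into fixed-width windows.

    Returns a list of (window_start, payloads) sorted by window start.
    """
    batches = defaultdict(list)
    for timestamp, payload in records:
        window_start = int(timestamp // interval_seconds) * interval_seconds
        batches[window_start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]
```

Each returned batch corresponds to what would be handed to the computing module for one push interval.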
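The role of the coordination manager as the single source of sampling rates can be sketched as a small thread-safe registry. The class and method names are hypothetical, and a real deployment would presumably expose this over the network so that every sampling calculation module reads the same value.

```python
import threading


class CoordinationManager:
    """Holds the per-index sampling rate shared by all sampling modules."""

    def __init__(self, initial_rate=50, push_interval_seconds=10):
        self.initial_rate = initial_rate                  # percent
        self.push_interval_seconds = push_interval_seconds
        self._rates = {}
        self._lock = threading.Lock()

    def set_final_rate(self, index_name, rate_percent):
        if not 0 < rate_percent <= 100:
            raise ValueError("rate_percent must be in (0, 100]")
        with self._lock:
            self._rates[index_name] = rate_percent

    def rate_for(self, index_name):
        """Final rate if one was stored, otherwise the initial rate."""
        with self._lock:
            return self._rates.get(index_name, self.initial_rate)
```

Storing the rate in one place is what keeps the sampling calculation modules consistent with each other.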
The accuracy of the calculation result is maintained by the following means:
(1) The calculated index is preferably a proportion-type index whose numerator and denominator come from the same log and are sampled at the same sampling rate.
(2) Sampling must be uniform, so that even the smallest possible time interval still contains data; for example, sampling must not leave a given second, or several consecutive seconds, with no data.
(3) The sampling rate is determined by trial calculation performed by the coordination manager.
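The reason point (1) favors proportion-type indexes can be shown with a short calculation. Assume, for illustration only, that each log record is kept independently with probability $p$, that $N$ records make up the denominator and $A$ of them the numerator, and that $\hat{N}$ and $\hat{A}$ are the corresponding counts after sampling (these symbols do not appear in the original text):

```latex
\mathbb{E}[\hat{A}] = pA, \qquad
\mathbb{E}[\hat{N}] = pN, \qquad
\frac{\hat{A}}{\hat{N}} \approx \frac{pA}{pN} = \frac{A}{N}.
```

The sampling rate $p$ cancels, so for large samples the sampled proportion approximates the unsampled one without any rescaling, provided the same rate is applied to both numerator and denominator.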
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "including" is intended to be inclusive in a manner similar to the term "comprising," as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments of the invention may be implemented by electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation is not to be understood as beyond the scope of the embodiments of the present invention.
The various illustrative logical blocks or units described in the embodiments of the invention may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described. A general purpose processor may be a microprocessor, but in the alternative, the general purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a user terminal. In the alternative, the processor and the storage medium may reside as distinct components in a user terminal.
In one or more exemplary designs, the above-described functions of embodiments of the present invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store program code in the form of instructions or data structures and that may be read by a general-purpose or special-purpose computer or processor. Further, any connection is properly termed a computer-readable medium; for example, if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then those transmission media are also included in the definition of computer-readable medium. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included within computer-readable media.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention and is not meant to limit the scope of the invention to the particular embodiments described; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (8)
1. A big data sampling method based on distributed storage, comprising:
Storing big data comprising various index data in a distributed storage module, and setting an initial sampling rate for each index, to be used when index data is sampled from the distributed storage module;
For each index, acquiring index data of a preset time period from the distributed storage module, and randomly sampling the index data of the preset time period at the initial sampling rate to obtain corresponding index sampling data; correcting the previous sampling rate in a preset manner to obtain the next sampling rate, and randomly sampling the index data of the preset time period at the next sampling rate to obtain corresponding index sampling data; performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to that sampling rate; and repeating until the calculation result obtained by aggregation calculation on the index sampling data randomly sampled at a corrected sampling rate meets a preset requirement, whereupon the sampling rate corresponding to the calculation result meeting the preset requirement is taken as the final sampling rate of the index; wherein the initial sampling rate is the first sampling rate;
Sampling index data from the distributed storage module at the final sampling rate of the index;
Wherein, for each index, performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to that sampling rate, and, once the calculation result obtained by aggregation calculation on the index sampling data randomly sampled at a corrected sampling rate meets the preset requirement, taking the sampling rate corresponding to that calculation result as the final sampling rate of the index, comprises:
For the index sampling data of each sampling rate, calculating the duty ratio of the abnormal index data in the index sampling data relative to the index data of the preset time period, and calculating the abnormal data duty ratio error corresponding to that sampling rate; wherein the abnormal data duty ratio error is the difference between the duty ratio of abnormal index data in the index data of the preset time period when sampling is applied and the duty ratio of abnormal index data when sampling is not applied, and the duty ratio of abnormal index data when sampling is not applied is the duty ratio of all abnormal index data of the preset time period within the index data of the preset time period;
When the abnormal data duty ratio error corresponding to the previous sampling rate is smaller than a preset error threshold and the abnormal data duty ratio error corresponding to the next sampling rate is larger than the preset error threshold, taking the previous sampling rate as the final sampling rate of the index.
2. The big data sampling method based on distributed storage according to claim 1, wherein storing big data comprising various index data in a distributed storage module comprises:
using a packaged HDFS system to write the big data sequentially into different distributed storage modules within the packaged HDFS system.
3. The distributed storage-based big data sampling method according to claim 1, wherein correcting the previous sampling rate in a preset manner to obtain the next sampling rate comprises: correcting the previous sampling rate by iteratively halving it to obtain the next sampling rate.
4. The distributed storage-based big data sampling method according to claim 1, further comprising:
For all indexes, pushing, at a specified time interval, the index data of the preset time period acquired from the distributed storage module and the index sampling data obtained by randomly sampling the index data of the preset time period;
wherein performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to that sampling rate specifically comprises:
After receiving the index data of the preset time period acquired from the distributed storage module and the index sampling data randomly sampled from the index data of the preset time period, performing aggregation calculation on the index sampling data corresponding to the sampling rate to obtain the calculation result corresponding to that sampling rate.
5. A big data sampling system based on distributed storage, comprising:
A data storage unit, configured to store big data comprising various index data in a distributed storage module;
A coordination manager, configured to set the initial sampling rate of each index, to be used when index data is sampled from the distributed storage module;
A sampling rate calculation unit, configured to: for each index, acquire index data of a preset time period from the distributed storage module, and randomly sample the index data of the preset time period at the initial sampling rate set by the coordination manager to obtain corresponding index sampling data; correct the previous sampling rate in a preset manner to obtain the next sampling rate, and randomly sample the index data of the preset time period at the next sampling rate to obtain corresponding index sampling data; perform aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to that sampling rate; and repeat until the calculation result obtained by aggregation calculation on the index sampling data randomly sampled at a corrected sampling rate meets a preset requirement, whereupon the sampling rate corresponding to the calculation result meeting the preset requirement is taken as the final sampling rate of the index; wherein the initial sampling rate is the first sampling rate;
A sampling unit, configured to sample index data from the distributed storage module at the final sampling rate of the index;
wherein the sampling rate calculation unit comprises:
A sampling rate verification subunit, configured to calculate, for the index sampling data of each sampling rate, the duty ratio of the abnormal index data in the index sampling data relative to the index data of the preset time period, and to calculate the abnormal data duty ratio error corresponding to that sampling rate; wherein the abnormal data duty ratio error is the difference between the duty ratio of abnormal index data in the index data of the preset time period when sampling is applied and the duty ratio of abnormal index data when sampling is not applied, and the duty ratio of abnormal index data when sampling is not applied is the duty ratio of all abnormal index data of the preset time period within the index data of the preset time period;
a sampling rate determining subunit, configured to take the previous sampling rate as the final sampling rate of the index when the abnormal data duty ratio error corresponding to the previous sampling rate is smaller than the preset error threshold and the abnormal data duty ratio error corresponding to the next sampling rate is larger than the preset error threshold.
6. The distributed storage based big data sampling system of claim 5, wherein the data storage unit comprises:
a packaged HDFS system, configured to write big data sequentially into different distributed storage modules within the packaged HDFS system.
7. The distributed storage based big data sampling system of claim 5, wherein the sample rate calculation unit comprises:
a sampling rate correction subunit, configured to correct the previous sampling rate in a preset manner to obtain the next sampling rate, wherein correcting the previous sampling rate in the preset manner comprises iteratively halving the previous sampling rate to obtain the next sampling rate.
8. The distributed storage based big data sampling system of claim 5, further comprising a data pushing unit, the sample rate calculation unit comprising an aggregate calculation subunit, wherein:
The data pushing unit is configured to push, for all indexes and at a specified time interval, the index data of the preset time period acquired from the distributed storage module and the index sampling data obtained by randomly sampling the index data of the preset time period;
The aggregation calculation subunit is configured to, after receiving from the data pushing unit the index data of the preset time period acquired from the distributed storage module and the index sampling data randomly sampled from the index data of the preset time period, perform aggregation calculation on the index sampling data corresponding to the sampling rate to obtain the calculation result corresponding to that sampling rate.
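The selection criterion recited in claims 1 and 5 rests on the abnormal data duty ratio error. The sketch below shows one way this quantity might be computed. It reads "duty ratio" as the proportion of abnormal records and takes the error as the absolute difference between the sampled and unsampled proportions; both readings are interpretations of the claim wording, not statements from the patent. It also assumes an integer-percentage sampling rate and an in-memory list in place of the distributed storage module. The names `abnormal_ratio`, `ratio_error`, `too_slow`, and `latencies` are hypothetical.

```python
import random
from typing import Callable, Sequence


def abnormal_ratio(records: Sequence[float],
                   is_abnormal: Callable[[float], bool]) -> float:
    """Proportion of records flagged abnormal (0.0 for an empty input)."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_abnormal(r)) / len(records)


def ratio_error(records: Sequence[float],
                rate_percent: int,
                is_abnormal: Callable[[float], bool],
                rng: random.Random) -> float:
    """Absolute gap between the abnormal proportion in a random sample
    and the abnormal proportion in the full (unsampled) data."""
    if not records:
        return 0.0
    # At least one record is always drawn (a choice made for this sketch).
    sample_size = max(1, len(records) * rate_percent // 100)
    sample = rng.sample(list(records), sample_size)
    return abs(abnormal_ratio(sample, is_abnormal)
               - abnormal_ratio(records, is_abnormal))


def too_slow(value: float) -> bool:
    """Hypothetical abnormality rule: values of 95 or more count as abnormal."""
    return value >= 95.0


rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
latencies = [float(i % 100) for i in range(10_000)]  # synthetic index data

full_error = ratio_error(latencies, 100, too_slow, rng)   # whole population
tenth_error = ratio_error(latencies, 10, too_slow, rng)   # 10 percent sample
```

At a rate of 100 percent the sample is the whole population, so `full_error` is zero. An error computed this way could be compared against the preset error threshold at each halved rate, stopping at the last rate whose error remains below the threshold.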
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111588216.0A CN114357069B (en) | 2021-12-23 | 2021-12-23 | Big data sampling method and system based on distributed storage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114357069A (en) | 2022-04-15 |
CN114357069B (en) | 2024-05-28 |
Family
ID=81102301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111588216.0A (Active, CN114357069B (en)) | Big data sampling method and system based on distributed storage | 2021-12-23 | 2021-12-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114357069B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107423433A (en) * | 2017-08-03 | 2017-12-01 | 聚好看科技股份有限公司 | A kind of data sampling rate control method and device |
WO2018027466A1 (en) * | 2016-08-08 | 2018-02-15 | 马岩 | Method and system for storing big data in distributed system |
CN113807396A (en) * | 2021-08-12 | 2021-12-17 | 华南理工大学 | Method, system, device and medium for detecting abnormality of high-dimensional data of Internet of things |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8578041B2 (en) * | 2005-06-03 | 2013-11-05 | Adobe Systems Incorporated | Variable sampling rates for website visitation analysis |
CN107133190A (en) * | 2016-02-29 | 2017-09-05 | 阿里巴巴集团控股有限公司 | The training method and training system of a kind of machine learning system |
Also Published As
Publication number | Publication date |
---|---|
CN114357069A (en) | 2022-04-15 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106407190B (en) | Event record query method and device | |
US11379687B2 (en) | Method for extracting feature string, device, network apparatus, and storage medium | |
US20150324135A1 (en) | Automatic storage system configuration based on workload monitoring | |
CN110471821B (en) | Abnormality change detection method, server, and computer-readable storage medium | |
CN110750529B (en) | Data processing method, device, equipment and storage medium | |
CN104811344A (en) | Network dynamic service monitoring method and apparatus | |
CN107070940B (en) | Method and device for judging malicious login IP address from streaming login log | |
CN106202280B (en) | Information processing method and server | |
CN109388550B (en) | Cache hit rate determination method, device, equipment and readable storage medium | |
CN107729375B (en) | Log data sorting method and device | |
CN109801693B (en) | Medical records grouping method and device, terminal and computer readable storage medium | |
CN111258593A (en) | Application program prediction model establishing method and device, storage medium and terminal | |
CN116501715B (en) | Real-time association updating method and device for multi-table full data | |
CN108228679B (en) | Time series data metering method and time series data metering device | |
CN114357069B (en) | Big data sampling method and system based on distributed storage | |
CN111913913B (en) | Access request processing method and device | |
CN112861128B (en) | Method and system for identifying machine account numbers in batches | |
CN106294457B (en) | Network information pushing method and device | |
CN116070958A (en) | Attribution analysis method, attribution analysis device, electronic equipment and storage medium | |
CN112149036A (en) | Method and system for identifying batch abnormal interaction behaviors | |
CN108984101B (en) | Method and device for determining relationship between events in distributed storage system | |
CN111158994A (en) | Pressure testing performance testing method and device | |
CN116776310B (en) | Automatic user account identification method and device, computer equipment and storage medium | |
CN117874069B (en) | Real-time big data rapid query analysis method and device | |
CN115328923B (en) | Storage structure, query method, storage medium and system of time sequence physiological data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||