CN104935348B - Summary data compression method with controllable estimation error - Google Patents

Summary data compression method with controllable estimation error

Info

Publication number
CN104935348B
CN104935348B (application number CN201510254377.4A)
Authority
CN
China
Prior art keywords
time
data
sample
summary data
tracker
Prior art date
Legal status
Active
Application number
CN201510254377.4A
Other languages
Chinese (zh)
Other versions
CN104935348A (en)
Inventor
Wu Guangjun (吴广君)
Yun Xiaochun (云晓春)
Wang Shupeng (王树鹏)
Current Assignee
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201510254377.4A
Publication of CN104935348A
Application granted
Publication of CN104935348B

Landscapes

  • Information Retrieval; Database Structures and File-System Structures Therefor (AREA)

Abstract

The invention discloses a summary data compression method with controllable estimation error. The method is as follows. 1) A time tracker is established for the summary data of each object; for summary data to be written, the tracker of the corresponding object is located, and the tracker samples the summary data and stores the samples in its sample set. 2) The samples in each tracker's sample set are partitioned into multiple time phases and error parameters are set for the phases; each tracker then resamples its samples according to the corresponding error parameters. 3) The processed sample sets are merged into one sample set H; the samples in H are partitioned into multiple time phases and written, after resampling with the corresponding error parameters, into the sample set of a new time tracker. After compression by the present invention, the summary data not only gains storage space linearly, but still supports error-bounded approximate computation.

Description

Summary data compression method with controllable estimation error
Technical field
The invention belongs to the field of information technology. Against the background of the ever-growing summary data on which approximate-query systems rely in big-data environments, it proposes a summary data compression method with controllable error.
Background art
Big-data analysis and processing technologies are now widely used across industries: by analyzing massive in-industry data resources, they deliver timely and reliable decisions to upper-layer services. Approximate computation is an important technique in big-data analysis systems. Because it needs far less summary data than the raw data while still providing high-precision approximate results, it has been widely adopted in applications that tolerate a certain error, such as microblog statistics for large microblogging sites, clickstream statistics for shopping sites, and transaction-log stream statistics; in these systems approximate computation not only copes with massive data scales but also provides highly real-time decision support for upper-layer services. Approximate computation has further been applied to real-time sentiment analysis of networks (H. Wang, D. Can, A. Kazemzadeh, F. Bar, and S. Narayanan, "A system for real-time twitter sentiment analysis of 2012 u.s. presidential election cycle," in Proceedings of the ACL 2012 System Demonstrations, ser. ACL '12. Stroudsburg, PA, USA: Association for Computational Linguistics, 2012, pp. 115-120), prediction of economic indices (T. Preis, H. S. Moat, and E. H. Stanley, "Quantifying trading behavior in financial markets using Google Trends," Sci. Rep., vol. 3, p. 1684, 2013), and real-time intrusion detection (X. Yun, Y. Wang, Y. Zhang, and Y. Zhou, "A semantics-aware approach to the automated network protocol identification," Networking, IEEE/ACM Transactions on, vol. PP, no. 99, pp. 1-1, 2015).
Approximate-query systems in big-data environments, however, face constantly expanding summary data: as the data scale grows rapidly, so does the scale of the summary data on which approximate queries rely. This creates a conflict between estimation precision and summary-data volume: the higher the approximate estimation precision, the larger the summary data that must be stored. Recently proposed big-data approximate-computation techniques, such as approximate top-k computation (J. Jestes, J. M. Phillips, F. Li, and M. Tang, "Ranking large temporal data," Proc. VLDB Endow., vol. 5, no. 11, pp. 1412-1423, Jul. 2012), approximate range-sum computation (X. Yun, G. Wu, G. Zhang, K. Li, and S. Wang, "Fastraq: A fast approach to range-aggregate queries in big data environments," Cloud Computing, IEEE Transactions on, vol. PP, no. 99, pp. 1-1, 2014), ordered-set sampling (E. Cohen, G. Cormode, and N. Duffield, "Structure-aware sampling: Flexible and accurate summarization," Proceedings of the VLDB Endowment, vol. 4, no. 11, 2011), and sliding-window techniques (M. Datar, A. Gionis, P. Indyk, and R. Motwani, "Maintaining stream statistics over sliding windows: (extended abstract)," in Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA '02, 2002, pp. 635-644), all ignore the capacity of the summary data. When old and new data share a single error parameter, obtaining high-precision estimates requires setting a low error parameter and therefore maintaining larger summary data; yet for long-lived and rarely used summary data, keeping such large summaries clearly wastes space. Another line of solutions stores summary data on high-speed media such as SSDs, enlarging capacity and improving access efficiency, but this is costly and still leaves the conflict between the estimation precision of different summaries and the summary-data volume in big-data environments unresolved.
Summary of the invention
In view of the technical problems of the prior art, the object of the present invention is to provide a summary data compression method with controllable estimation error. Based on the freshness sensitivity of big data, the invention proposes an error-bounded summary data compression method. The freshness sensitivity of big data can be described as follows: any object in big data arrives at high speed at some point in time, then begins to spread among related topics, gradually decays after days or weeks, and finally dies out. Combining this characteristic of big data, the invention compresses long-lived and rarely used objects along the two dimensions of key and time; the compressed summary data not only gains storage space linearly, but still supports error-bounded approximate computation.
Big data generally carries the two attribute dimensions of key and time, and the invention correspondingly proposes two summary data compression methods, one based on the key dimension and one based on the time dimension. Compression starts from the summary data that is rarely used and has existed for a long time; the compressed summary data preserves the computation logic of the original summary data and yields a linear gain in storage space. Further, the invention details the compression method and process of summary data on the basis of deterministic wave sampling.
The invention compresses the summary data structure of data sources whose items have the form <key, time, value>; the key points are summarized as follows:
1) A compression method for summary data in the key dimension is proposed. Compression in the key dimension compresses the summary data corresponding to multiple different keys in a set into the summary data of a single structure. The compressed summary data can provide, for any key in the set, an estimate with the set's average error; this realizes set-oriented summary data compression.
2) A compression method for summary data in the time dimension is proposed. Compression in the time dimension exploits the time sensitivity of big data: new data is given a lower relative error and old data a larger relative error, so that new data enjoys higher computation precision while long-lived summary data has lower estimation precision, with automatic transition between old and new data according to the configured error parameters.
3) A concrete application of the above compression is given on the basis of deterministic wave sampling, describing the summary data structure, the specific compression processes in the key and time dimensions, and the maintenance of the summary data, effectively resolving the conflict between estimation precision and data volume of summary data in real streaming big-data environments.
The technical solution of the invention is as follows:
A summary data compression method with controllable estimation error, whose steps are:
1) Establish a time tracker for the summary data of each object. For summary data to be written, locate, according to the summary data, the tracker of the corresponding object; the tracker then samples the summary data with an error-bounded sampling method and stores the samples in its sample set.
2) Partition the samples in the sample set of each tracker tracker_i into multiple time phases along the time dimension and set an error parameter for each phase; tracker_i then resamples the samples of the (i-1)-th time phase according to the error parameter ξ_i corresponding to the i-th time phase.
3) Merge the sample sets processed in step 2) into one sample set H, then write the samples in H, in time order, into the sample set of a new tracker tracker_new. During writing, tracker_new partitions the samples of H into multiple time phases along the time dimension and sets an error parameter for each phase; it then resamples the samples of each phase according to the phase's error parameter.
Further, the error parameter of each time phase is set as follows: if the error of the j-th time phase phase_{i,j} in the sample set of the i-th tracker tracker_i is ξ_{i,j}, then ξ_{i,j} = r^h · ξ, where r is the compression parameter with r > 1, ξ is the error parameter of the first time phase, and h = (Tsmax - StartTs)/TL, with TL the length of phase_{i,j}'s time interval, Tsmax the maximum timestamp of tracker_i, and StartTs the start time of phase_{i,j}.
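A minimal sketch of this per-phase error computation (Python; names are illustrative, and phase boundaries are assumed to be aligned to multiples of TL so that h comes out integral):

```python
def phase_error(r: float, xi: float, ts_max: float, start_ts: float, tl: float) -> float:
    """xi_{i,j} = r**h * xi, with h = (Tsmax - StartTs) / TL.

    r        -- compression parameter, r > 1
    xi       -- error parameter of the first (newest) time phase
    ts_max   -- maximum timestamp seen by this tracker (Tsmax)
    start_ts -- start time of the phase being considered (StartTs)
    tl       -- length TL of each phase's time interval
    """
    h = (ts_max - start_ts) / tl  # distance of the phase from the newest data
    return (r ** h) * xi

# Example: r = 2, xi = 0.05, phase starting 4 intervals before Tsmax:
# phase_error(2.0, 0.05, ts_max=1000.0, start_ts=600.0, tl=100.0) -> 2**4 * 0.05 = 0.8
```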
Further, the error parameter of each time phase is identical.
Further, the structure of the sample data in tracker_i's sample set is <N, value, Ts>, where N is the aggregate value of all data written so far, i.e. N = Σ_i value_i, with value_i the value field of the i-th summary datum; value is the numeric field of the summary data used for statistics; and Ts marks the time at which the data was produced.
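A sketch of this record in Python (field and function names are illustrative; the running aggregate N is carried forward on append):

```python
from dataclasses import dataclass

@dataclass
class SampleRecord:
    n: float      # N: aggregate of all values written so far, N = sum_i value_i
    value: float  # value: the numeric field of this data item used for statistics
    ts: float     # Ts: the time at which the data item was produced

def append_record(samples: list, value: float, ts: float) -> None:
    """Append a new record, maintaining the running aggregate N."""
    n_prev = samples[-1].n if samples else 0.0
    samples.append(SampleRecord(n=n_prev + value, value=value, ts=ts))
```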
Further, the number of samples maintained in each level is bounded by a value m' determined by the error parameter; samples whose position exceeds m' are discarded directly.
Further, tracker_i samples according to the N values using the deterministic wave sampling method, and places the obtained sample data, in levels and in increasing timestamp order, into tracker_i's sample set.
Further, the sampling method is one of: deterministic wave sampling, random wave sampling, random sampling, or exponential-histogram sampling.
Compared with the prior art, the beneficial effects of the present invention are as follows:
1) An error-bounded summary data compression method is proposed according to the freshness sensitivity of big data. In big-data environments, maintaining high-precision approximate computation usually requires storing and managing summary data of enormous scale. Exploiting the freshness sensitivity of big-data applications, the invention starts from long-lived and rarely used summary data and applies error-bounded compression in the two dimensions of key and time. Compared with traditional approximate-computation systems and summary-maintenance methods, it treats sample data differentially while respecting precision requirements, effectively supporting high-precision approximate computation while markedly reducing the overhead of storing and maintaining summary data.
2) Taking the concrete FS-Sketch as an example, a specific summary data compression and maintenance method is given. The compressed summary data retains the application characteristics of the original summary data, while the error parameter of the approximate results remains controllable. Compared with existing methods, it is better suited to applications over streaming big data with freshness-sensitive characteristics, effectively supporting approximate computations such as point queries and interval queries.
Brief description of the drawings
Fig. 1 is a schematic diagram of the compound summary organizational structure supporting compression;
Fig. 2 is the time-based summary data compression workflow;
Fig. 3 is the relation between summary-data volume and relative error during compression;
Fig. 4 compares the time consumption of summary data compression under different error settings.
Embodiment
To make the purpose, technical solution, and advantages of the present invention clearer, the summary data compression method according to one embodiment of the invention is further described below with reference to the accompanying drawings.
Big-data applications generally combine high data throughput with massive data scale. In the present invention the basic format of an object's data item is Object: <key, value, Ts>, where key is the unique identifier of the data item, value is the numeric attribute used for statistics, and Ts marks the time at which the data was produced, i.e. the object's time attribute.
The summary data structure is the core data structure of an approximate-computation system. It is an online data structure built and maintained in real time from the data sources loaded into the big-data system, and it supports query patterns such as point queries (e.g. sum, count) and interval queries. The invention first describes a concrete summary data structure (FS-Sketch) based on deterministic wave sampling, and then introduces the error-bounded summary compression method on this structure. Although the compression method is presented on a specific summary structure as a case study, it can be generalized to other, more general summary data structures.
1. Compound summary data structure supporting compression and its maintenance
The invention first provides a compound summary data structure that can effectively manage the two dimensions of key and time. Specifically, the compound structure establishes a time tracker for the summary data of each object; the tracker records, by sampling, the object's value attribute at different points in time. A typical two-dimensional compound summary data structure is shown in Fig. 1.
On a write, the key of the data item is hashed to locate the specific tracker, and the data is then sampled inside the tracker by the error-bounded sampling method. On a query, the same hash algorithm locates the tracker, estimation is performed over the tracker's sample set, and the approximate result is returned. A compound summary structure with these characteristics can effectively manage data in both the time and key dimensions, and constitutes an open, basic summary data structure: different sampling methods and hash algorithms can be plugged into it to support the corresponding approximate-computation operations. The invention introduces the summary compression method on this basic structure.
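A minimal sketch of this write/query routing (Python; the Tracker stub, hash choice, and bucket count are illustrative assumptions, and the stub omits the error-bounded sampling):

```python
import hashlib

class Tracker:
    """Stand-in for a per-object time tracker; a real tracker samples on insert."""
    def __init__(self):
        self.samples = []   # (N, value, Ts) records, in increasing Ts order
        self.total = 0.0
    def insert(self, value: float, ts: float) -> None:
        self.total += value
        self.samples.append((self.total, value, ts))  # would be sampled, not kept whole
    def estimate(self) -> float:
        return self.samples[-1][0] if self.samples else 0.0  # latest aggregate N

class CompoundSketch:
    """Routes each data item <key, value, Ts> to its tracker via a hash of key."""
    def __init__(self, num_buckets: int = 1024):
        self.buckets = [Tracker() for _ in range(num_buckets)]
    def _locate(self, key: str) -> Tracker:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)  # any stable hash works
        return self.buckets[h % len(self.buckets)]
    def write(self, key: str, value: float, ts: float) -> None:
        self._locate(key).insert(value, ts)
    def query(self, key: str) -> float:
        return self._locate(key).estimate()  # same hash on the query path
```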
2. Time-dimension summary data compression
Time-dimension compression further divides the sample set of each tracker in Fig. 1, along the time dimension, into multiple time phases (phases), each covering a time interval of a fixed length TL. Different time phases can be given different error parameters. For freshness-sensitive applications, newly arrived data is generally of high value, while long-lived data, after a certain decay period, is generally of low application value. Newly arrived data is written into the first time phase and sampled with error parameter ξ; when the time span covered by the samples of the first phase reaches the threshold TL, a new phase is created with the same error parameter ξ to receive the newly written data. Without compression, the system would have to maintain all sample data at error parameter ξ. The core idea of time-dimension compression is that, according to the configured error parameters, a certain amount of sample data is deleted from the sample sets of old time phases, improving storage efficiency while keeping the estimation error bounded.
Let the compression parameter of the summary data be r, with r > 1, and let the i-th time phase be phase_i with time range [StartTs_i, EndTs_i]. For a given phase_i, a certain amount of sample data is removed so that its error parameter becomes r^{h-1}·ξ, where h is the distance of the phase from the first time phase and ξ is the error parameter of the first time phase. The compression process with compression parameter r is shown in Fig. 2.
Since data writes always land in the first time phase, summary compression can run concurrently with writes, so the method effectively improves the write efficiency of streaming big data. A sketch of the phase-by-phase compression follows.
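A minimal sketch of the time-dimension compression (Python; resample_to is a naive uniform thinning used only as a placeholder for the error-bounded resampling of whatever sampling method is plugged in, and all names are illustrative):

```python
def resample_to(phase: list, base_error: float, target_error: float) -> None:
    """Placeholder: thin the phase roughly in proportion to base_error / target_error.
    A real implementation would resample with the error-bounded method in use
    (e.g. deterministic wave sampling)."""
    step = max(1, round(target_error / base_error))
    phase[:] = phase[::step]  # keep every step-th sample

def compress_time_dimension(phases: list, r: float, xi: float) -> None:
    """phases[0] is the newest phase (distance h = 1, error r**0 * xi = xi);
    the phase at distance h is thinned until its error parameter is r**(h-1) * xi."""
    for idx, phase in enumerate(phases):
        h = idx + 1
        if h == 1:
            continue  # the newest phase keeps the base error xi and receives writes
        resample_to(phase, xi, (r ** (h - 1)) * xi)
```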
3. Key-dimension summary data compression
Key-dimension compression groups multiple trackers to be compressed into a set and then writes the sample data recorded in each tracker of the set, in time order, into a newly established tracker. The write follows the summary-update principle, i.e. the data is resampled with a given error parameter; all newly written samples can further be organized, in time order, into multiple time phases, so as to support the time-dimension compression of step (2).
Let the keys to be merged be key_1 ~ key_n, with corresponding tracker set {tracker_i | 1 ≤ i ≤ n}, and let the newly created tracker be tracker_new with error parameter ξ_new. The sample data recorded in {tracker_i | 1 ≤ i ≤ n} is written, in time order, into tracker_new; during the write the data follows the compound summary data structure given in section (1).
This yields the key-based compressed summary data structure. We now give the estimation-error bound of the compressed structure. If every tracker in the set {tracker_i | 1 ≤ i ≤ n} to be merged has the same error ξ, and the error of the newly created tracker is ξ_new, then the error of the compressed summary data is ξ + ξ_new + ξ·ξ_new, which is approximately equal to ξ + ξ_new.
When ξ = ξ_new, the compressed summary data provides, for the aggregate value of the merged {tracker_i | 1 ≤ i ≤ n}, approximate estimates with twice the original error.
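The bound follows from the standard composition of multiplicative relative errors; written out (this derivation is supplied here for clarity and is consistent with the figures above):

```latex
\begin{aligned}
\widehat{V} &\in \bigl[(1-\xi)(1-\xi_{\mathrm{new}})\,V,\;(1+\xi)(1+\xi_{\mathrm{new}})\,V\bigr],\\[2pt]
(1+\xi)(1+\xi_{\mathrm{new}}) - 1 &= \xi + \xi_{\mathrm{new}} + \xi\,\xi_{\mathrm{new}}
\;\approx\; \xi + \xi_{\mathrm{new}}
\qquad (\xi\,\xi_{\mathrm{new}} \text{ is second order}),\\[2pt]
\xi = \xi_{\mathrm{new}} &\;\Longrightarrow\; \text{combined error} \approx 2\xi .
\end{aligned}
```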
Specific embodiment
The above gives the error-bounded summary data compression process. The method was presented on an open compound summary structure, and the method and its conclusions apply to several summary data structures in common use today, such as those based on deterministic wave sampling, random wave sampling, random sampling, and exponential histograms. To highlight the usability of the method, an embodiment is given below based on the concrete deterministic wave sampling algorithm.
(1) Building the compound summary data structure
It is the method for sampling of generally use in flow data sliding window to determine ripple Sampling techniques.Needed in each tracker Record data item<N,value,Ts>, wherein N is the value of current all write-in data polymerizing value, i.e., Ts is the time caused by flag data.Determine that ripple is sampled according to N, the sample data of acquisition is that layering is placed.If sample is put The number of plies level put is ln, and the data newly write are subsequently placed in ln layers so that timestamp is incremental.
(2) Time-dimension compression of the summary data
In deterministic wave sampling, each level maintains a bounded number of samples; when the sample data exceeds this bound, the samples with the oldest timestamps in that level are discarded. Following the time-attribute-based compression method, a new error parameter ξ' determines the maximum sample count m' of the current level; samples whose position exceeds m' are discarded directly, realizing error-bounded sample compression. As the computation shows, when the sample volume is large this process yields a linear gain in data storage space.
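A sketch of this per-level truncation (Python). The per-level capacity formula is not legible in the source text, so the usual deterministic-wave-style bound m(ξ) = ⌈1/(2ξ)⌉ + 1 is assumed here purely as an illustration and should be replaced by the patent's actual formula:

```python
import math

def level_capacity(eps: float) -> int:
    """Assumed per-level sample bound; substitute the actual formula from the patent."""
    return math.ceil(1.0 / (2.0 * eps)) + 1

def compress_level(level: list, eps_new: float) -> None:
    """Shrink a level to the capacity m' implied by the coarser error eps_new.

    Samples are stored in increasing-timestamp order, so discarding the
    samples beyond position m' means dropping the oldest ones first."""
    m_new = level_capacity(eps_new)
    if len(level) > m_new:
        del level[: len(level) - m_new]  # keep only the newest m_new samples
```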
(3) Key-dimension compression of the summary data
For key-based compression, trackers that have long been inactive and have similar aggregate values can be selected to build the set to be compressed, {tracker_i | 1 ≤ i ≤ n}. The data in all these trackers is then replayed, ordered by the sample time attribute Ts, into a newly built tracker_new. tracker_new supports aggregate estimation over key_1 ~ key_n, with estimation error matching the analysis above.
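A sketch of this time-ordered replay (Python; each tracker's sample list is assumed to be sorted by Ts, sample tuples are (N, value, Ts) as above, and tracker_new.insert is assumed to resample with its own error parameter ξ_new):

```python
import heapq

def merge_trackers(trackers: list, tracker_new) -> None:
    """k-way merge of the per-tracker sample streams into tracker_new, ordered by Ts."""
    streams = (t.samples for t in trackers)              # each already sorted by Ts
    for n, value, ts in heapq.merge(*streams, key=lambda rec: rec[2]):
        tracker_new.insert(value, ts)                    # resampled at xi_new on insert
```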
Experimental data and conclusions
An experiment was designed according to the content of the invention as follows: the page views from the web-site access logs published by Wikipedia were used as the test data set; the experiment selected 8 days of web logs, nearly 90 GB of raw data. The experiment analyzes the time overhead of summary data compression and the storage-space gain after compression.
Fig. 3 gives the relation between summary-data volume and relative error. Under a test environment in which the 90 GB of test data produces 1 GB of summary data, summary-data space is expressed as the number of samples. During compression, as the relative error grows from 0.05 to 0.45, the number of samples falls from 14 M to 1.6 M, and relative error and sample volume exhibit an approximately linear relation. Fig. 4 further gives the time overhead of this process: summary data compression takes about 150 ms on average. The proposed summary data compression method can therefore run efficiently in approximate-computation systems in big-data environments, guaranteeing the computation error while improving storage-space utilization linearly.
The above is only an overview of the technical solution of the present invention and does not limit it. Those of ordinary skill in the art to which the invention belongs may make minor changes and modifications without departing from the spirit and scope of the invention, and all such changes fall within the protection scope of the invention. The protection scope of the present invention is therefore defined by the claims.

Claims (7)

1. A summary data compression method with controllable estimation error, whose steps are:
1) establishing a time tracker for the summary data of each object; for summary data to be written, locating, according to the summary data, the tracker of the corresponding object, the tracker then sampling the summary data with an error-bounded sampling method and storing the samples in its sample set;
2) partitioning the samples in the sample set of each tracker tracker_i into multiple time phases along the time dimension and setting an error parameter for each phase; tracker_i then resampling the samples of the (i-1)-th time phase according to the error parameter ξ_i corresponding to the i-th time phase;
3) merging the sample sets processed in step 2) into one sample set H, then writing the samples in H, in time order, into the sample set of a new tracker tracker_new; wherein, during writing, tracker_new partitions the samples of H into multiple time phases along the time dimension, sets an error parameter for each phase, and then resamples the samples of each phase according to the phase's error parameter.
2. The method of claim 1, wherein the error parameter of each time phase is set as follows: if the error of the j-th time phase phase_{i,j} in the sample set of the i-th tracker tracker_i is ξ_{i,j}, then ξ_{i,j} = r^h · ξ, where r is the compression parameter with r > 1, ξ is the error parameter of the first time phase, and h = (Tsmax - StartTs)/TL, with TL the length of phase_{i,j}'s time interval, Tsmax the maximum timestamp of tracker_i, and StartTs the start time of phase_{i,j}.
3. The method of claim 1, wherein the error parameter of each time phase is identical.
4. The method of claim 1, 2 or 3, wherein the structure of the sample data in tracker_i's sample set is <N, value, Ts>, where N is the aggregate value of all data written so far, i.e. N = Σ_i value_i, with value_i the value field of the i-th summary datum; value is the numeric field of the summary data used for statistics; and Ts marks the time at which the data was produced.
5. The method of claim 4, wherein the number of samples maintained in each level is bounded by a value m' determined by the error parameter, and samples whose position exceeds m' are discarded directly.
6. The method of claim 5, wherein tracker_i samples according to the N values using the deterministic wave sampling method, and places the obtained sample data, in levels and in increasing timestamp order, into tracker_i's sample set.
7. The method of claim 1, 2 or 3, wherein the sampling method is: deterministic wave sampling, random wave sampling, random sampling, or exponential-histogram sampling.
CN201510254377.4A 2015-05-18 2015-05-18 Summary data compression method with controllable estimation error Active CN104935348B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201510254377.4A | 2015-05-18 | 2015-05-18 | CN104935348B (en) Summary data compression method with controllable estimation error

Publications (2)

Publication Number Publication Date
CN104935348A (en) | 2015-09-23
CN104935348B (en) | 2018-01-05

Family

ID=54122344


Patent Citations (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
TWI241502B * | 2002-12-26 | 2005-10-11 | Ind Tech Res Inst | Real-time data compression apparatus for a data recorder
CN101499097A * | 2009-03-16 | 2009-08-05 | Zhejiang Gongshang University | Hash-table-based method for in-memory compression and storage of frequent patterns in data streams

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party

Title
Amrit Pal et al., "A Time Based Analysis of Data Processing on …," 2014 Sixth International Conference on Computational Intelligence and Communication Networks, 2015-03-26, pp. 608-612. *
Chong Zhihong et al., "Efficient Computation of k-Medians over Data Streams Under Memory …," Journal of Computer Science and Technology, vol. 21, no. 2, 2006-03-31, pp. 284-296. *
Li Lei et al., "A data-stream … based on a variable sliding window," Science Technology and Engineering, vol. 14, no. 9, 2014-03-31, pp. 221-226. *
Liu Chang, "Research and implementation of a synopsis-based data stream management system," China Masters' Theses Full-text Database, Information Science and Technology, no. 5, 2012-05-15, abstract and pp. 1-53. *

Also Published As

Publication number Publication date
CN104935348A (en) 2015-09-23


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant