CN102521406B - Distributed query method and system for complex task of querying massive structured data - Google Patents


Publication number
CN102521406B
Authority
CN
China
Prior art keywords
data
query
distributed
window
inquiry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110442091.0A
Other languages
Chinese (zh)
Other versions
CN102521406A (en)
Inventor
吴广君
李超
王树鹏
云晓春
王勇
李斌斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoxin electronic bill Platform Information Service Co., Ltd.
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201110442091.0A priority Critical patent/CN102521406B/en
Publication of CN102521406A publication Critical patent/CN102521406A/en
Application granted granted Critical
Publication of CN102521406B publication Critical patent/CN102521406B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention provides a distributed query method and a distributed query system for complex query tasks on massive structured data. The distributed query method comprises the following steps: receiving a query task from a user and decomposing the query task into a plurality of query subtasks; and, according to each of the plurality of query subtasks, concurrently querying the data in distributed storage in batches and returning the queried result sets in a distributed manner. Because the method queries in batches and keeps the intermediate result state, it fully takes into account the need of interface display applications for fast queries over small data volumes, while also accommodating the need for statistics over large result sets in statistical and analytical settings.

Description

Distributed query method and system for complex query tasks on massive structured data
Technical field
The present invention relates to a system and method for managing massive data in the field of information security, and more specifically to a query method and system, and a distributed data management method and system, for complex query tasks. It is mainly used in information security applications such as persisting network messages to storage and analyzing them, and compiling statistics on and analyzing massive log data.
Background art
In contemporary information security, data management is no longer limited to simple processing methods such as traditional data sampling and analysis. Instead, data is persisted through an efficient data storage system, which also supports complex functions such as subsequent statistics and analysis.
Because the relational databases in general use today are subject to consistency constraints, query methods and query systems based on relational databases suffer from low loading efficiency and slow retrieval when storing and querying massive data, and cannot scale out smoothly. To meet the demands of massive data storage and querying and to improve query efficiency, open-source distributed NoSQL databases (also called key-value databases) based on Hadoop, such as HBase and Hypertable, have been proposed. By relaxing consistency constraints, these databases increase the storage scale and data processing efficiency of the system. However, a Hadoop-based NoSQL database provides only a key-value query mode, in which the corresponding value or value range is looked up for a given key. It therefore cannot satisfy the need for statistical and analytical queries with complex conditions over massive structured data.
In the prior art, queries over massive structured data can use HIVE, a distributed data warehouse built on Hadoop, together with query methods and query systems based on HIVE, which support fairly complete complex SQL queries. Although HIVE supports complex SQL queries, it has the following shortcomings:
(1) HIVE returns the query result to the user only after it has found all records that satisfy the conditions, so if the result set is very large the user must wait a long time for the result. HIVE therefore has low real-time query efficiency and high latency, cannot achieve online data loading together with fast querying, and cannot serve query applications, such as interface display, that do not need a large result set.
(2) HIVE has no index; all of its query operations are executed by reading the raw data files, so its query efficiency is low.
(3) In HIVE, the user describes query rules in HQL (a query language similar to SQL). Although HQL can express fairly complex join queries such as equi-joins, its task decomposition is oriented mainly toward MapReduce, which requires repeated disk writes and reads while a query task executes. Its execution efficiency is therefore low, and it cannot be used directly to query stream record data.
(4) HIVE reads data from data files, so it does not support frequent, streaming loading of records, nor does it support loading data into a cache or querying cached data. Although prior-art storage systems can allocate a buffer structure to improve data loading efficiency, the buffered data can be queried only after it has been written to disk. In stream record applications, data is loaded into the system continuously and recently loaded data is used relatively frequently, so the traditional approach cannot meet the query demand.
Therefore, in the field of querying and managing massive structured data, there is an urgent need for a method and system that support complex query conditions and achieve fast querying.
Summary of the invention
The technical problem to be solved by the present invention is to provide a distributed query method and system for massive structured data that support complex SQL queries and achieve fast querying.
According to one aspect of the present invention, a distributed query method for massive structured data is proposed, comprising: step 1, receiving a query task sent by a user and decomposing the query task into a plurality of query subtasks; and step 2, according to each of the plurality of query subtasks, concurrently executing batch queries on the data in distributed storage, and returning the queried result sets in a distributed manner.
According to another aspect of the present invention, a distributed query processing system for massive structured data is proposed, comprising: a device for receiving a query task sent by a user and decomposing the query task into a plurality of query subtasks; and a device for concurrently executing batch queries on each group of data in distributed storage according to each of the plurality of query subtasks, and returning the queried result sets in a distributed manner.
The query method adopted by the present invention, which queries in batches and keeps the intermediate result state, fully takes into account the need for fast queries over small data volumes in interface display applications, while also accommodating the need for statistics over large result sets in statistical and analytical settings.
Brief description of the drawings
Fig. 1 is a flowchart of the distributed query method for massive structured data according to the first embodiment of the present invention.
Fig. 2 is a flowchart of querying data stored on hard disk in batches and returning result sets in batches, according to one example of the present invention.
Fig. 3 is a flowchart of executing batch queries and returning result sets in batches, according to another example of the present invention.
Fig. 4 is a flowchart showing the steps of aggregating the query result sets.
Fig. 5 is a schematic diagram of the dual sliding window structure and its working principle.
Fig. 6 is a flowchart of the distributed storage method for massive structured data according to the second embodiment of the present invention.
Fig. 7 is a flowchart of the distributed query method for massive structured data according to the second embodiment of the present invention.
Detailed description of the embodiments
In the distributed storage method and query method for massive structured data of the present invention, the data structure comprises two basic parts: a full sort index and the record data. The full sort index sorts all attribute values of the records in lexicographic order. The record data stores each record in order, row by row. The full sort index supports filter-type query conditions, such as the conditions in a WHERE clause.
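The relationship between a lexicographically sorted index and a filter condition can be sketched as follows. This is a minimal illustration only, not the patented implementation: the record layout, the attribute names `src` and `port`, and the helper functions are all hypothetical, and a sorted Python list with binary search stands in for whatever index structure the system actually uses.

```python
from bisect import bisect_left, bisect_right

# Hypothetical record data: one row per record, stored in load order.
records = [
    {"src": "10.0.0.7", "port": 80},
    {"src": "10.0.0.2", "port": 443},
    {"src": "10.0.0.7", "port": 22},
    {"src": "10.0.0.5", "port": 80},
]

def build_sorted_index(rows, attribute):
    """Return (value, row_id) pairs sorted lexicographically by the value's string form."""
    return sorted((str(row[attribute]), row_id) for row_id, row in enumerate(rows))

def lookup_equal(index, value):
    """Return the row ids whose indexed attribute equals `value`, found by binary search."""
    keys = [key for key, _ in index]
    lo = bisect_left(keys, value)
    hi = bisect_right(keys, value)
    return [row_id for _, row_id in index[lo:hi]]

# Equivalent in spirit to: SELECT * FROM records WHERE src = '10.0.0.7'
src_index = build_sorted_index(records, "src")
matching_rows = [records[i] for i in lookup_equal(src_index, "10.0.0.7")]
```

Because the index is sorted, an equality or range condition touches only a contiguous slice of the index instead of scanning every record.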
Before the present invention is described in detail, the concept of "batch query" is first defined. A batch query means that, for a query task with a large number of query results, the user chooses according to need either a single query that obtains a small part of the result set or repeated queries that obtain the entire result set.
The present invention is described below with reference to the drawings and specific embodiments.
Query methods for massive structured data usually rely on a distributed storage structure to query massive data.
Fig. 1 is a flowchart of the distributed query method for massive structured data according to an embodiment of the present invention. As shown in Fig. 1, the method is directed mainly at stream record data and comprises the following steps:
Step 1: receive the query task sent by the user and decompose it into a plurality of query subtasks.
When querying massive data, if the query task for a given query condition is executed serially across all storage devices, the overall computing power of the distributed system is not brought to bear. Therefore, to improve query efficiency for massive data in a distributed environment, the present invention decomposes the query task and sends the resulting subtasks to the storage devices for concurrent execution.
According to one embodiment of the present invention, the query task can be decomposed into a plurality of query subtasks according to partition-type query conditions, filter-type query conditions, or global statistics and analysis query conditions. A partition-type query condition performs a query at the level of data files and can be set according to the index type of the stored data. For example, in the present invention a clustered index is built on the time attribute of the stored data. (The basic data organization rule is that data is stored in blocks according to the time attribute, the blocks are kept in time order, and a B+Tree index is built to support unified lookup of data files, which gives fast file-level partition lookup by time.) The time attribute can then be chosen as the partition query condition, and the condition is executed by performing operations on the B+Tree index built on the time attribute. A filter-type query condition filters or matches the actual records in the target index files; this type of condition can be executed concurrently on multiple storage devices. A statistics and analysis query condition must be processed uniformly on the final result set to guarantee the correctness of the query semantics.
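The file-level lookup by time attribute can be illustrated with a small sketch. It assumes that each data block covers the interval from its own start time up to the next block's start time, and it uses binary search over a sorted list of start times as a stand-in for the B+Tree index; the file names and the 300-second block length are hypothetical.

```python
from bisect import bisect_right

# Hypothetical data blocks in time order; each covers [start, next start).
block_starts = [0, 300, 600, 900]
block_files = ["block_0000.dat", "block_0300.dat", "block_0600.dat", "block_0900.dat"]

def blocks_for_time_range(t_begin, t_end):
    """Return the files whose time span may overlap the closed range [t_begin, t_end]."""
    # Last block that starts at or before t_begin (clamped to the first block).
    first = max(bisect_right(block_starts, t_begin) - 1, 0)
    # One past the last block that starts at or before t_end.
    last = bisect_right(block_starts, t_end)
    return block_files[first:last]

# A query restricted to seconds 350..650 only needs two of the four files.
candidates = blocks_for_time_range(350, 650)
```

Only the files returned by the partition condition are then handed to the filter conditions, so most data files are never opened.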
Beyond the decomposition described above, those skilled in the art will understand that the query task can also be decomposed according to other types of query conditions, to improve query efficiency for massive data in a distributed environment.
By building, in the distributed system, a decomposition mechanism for query tasks with complex conditions and a scheduling mechanism for concurrent query subtasks, the present invention makes full use of the computing resources of the distributed environment and executes the query subtasks concurrently, improving query efficiency for massive structured data.
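The decomposition and concurrent scheduling described above can be sketched as follows. This is an illustrative model under stated assumptions, not the patented scheduler: storage nodes are simulated by an in-memory dictionary, threads stand in for remote execution, and the names `SubTask`, `NODE_DATA`, `decompose`, `run_subtask`, and `run_query` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class SubTask:
    node: str                        # storage node that runs this subtask
    time_range: Tuple[int, int]      # partition condition (file-level pruning)
    predicate: Callable              # filter condition applied to each record

# Hypothetical per-node data: (timestamp, port) records.
NODE_DATA = {
    "node-a": [(10, 80), (20, 443), (400, 80)],
    "node-b": [(15, 80), (30, 22)],
}

def decompose(time_range, predicate):
    """Split one query task into one subtask per storage node."""
    return [SubTask(node, time_range, predicate) for node in NODE_DATA]

def run_subtask(task):
    """Apply the partition condition and the filter condition on a single node."""
    begin, end = task.time_range
    return [r for r in NODE_DATA[task.node] if begin <= r[0] <= end and task.predicate(r)]

def run_query(time_range, predicate):
    """Run all subtasks concurrently, then apply a global statistic (COUNT) to the merged set."""
    subtasks = decompose(time_range, predicate)
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        partial_results = list(pool.map(run_subtask, subtasks))
    merged = [record for part in partial_results for record in part]
    return merged, len(merged)

rows, count = run_query((0, 100), lambda record: record[1] == 80)
```

The partition and filter conditions run independently on each node, while the global statistic is computed once on the merged result, which mirrors the three condition types named in the text.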
Step 2: according to each of the plurality of query subtasks, concurrently execute batch queries on the data in distributed storage, and return the queried result sets in a distributed manner.
When querying massive data, the user's query conditions may produce a very large result set, even hundreds of millions of records. Processing a result set of this size takes a great deal of time. For interface display in a B/S (browser/server) application, for example, which needs only a few hundred records, returning the entire result set to the user is unnecessary and wastes valuable processing time. For this reason, the present invention proposes a batch query method that suits massive data queries needing a fast return of the result set.
Fig. 2 is a flowchart of querying data stored in a data storage device in batches and returning result sets in batches, according to one example of the present invention. As shown in Fig. 2, it comprises the following steps:
Step 211: set the maximum number of records returned by a single query (also called the threshold) for the query operation; for example, set the threshold to 1,000,000.
Step 212: according to the query subtask, query the data stored in each storage device and obtain a query result set based on the threshold. This covers two cases. In the first, the result set is obtained when the number of records found to satisfy the query conditions reaches the threshold; for example, once 1,000,000 matching records have been found, those 1,000,000 records are returned. In the second, the result set is obtained when the entire storage device has been queried even though the number of matching records has not reached the threshold (for example 1,000,000).
Step 213: determine whether the number of records in the result set has reached the single-query maximum. If it has not, the entire data storage device has been fully queried and all results satisfying the query conditions have been obtained, so step 215 is executed. If it has, a "not fully queried" flag is returned to the user, who decides whether to continue the query; if the query is to continue, step 214 is executed, otherwise step 215 is executed.
Step 214: save the current query state and continue the query from that state. In the present invention, each query subtask is given an identifier, the Session ID, associated with that subtask, and the query state is saved under the Session ID. Specifically, the batches of one query task share the same Session ID. When a query subtask is received, its Session ID is matched against the saved query state information. If a saved state with the same Session ID exists, that state is used to query the data not covered by the previous query, and this repeats until the user has obtained the entire required result set.
Step 215: return the queried result set to the user.
According to another embodiment of the present invention, the batch query and batch return of result sets can also be performed by the steps shown in the flowchart of Fig. 3, which comprise:
Step D2100: set the maximum number of records returned by a single query (also called the threshold) for the query operation; for example, set the threshold to 1,000,000.
Step D2200: receive the query subtasks, obtain the target index shards through the partition query condition, concurrently execute the filter-type query conditions on each index shard, and obtain the result set that satisfies the conditions.
Step D2300: determine whether the query subtask contains the grouping command GROUP BY. If it does, execute step D2400; otherwise execute step D2500.
Step D2400: use a hash algorithm to quickly determine whether records in the result set belong to the same group. Specifically, a hash function is applied to the attribute to be grouped on, the resulting hash value serves as a bucket number, and the record is placed in the bucket with that number. Because each bucket holds records with the same hash value, each record can be assigned to its group in O(1) time.
Step D2500: determine whether the query subtask contains a deduplication command, that is, the keyword DISTINCT. If it does, execute step D2600; otherwise execute step D2700.
Step D2600: deduplicate the records, distinguishing whether DISTINCT applies to the whole record, as in "SELECT DISTINCT ...", or to a particular field, as in "SELECT SUM (DISTINCT name) ...". A command of the form "SELECT DISTINCT ..." deduplicates whole records. A command of the form "SELECT SUM (DISTINCT name) ..." deduplicates the field name within each group and then computes the statistic (SUM usually appears together with a GROUP BY field). To improve computational efficiency, a bloom filter is used during deduplication to speed up the detection of repeated fields.
Step D2700: determine whether the query condition contains SELECT ... LIMIT K. K is generally very small (for example K=100), far smaller than the single-query maximum (1,000,000). If it does, then during the concurrent query each query subtask stops after finding K records that satisfy the conditions, and step D2810 is executed; otherwise step D2800 is executed. This step is designed for stream record applications that query only a small portion of the qualifying data. For such a query the data storage device does not cache the query state, even though the batch threshold has not been reached.
Step D2800: determine whether the number of records found has reached the threshold (for example 1,000,000). If it has, a "not fully queried" flag is returned to the user, who decides according to actual need whether to continue. If the user decides to continue, execute step D2820. If the user decides not to continue, or the number of records found has not reached the threshold, the result set is returned as a single batch and step D2810 is executed.
Step D2810: for a result set returned as a single batch, determine whether a statistical function (SUM, COUNT, AVG, MAX, or MIN) is present. If so, execute step D2811; otherwise execute step D2812.
Step D2811: compute the values required by the statistical function: SUM totals the field within each group, COUNT counts the records, AVG computes the mean, MAX takes the maximum, and MIN takes the minimum.
Step D2812: if a sort command is present, sort the data. Sorting can be applied to a given field or to the results computed by statistical functions such as SUM, COUNT, and AVG. The keyword ASC indicates ascending order and DESC descending order; the default is ascending.
Step D2813: return the result set.
Step D2820: store the intermediate results of this query in a temporary file, so that they can be merged with the results of the next batch to obtain the final query result file.
Step D2821: determine from the SessionID in the query subtask whether this is a newly initiated query request. Each time a query subtask is sent it carries a SessionID, and the SessionID of the same query subtask is identical each time.
Step D2822: determine whether a deduplication command is present. If so, execute step D2823; otherwise execute step D2824.
Step D2823: according to the position of the DISTINCT keyword, deduplicate the field using a bloom filter.
Step D2824: determine whether a grouping command is present. If so, execute step D2825; otherwise execute step D2826.
Step D2825: the grouping process first reads the result set from the temporary file. Because the result set saved in the temporary file was already grouped in the previous batch, only the first record of each group needs to be compared to determine whether the data of this batch belongs to the same group as the result set from the previous batch.
Step D2826: determine whether a statistical function (SUM, COUNT, AVG, MAX, or MIN) is present. If so, execute step D2827; otherwise execute step D2828.
Step D2827: compute the values required by the statistical function: SUM totals the field within each group, COUNT counts the records, AVG computes the mean, MAX selects the maximum, and MIN selects the minimum. The computation involves merging two batches. For example, for COUNT and SUM the result is added directly to the result computed for the previous batch; for MAX and MIN the maximum or minimum of the current batch and the previous batch is chosen.
Step D2828: determine whether the batch query has finished, that is, whether the entire data set has been queried. If not, execute step D2820; otherwise execute step D2829.
Step D2829: sort the final result set globally. Sorting can be applied to a given field or to the results of aggregate functions such as SUM, COUNT, and AVG; by default the result set is sorted in ascending order.
Step D2830: convert the statistical results from the temporary file into the final result file, and export the data in the specified format to return the result set.
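The hash grouping of step D2400 and the cross-batch merge of step D2827 can be sketched together. This is a minimal illustration under assumptions: a Python dictionary plays the role of the hash buckets, intermediate results are kept in memory rather than in a temporary file, and the function names are hypothetical. The text does not say how AVG is merged across batches; deriving it from the merged SUM and COUNT is an assumption made here.

```python
from collections import defaultdict

def aggregate_batch(rows):
    """Group (key, value) rows with a hash table and compute partial statistics per group."""
    groups = defaultdict(lambda: {"count": 0, "sum": 0, "max": None, "min": None})
    for key, value in rows:
        stats = groups[key]  # hash lookup: expected O(1) per record
        stats["count"] += 1
        stats["sum"] += value
        stats["max"] = value if stats["max"] is None else max(stats["max"], value)
        stats["min"] = value if stats["min"] is None else min(stats["min"], value)
    return dict(groups)

def merge_partials(previous, current):
    """Merge the statistics of the current batch into those of the previous batch."""
    merged = {key: dict(stats) for key, stats in previous.items()}
    for key, stats in current.items():
        if key not in merged:
            merged[key] = dict(stats)
            continue
        target = merged[key]
        target["count"] += stats["count"]                   # COUNT: add directly
        target["sum"] += stats["sum"]                       # SUM: add directly
        target["max"] = max(target["max"], stats["max"])    # MAX: larger of the two batches
        target["min"] = min(target["min"], stats["min"])    # MIN: smaller of the two batches
    return merged

def average(stats):
    """AVG derived from the merged SUM and COUNT (an assumption, see the lead-in)."""
    return stats["sum"] / stats["count"]

batch_1 = aggregate_batch([("a", 4), ("b", 10), ("a", 6)])
batch_2 = aggregate_batch([("a", 2), ("c", 7)])
total = merge_partials(batch_1, batch_2)
```

Averaging the per-batch averages would give a wrong answer when batches differ in size, which is why the sketch carries SUM and COUNT forward and divides only at the end.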
According to still another embodiment of the present invention, the distributed query method for massive structured data may further comprise: step 3, aggregating the result sets returned by the distributed query and presenting the aggregated result set to the user.
Fig. 4 is a flowchart of the steps of aggregating the query result sets according to an embodiment of the present invention. As shown in Fig. 4, these comprise:
Step 311: merge the result sets returned in a distributed manner, then determine whether there are global statistics or analysis query conditions, that is, analysis commands that must be executed after aggregation, such as Group By, Order By, SUM, COUNT, AVG, TOP, LIMIT, MAX, and MIN. If there are, execute step 312; if not, execute step 313.
Step 312: perform the global statistics and analysis operations according to the statistical query commands.
Step 313: generate a result file from the result set in the format and at the path required by the user, and present the result file to the user. For this purpose the present invention provides a paging query mechanism, which lets the user display any one page of the displayable data set on the interface. For example, suppose that after step 313 the generated result file contains i records in total and each page of the interface can display j records (i > j). There are then ⌈i/j⌉ pages (i/j rounded up), and the user can directly select any page of the result set to display as needed. The query operation then exits.
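The paging arithmetic in step 313 can be written out as a short sketch. The function names are hypothetical and the result file is modelled as an in-memory list; page numbers are 1-based here, which is an assumption, since the text does not fix a numbering convention.

```python
def page_count(total_records, page_size):
    """Number of pages needed for i records at j per page: i/j rounded up."""
    return -(-total_records // page_size)  # integer ceiling division

def get_page(records, page_number, page_size):
    """Return the records on a 1-based page; a page past the end is empty."""
    if page_number < 1:
        raise ValueError("page_number must be at least 1")
    start = (page_number - 1) * page_size
    return records[start:start + page_size]

result_file = list(range(95))          # i = 95 records
pages = page_count(len(result_file), 10)  # j = 10 records per page
last_page = get_page(result_file, pages, 10)
```

With 95 records and 10 per page there are 10 pages, and the last page holds the remaining 5 records.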
Because the distributed query method for massive structured data of the present invention queries in batches and returns result sets in batches, an application that does not need a large result set can have the results returned directly to the user side once the query result set reaches a certain threshold, while the query results can also be stored temporarily as needed to support batch query operations. For a query application that displays results on an interface, a single batch is returned for display. For a query application with statistical analysis functions, the saved query state is used to run multiple batches until all result sets satisfying the conditions have been found. In a massive data management system, the present invention can therefore serve applications that need little data but a fast response (for example interface display queries), and can also serve statistical analysis applications that tolerate longer response times but need a large result set (for example information analysis with data mining as the application background).
Because an index is built in the distributed massive structured data storage system, along with a mechanism for decomposing complex query conditions and a mechanism for scheduling concurrent query subtasks, the present invention makes full use of the computing resources of the distributed environment and executes query subtasks concurrently, improving query efficiency for massive structured data.
In applications such as stream data, recently loaded data is used very frequently. On this basis, according to one embodiment of the present invention, a data storage method is proposed that uses a dual sliding window structure to cache recently loaded data and queries the cached data, improving query efficiency in stream data applications.
Fig. 5 is a schematic diagram of the dual sliding window structure and its working principle. As shown in Fig. 5, the dual sliding window structure comprises a data write window and a data query window. The data write window receives the data loaded in real time, builds an index for the data, and updates the relevant system metadata. The data query window receives query subtasks and queries the data in the window directly, using the metadata records that have been built. The data query window and the data write window change roles in a streaming manner according to a time period.
For example, the time window is set to 5 minutes. As shown in Fig. 5, when 5 minutes have elapsed, the data write window holds a complete 5 minutes of data. It stops accepting new data and its role changes to data query window. A new buffer structure is allocated to receive loaded data, creating a new data write window. The data query window holds a complete 5 minutes of record data; it receives query commands, queries the data in the window directly according to the metadata records, and returns the result set satisfying the conditions. After another write period, a new data query window is produced. The old data query window then stops serving queries and instead stores the data cached in the window, in batches over the network (for example to hard disk). Once all data in the window has been written to storage, the resources used by the window are reclaimed. In this way the roles of the windows change in a streaming manner according to the time period.
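The role change between the data write window and the data query window can be sketched as a small class. This is an illustrative model only: windows are plain lists, the period timer is replaced by an explicit `rotate()` call, index and metadata maintenance are omitted, and the `flush` callback that persists a retired window is hypothetical.

```python
class DualWindowCache:
    """Minimal model of a write window and a query window that swap roles each period."""

    def __init__(self, flush):
        self.write_window = []   # receives newly loaded records
        self.query_window = []   # serves queries for the last complete period
        self._flush = flush      # hypothetical callback that persists a retired window

    def load(self, record):
        """Append a newly loaded record to the current write window."""
        self.write_window.append(record)

    def query(self, predicate):
        """Answer a query from the query window only."""
        return [record for record in self.query_window if predicate(record)]

    def rotate(self):
        """Called once per time period (for example every 5 minutes)."""
        retired = self.query_window
        self.query_window = self.write_window  # full write window becomes the query window
        self.write_window = []                 # fresh buffer becomes the new write window
        if retired:
            self._flush(retired)               # old query window is persisted, then released

disk = []
cache = DualWindowCache(flush=disk.extend)
cache.load("r1")
cache.load("r2")
cache.rotate()                          # end of the first period
cache.load("r3")
visible = cache.query(lambda record: True)
cache.rotate()                          # end of the second period
```

After the first rotation the records of the first period are queryable from memory, and after the second rotation they have been handed to storage while the records of the second period become queryable.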
Based on above-mentioned pair of moving window structure, according to one embodiment of present invention, a kind of massive structured data distributed storage method is proposed.As shown in Figure 6, massive structured data distributed storage method according to the present invention comprises:
Step 1, from the data of user side reception High speed load.
Step 2, sets up two moving window structures with fixing polling cycle, and so that the data of loading are carried out to buffer memory, this pair of moving window structure also enables data cached inquiry in the data query step in later stage.
Data buffer storage device utilizes cache device receive the data of real-time loading and carry out local cache.Conventionally data buffer storage device is realized by opening up large storage space internal memory or solid state hard disc.Through cycle regular time, data buffer storage device can be written to the data of buffer memory in data storage management device, realizes lasting data storage.
By massive structured data distributed storage method of the present invention, can realize the buffer memory to new loading data, thereby it is this to the search efficiency under the high application of recent loading data frequency of utilization to improve flow data in the later stage when data query.
According to one embodiment of present invention, can be based on the above-mentioned distributed storage method that the data that newly load is carried out to the massive structured data of buffer memory, when the Data Concurrent to distributed storage is carried out inquiry in batches, carry out inquiry in batches to realize every group of Data Concurrent of distributed caching by two moving window structures, and the distributed collection that returns results.
Based on above-mentioned distributed data storage method, according to one embodiment of present invention, the distributed enquiring method of another kind of massive structured data is proposed.As shown in Figure 7, this querying method comprises:
Step 1, the query task that reception user sends is also decomposed into multiple queries subtask by query task.
Step 2, according to the plurality of inquiry subtask, carries out inquiry in batches to every group of Data Concurrent of distributed caching, and the distributed result set inquiring that returns.
Wherein to also can be according to subregion class querying condition in the step that query task is decomposed into multiple queries subtask, filter class querying condition or global statistics, analysis classes querying condition be decomposed into query task the step of multiple queries subtask, thereby in the time inquiring about, further improve search efficiency.
According to another embodiment of the present invention, the distributed enquiring method of this massive structured data also comprises: step 3, gathers the query results returning in batches, and this result set is presented to user.
According to yet another embodiment of the present invention, the distributed query method for massive structured data further comprises, while batch queries are executed in parallel on each group of distributed cached data, also executing batch queries in parallel on each group of distributed stored data.
According to another embodiment of the present invention, each group of distributed cached data may also be queried in batches in the manner shown in Figures 2 and 3.
The present invention caches loaded data directly and executes query operations directly on the cached data, so a query need not wait until all data has been uniformly stored. This improves query efficiency especially in applications where recently stored data is used very frequently (for example, log-type stream record data).
Of course, those skilled in the art will also appreciate that when batch queries are executed on each group of distributed stored data as described above
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution of the present invention that do not depart from its spirit and scope shall all be covered by the claims of the present invention.

Claims (20)

1. A distributed query method for massive structured data, comprising a data loading step and a query step, wherein the data loading step comprises: receiving high-speed loaded data from a user side, and caching the loaded data in a distributed manner by using a dual sliding window structure; wherein the dual sliding window structure comprises a data write window and a data query window; the data write window is used to receive data loaded in real time and to modify the related system metadata information; the data query window is used to receive query subtasks and to directly query the data of this window according to the established metadata record information; the data query window and the data write window undergo a streaming transition according to a time cycle, in the following manner: when a write cycle completes, the current data write window takes on the role of data query window, and a new cache structure is allocated to receive loaded data, generating a new data write window; after another write cycle, a new data query window is produced, at which point the old data query window no longer provides data query service and the resources it used are reclaimed, so that the roles of the multiple windows change in a streaming manner according to the time cycle;
the query step comprises:
Step 1, receiving the query task sent by the user and decomposing the query task into multiple query subtasks according to query conditions; and
Step 2, according to each of the multiple query subtasks, concurrently executing batch queries on each group of distributed stored data, and returning the queried result sets in a distributed manner, wherein the query subtasks are received by the data query window, and the data query window directly executes query operations on the data cached in this window according to the established metadata record information.
2. The distributed query method of claim 1, further comprising: step 3, aggregating the result sets returned in a distributed manner, and presenting the aggregated result set to the user.
3. The distributed query method of claim 1 or 2, wherein step 1 comprises decomposing the query task into multiple query subtasks according to partition-class query conditions, filter-class query conditions, or global statistics and analysis-class query conditions.
4. The distributed query method of claim 3, wherein the partition-class query condition is set according to the index type of the distributed stored data.
5. The distributed query method of claim 4, wherein, based on an index built on a time attribute, the partition-class query condition is set with the time attribute.
6. The distributed query method of claim 3, wherein the step of concurrently executing batch queries on each group of distributed stored data comprises executing the following steps for each group of distributed stored data:
Step 211, setting a single-query maximum number of returned records for the operation of querying the distributed stored data;
Step 212, querying the stored data according to the query subtask, and obtaining a result set based on the single-query maximum number of returned records;
Step 213, determining whether the number of records in the result set reaches the single-query maximum number of returned records; if it does not, executing step 215; if it does, letting the user decide whether to continue the query: if the query is to continue, it is a multi-pass query and step 214 is executed; otherwise it is a single-pass query and step 215 is executed;
Step 214, saving the current query state and continuing the query based on that state until a result set containing all records that satisfy the query conditions is obtained; and
Step 215, returning the queried result set.
7. The distributed query method of claim 6, wherein step 212 comprises obtaining the target index shards through the partition-class query condition, and concurrently executing the filter-class query conditions on each index shard to obtain the result set.
8. The distributed query method of claim 7, wherein step 212 further comprises: after obtaining the result set, executing the global statistics and analysis-class query conditions.
9. The distributed query method of claim 8, wherein executing the global statistics and analysis-class query conditions comprises: according to the grouping command in the query subtask, performing fast grouping of the data set using a hash algorithm.
10. The distributed query method of claim 8, wherein executing the global statistics and analysis-class query conditions further comprises: performing in-group deduplication or global deduplication according to the deduplication command in the query subtask.
11. The distributed query method of claim 8, wherein executing the global statistics and analysis-class query conditions comprises computing statistical results over the query results according to the statistical functions SUM, COUNT, AVG, MAX, and MIN, wherein SUM computes the sum of the values of a field after grouping, COUNT counts the number of records, AVG computes the average, MAX is the maximum among all queried records, and MIN is the minimum among all queried records.
12. The distributed query method of claim 6, wherein step 212 comprises: obtaining the result set when the number of queried records satisfying the query conditions reaches the single-query maximum number of returned records, or obtaining the result set collected so far when the query over all stored data has completed even though the number of queried records satisfying the query conditions has not reached the single-query maximum number of returned records.
13. The distributed query method of claim 6, wherein step 213 comprises: when the number of records in the result set equals the single-query maximum number of returned records, returning an "incomplete query" flag to the user, and the user decides, based on this flag and actual needs, whether to continue the query.
14. The distributed query method of claim 6, wherein step 214 comprises: providing each query subtask with a corresponding identifier, and saving the current query state according to that identifier.
15. The distributed query method of claim 2, wherein step 3 comprises:
Step 311, merging the result sets returned in a distributed manner;
Step 312, according to the statistical query command, executing global statistics and analysis-class operations on the merged result set; and
Step 313, generating a result file from the resulting result set in the format and at the path required by the user, and presenting the result file to the user.
16. The distributed query method of claim 6, further comprising, after step 214, globally sorting all result sets.
17. The distributed query method of claim 2, wherein step 3 comprises presenting the aggregated result set to the user page by page through a paging query mechanism.
18. The distributed query method of claim 1 or 2, wherein step 2 further comprises: while batch queries are concurrently executed on the distributed stored data, using the dual sliding window structure to concurrently execute batch queries on each group of distributed cached data, and returning the result sets in a distributed manner.
19. The distributed query method of claim 18, wherein the dual sliding window structure comprises a data write window and a data query window; the data write window receives data loaded in real time and modifies the related system metadata information; the data query window receives query subtasks and directly queries the data of this window according to the established metadata record information.
20. A distributed query processing system for massive structured data, comprising:
A data loading device, for receiving high-speed loaded data from a user side, and caching the loaded data in a distributed manner by using a dual sliding window structure; wherein the dual sliding window structure comprises a data write window and a data query window; the data write window is used to receive data loaded in real time and to modify the related system metadata information; the data query window is used to receive query subtasks and to directly query the data of this window according to the established metadata record information; the data query window and the data write window undergo a streaming transition according to a time cycle, in the following manner: when a write cycle completes, the current data write window takes on the role of data query window, and a new cache structure is allocated to receive loaded data, generating a new data write window; after another write cycle, a new data query window is produced, at which point the old data query window no longer provides data query service and the resources it used are reclaimed, so that the roles of the multiple windows change in a streaming manner according to the time cycle;
A device for receiving the query task sent by the user and decomposing the query task into multiple query subtasks according to query conditions; and
A device for concurrently executing batch queries on the distributed stored data according to each of the multiple query subtasks, and returning the queried result sets in a distributed manner.
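The batch query procedure recited in claim 6 (steps 211 to 215), together with the "incomplete query" flag of claim 13, can be sketched as a resumable scan. This is a minimal illustration under stated assumptions: the stored data is an in-memory list, the saved query state is a single scan offset, and `BatchQuery` and `next_batch` are hypothetical names rather than anything defined by the patent.

```python
class BatchQuery:
    """Resumable scan over one group of stored records (illustrative only)."""

    def __init__(self, records, predicate, max_records):
        self._records = records
        self._predicate = predicate
        self._max = max_records  # step 211: single-query maximum number of returned records
        self._pos = 0            # saved query state used by step 214: next scan offset

    def next_batch(self):
        """Steps 212-213: scan until the limit is reached or the data is exhausted.

        Returns (batch, incomplete). incomplete is True when the batch size
        equals the limit, which corresponds to the "incomplete query" flag.
        """
        batch = []
        while self._pos < len(self._records) and len(batch) < self._max:
            record = self._records[self._pos]
            self._pos += 1
            if self._predicate(record):
                batch.append(record)
        incomplete = len(batch) == self._max
        return batch, incomplete
```

A caller that wants only a quick first page calls `next_batch` once (single-pass query); a caller that needs the full result set keeps calling while the flag is set (multi-pass query), and each call resumes from the saved offset instead of rescanning.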

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110442091.0A CN102521406B (en) 2011-12-26 2011-12-26 Distributed query method and system for complex task of querying massive structured data

Publications (2)

Publication Number Publication Date
CN102521406A CN102521406A (en) 2012-06-27
CN102521406B true CN102521406B (en) 2014-06-25

Family

ID=46292319

Country Status (1)

Country Link
CN (1) CN102521406B (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102841944A (en) * 2012-08-27 2012-12-26 南京云创存储科技有限公司 Method achieving real-time processing of big data
CN103034735B (en) * 2012-12-26 2017-02-08 北京讯鸟软件有限公司 Big data distributed file export method
CN103106261B (en) * 2013-01-28 2016-02-10 中国电子科技集团公司第二十八研究所 Based on the distributed enquiring method of arrowband cloud data, services
CN104252481B (en) * 2013-06-27 2018-10-19 阿里巴巴集团控股有限公司 The dynamic check method and apparatus of master-slave database consistency
CN103399944A (en) * 2013-08-14 2013-11-20 曙光信息产业(北京)有限公司 Implementation method and implementation device for data duplication elimination query
CN104572676B (en) * 2013-10-16 2017-11-17 中国银联股份有限公司 A kind of inter-library paging query method for multiple database table
CN103544259B (en) * 2013-10-16 2017-01-18 国家计算机网络与信息安全管理中心 Aggregating sorting TopK inquiry processing method and system
CN103631910A (en) * 2013-11-26 2014-03-12 烽火通信科技股份有限公司 Distributed database multi-column composite query system and method
CN104951462B (en) 2014-03-27 2018-08-03 国际商业机器公司 Method and system for managing database
CN104376029B (en) * 2014-04-10 2017-12-19 北京亚信时代融创咨询有限公司 The processing method and system of a kind of data
CN103955486B (en) * 2014-04-14 2017-08-01 五八同城信息技术有限公司 Distribution service and its data update, the method for data query
CN104008191B (en) * 2014-06-12 2018-09-28 北京京东尚科信息技术有限公司 A kind of data query method
CN105468651B (en) * 2014-09-12 2020-03-27 阿里巴巴集团控股有限公司 Relational database data query method and system
CN105574052A (en) * 2014-11-06 2016-05-11 中兴通讯股份有限公司 Database query method and apparatus
CN104361090B (en) * 2014-11-17 2018-01-05 浙江宇视科技有限公司 Data query method and device
CN104615726B (en) * 2015-02-06 2017-12-22 北京神舟航天软件技术有限公司 A kind of method based on slow loading technique displaying a large number of services object
CN105183901A (en) * 2015-09-30 2015-12-23 北京京东尚科信息技术有限公司 Method and device for reading database table through data query engine
CN105243169B (en) * 2015-11-12 2019-01-29 中国建设银行股份有限公司 A kind of data query method and system
CN107479962B (en) * 2016-06-08 2021-05-07 阿里巴巴集团控股有限公司 Method and equipment for issuing task
CN107784032B (en) * 2016-08-31 2020-06-16 华为技术有限公司 Progressive output method, device and system of data query result
WO2018058671A1 (en) 2016-09-30 2018-04-05 华为技术有限公司 Control method for executing multi-table connection operation and corresponding device
CN106570145B (en) * 2016-10-28 2020-07-10 中国科学院软件研究所 Distributed database result caching method based on hierarchical mapping
CN106776848B (en) * 2016-11-04 2020-04-17 广州市诚毅科技软件开发有限公司 Database query method and device
CN106874400A (en) * 2017-01-16 2017-06-20 努比亚技术有限公司 A kind of data processing method and server
CN107103032B (en) * 2017-03-21 2020-02-28 中国科学院计算机网络信息中心 Mass data paging query method for avoiding global sequencing in distributed environment
CN110019355A (en) * 2017-09-27 2019-07-16 北京国双科技有限公司 Independent data calculation method and device
CN107992516A (en) * 2017-10-27 2018-05-04 平安科技(深圳)有限公司 Electronic device, the method for data query and storage medium
CN107832406B (en) * 2017-11-03 2020-09-11 北京锐安科技有限公司 Method, device, equipment and storage medium for removing duplicate entries of mass log data
CN108345648B (en) * 2018-01-18 2021-01-26 奇安信科技集团股份有限公司 Method and device for acquiring log information based on columnar storage
CN109657018A (en) * 2018-11-13 2019-04-19 平安科技(深圳)有限公司 A kind of distribution vehicle operation data querying method and terminal device
CN109766366A (en) * 2019-01-07 2019-05-17 深圳市活力天汇科技股份有限公司 A kind of world air ticket asynchronous query method
CN110321388B (en) * 2019-02-26 2021-07-02 南威软件股份有限公司 Quick sequencing query method and system based on Greenplus
CN109947796B (en) * 2019-04-12 2021-04-30 北京工业大学 Caching method for query intermediate result set of distributed database system
CN110187958B (en) * 2019-06-04 2020-05-05 上海燧原智能科技有限公司 Task processing method, device, system, equipment and storage medium
CN110941619B (en) * 2019-12-02 2023-05-16 浪潮软件股份有限公司 Definition method of graph data storage model and structure for various usage scenes
CN111767560A (en) * 2020-06-24 2020-10-13 中国工商银行股份有限公司 Aggregation query method and device for multiple data sources
CN112637267B (en) * 2020-11-27 2023-06-02 成都质数斯达克科技有限公司 Service processing method, device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001076192A3 (en) * 2000-03-30 2002-02-21 Intel Corp Method and device for distributed caching
CN101251861A (en) * 2008-03-18 2008-08-27 北京锐安科技有限公司 Method for loading and inquiring magnanimity data
CN101908075A (en) * 2010-08-17 2010-12-08 上海云数信息科技有限公司 SQL-based parallel computing system and method
CN102006330A (en) * 2010-12-01 2011-04-06 北京瑞信在线系统技术有限公司 Distributed cache system, data caching method and inquiring method of cache data
CN102254024A (en) * 2011-07-27 2011-11-23 国网信息通信有限公司 Mass data processing system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180824

Address after: 100044 B sixteen, No. 22 building, South Road, Haidian District, Beijing.

Patentee after: Guoxin electronic bill Platform Information Service Co., Ltd.

Address before: 100190 South Road, Zhongguancun Science Academy, Haidian District, Beijing 6

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences