CN102841944A - Method achieving real-time processing of big data - Google Patents
Method achieving real-time processing of big data
- Publication number
- CN102841944A
- Authority
- CN
- China
- Prior art keywords
- data
- query
- time
- real
- client
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for real-time processing of big data and relates to the field of computer application systems. In the method, data ingestion, querying, and result transmission are all concurrent and real-time. Index filtering is carried out while tasks are being distributed; as filtering proceeds, the index files already selected are distributed to the data nodes, each of which queries its local files and returns data to the client. As soon as any data node finishes its query, it returns its results to the user. Because every stage runs concurrently, the method makes full use of the computer hardware. The efficient B+ tree structure, combined with concurrent query execution, lets queries complete in real time and greatly improves query efficiency, so the user can obtain query results while the query operation is executing.
Description
Technical field
The present invention relates to the field of computer application systems, and in particular to a method for processing massive data in real time.
Background art
With the development of informatization, the data that enterprises must process is growing explosively. Data volumes have reached the TB and even PB level, which brings a series of problems. As data volume grows, system load rises and the performance of data ingestion and querying declines accordingly. How to get maximum performance from a system and achieve the fastest possible ingestion and query speeds without increasing hardware cost is a difficult problem facing many enterprises.
The emergence of cloud computing offers an effective approach to processing massive data. In a typical cloud computing solution, HDFS (a distributed file system), part of Hadoop (a distributed system architecture), makes massive data storage easy to implement while effectively preventing single points of failure and avoiding unnecessary loss. However, the usual way to retrieve data on HDFS is to launch a global search with MapReduce (a framework for parallel computation over large-scale data), which requires a full scan of all data stored on HDFS. In cloud computing, and especially with massive data, this wastes a great deal of system resources and consumes a great deal of time, so it is clearly unsuitable for a real production environment.
Summary of the invention
The objective of the invention is to overcome the shortcomings of the data processing methods commonly used in existing cloud computing solutions, namely wasted system resources and long processing times, by providing an effective method for real-time processing of massive data.
In the big data real-time processing method of the invention, data ingestion, querying, and transmission are all concurrent and real-time:
(1) Real-time ingestion: building on existing HDFS, multiple threads are started on each datanode (data node) to create index files in parallel; the indexes are built as B+ trees;
(2) Real-time query: using a distributed computing system, a job (task) is created and submitted on the server side to run the query, which proceeds in three steps:
A. Index filtering is performed on the namenode (control node): because index files are named by time, the time in the query condition is matched against the index file names to select the index files that satisfy the condition;
B. Tasks are distributed to each datanode, where a B+ tree lookup, based on the selected index files and the query condition, yields the positions of the data that satisfy the condition;
C. Tasks are distributed again: each machine reads the data at the positions obtained in the previous step and returns the query results;
(3) Real-time result transmission: jetty (an open-source servlet container) serves as the web container. While the query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the server reads the query result files and returns them to the client, and the client then sends a continue (continuation) request to the server. The server starts multiple threads to read query results and returns the data read to the client. If the returned data is empty, the flow ends; if it is not empty, the client sends another continue request. During the query, any datanode that completes its query successfully returns data to the client at once, without waiting for all datanodes to finish.
In the method of the invention, data ingestion, querying, and transmission are all concurrent and real-time. Indexes are filtered while the task is being created; while filtering proceeds, the index files already selected are distributed to the datanodes, and each datanode queries its local files and returns data to the client. As soon as any datanode completes its query, the results are returned to the user. Every stage of the method executes concurrently, so the computer hardware is used to the fullest. The efficient B+ tree structure, together with parallel query execution, lets queries complete in real time and greatly improves query efficiency, so the user can obtain query results while the query operation is executing.
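The idea that results from any datanode are returned as soon as that node finishes can be sketched as follows. This is only an illustration under stated assumptions: the datanodes are simulated as in-memory dictionaries, and `query_node` and `stream_results` are hypothetical names that do not appear in the patent. Python's standard `concurrent.futures.as_completed` stands in for whatever scheduling the actual system uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_node(node, condition):
    """Hypothetical stand-in for the local query run on one datanode."""
    return [row for row in node["rows"] if condition(row)]


def stream_results(nodes, condition):
    """Yield (node name, rows) pairs in completion order, not submission order."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        futures = {
            pool.submit(query_node, node, condition): node["name"]
            for node in nodes
        }
        # as_completed hands back each future as soon as it finishes, so a
        # fast node's rows are available before slow nodes are done.
        for future in as_completed(futures):
            yield futures[future], future.result()


if __name__ == "__main__":
    simulated_nodes = [
        {"name": "dn1", "rows": [1, 5, 9]},
        {"name": "dn2", "rows": [2, 6]},
    ]
    for name, rows in stream_results(simulated_nodes, lambda r: r > 4):
        print(name, rows)
```

The order in which nodes report is not deterministic, which is the point: the consumer never waits for the slowest node before seeing the first rows.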
Description of drawings
Fig. 1 is a schematic diagram of a field index based on a B+ tree;
Fig. 2 is a flowchart of the jetty query process;
Fig. 3 is a flowchart of data query and transmission.
Embodiment
The technical solution of the invention is described in detail below with reference to the accompanying drawings:
All processing in the invention executes concurrently, which uses the computer hardware to the fullest and greatly improves processing efficiency, so that the user can obtain query results while the query operation is executing.
The invention comprises real-time data ingestion, real-time query, and real-time result transmission; ingestion, querying, and transmission are all concurrent and real-time.
1. Real-time ingestion
Building on existing HDFS, multiple threads are started on each datanode to create index files in parallel. The index is created as shown in Fig. 1:
Indexes are built on selected significant fields as B+ trees, so each new record only needs to be inserted into the B+ tree. Insertion into a B+ tree takes place only at leaf nodes. After each (key, pointer) index entry is inserted, the number of subtrees (entries) in the node is checked against the allowed range. If, after insertion, the count exceeds m (the order of the B+ tree), the leaf node is split into two nodes, and their parent node must then contain the maximum key and the node address of each of the two nodes. The problem thereby becomes one of insertion into a non-leaf node. Inserting a key into a non-leaf node is similar to inserting into a leaf: the number of subtrees in a non-leaf node is bounded above by m, and exceeding this bound likewise triggers a node split. When the root node splits, it has no parent, so a new parent node must be created to serve as the new root of the tree.
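The insertion procedure described above can be sketched in a few dozen lines. This is a minimal in-memory illustration, not the patent's implementation: the class and method names are hypothetical, the split point (the middle of the node) is an assumption, and on-disk index files are not modeled. Following the text, each non-leaf entry stores the maximum key of the child it points to.

```python
import bisect


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []  # leaf: record keys; non-leaf: max key of each child
        self.vals = []  # leaf: record pointers; non-leaf: child nodes


class BPlusTree:
    def __init__(self, order=4):
        self.m = order  # maximum number of entries (subtrees) per node
        self.root = Node(leaf=True)

    def insert(self, key, pointer):
        split = self._insert(self.root, key, pointer)
        if split is not None:
            # The root has no parent, so a split creates a new root.
            left, right = split
            self.root = Node(leaf=False)
            self.root.keys = [left.keys[-1], right.keys[-1]]
            self.root.vals = [left, right]

    def _insert(self, node, key, pointer):
        if node.leaf:
            i = bisect.bisect_right(node.keys, key)
            node.keys.insert(i, key)
            node.vals.insert(i, pointer)
        else:
            # Descend into the first child whose max key is >= key,
            # or into the last child if key exceeds every max key.
            i = min(bisect.bisect_left(node.keys, key), len(node.keys) - 1)
            split = self._insert(node.vals[i], key, pointer)
            if split is None:
                node.keys[i] = node.vals[i].keys[-1]  # max may have grown
            else:
                left, right = split
                # The parent records the max key and address of both halves.
                node.keys[i:i + 1] = [left.keys[-1], right.keys[-1]]
                node.vals[i:i + 1] = [left, right]
        if len(node.keys) > self.m:
            return self._split(node)
        return None

    @staticmethod
    def _split(node):
        mid = len(node.keys) // 2
        right = Node(node.leaf)
        right.keys, right.vals = node.keys[mid:], node.vals[mid:]
        node.keys, node.vals = node.keys[:mid], node.vals[:mid]
        return node, right

    def search(self, key):
        node = self.root
        while not node.leaf:
            i = bisect.bisect_left(node.keys, key)
            if i == len(node.keys):
                return None  # key is larger than every key in the tree
            node = node.vals[i]
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.vals[i]
        return None
```

In this sketch the pointer stored with each key can be any value, for example an (IP address, offset) pair identifying where the record lives.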
2. Real-time query
Using a distributed computing system, the client first sends a GET request to the server; the server creates and submits a job to run the query, which proceeds in three steps:
1). The namenode performs index filtering. Because index files are named by time, the time in the query condition (a query must include a time condition, with a start time and an end time) is matched against the index file names to select the index files that satisfy the condition.
2). Tasks are distributed to each datanode, where a B+ tree lookup, based on the selected index files and the query condition, yields the positions of the data that satisfy the condition.
A data position records the IP address of the machine (datanode) that stores the data and the offset of the data. The machine is located by its IP address, and the data is then found at the offset.
3). Tasks are distributed again: each machine reads the data at the positions obtained in the previous step, returns the query results, and writes the query result set to a local file.
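Step 1), matching the query's time range against time-based index file names, can be sketched as below. The patent does not give the naming scheme, so the `idx_YYYYMMDDHH` pattern and the one-file-per-hour granularity are assumptions made purely for illustration, as is the function name `filter_index_files`.

```python
from datetime import datetime, timedelta

# Assumed naming scheme (not from the patent): one index file per hour.
NAME_FORMAT = "idx_%Y%m%d%H"
FILE_SPAN = timedelta(hours=1)


def filter_index_files(file_names, start, end):
    """Return the index files whose time span overlaps [start, end]."""
    selected = []
    for name in file_names:
        try:
            begins = datetime.strptime(name, NAME_FORMAT)
        except ValueError:
            continue  # not an index file under the assumed naming scheme
        # A file covers [begins, begins + FILE_SPAN); keep it on any overlap.
        if begins <= end and begins + FILE_SPAN > start:
            selected.append(name)
    return sorted(selected)
```

Because only file names are inspected, this filtering step touches no index contents, which is why it can run on the namenode before any work is sent to the datanodes.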
The efficient B+ tree structure, together with parallel query execution, lets queries complete in real time.
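Steps 2) and 3) hinge on data positions made up of an IP address and an offset. The following sketch groups positions by machine and then reads records at the given byte offsets from a local file. The record format is not specified in the patent; newline-terminated records and the function names here are assumptions for illustration.

```python
from collections import defaultdict


def group_by_node(positions):
    """Group (ip, offset) positions so each machine gets one list of offsets."""
    tasks = defaultdict(list)
    for ip, offset in positions:
        tasks[ip].append(offset)
    # Sorted offsets let each machine read its file in one forward pass.
    return {ip: sorted(offsets) for ip, offsets in tasks.items()}


def read_records(path, offsets):
    """Read one record at each byte offset (assumes newline-terminated records)."""
    records = []
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            records.append(f.readline().rstrip(b"\n"))
    return records
```

Each machine would call `read_records` only on its own share of the grouped offsets, so no record data crosses the network until results are returned.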
3. Real-time result transmission
Jetty serves as the web container. While the query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the server reads the query result files and returns them to the client, and the client then sends a continue request to the server. The server starts multiple threads to read query results and returns the data read to the client. If the returned data is empty, the flow ends; if it is not empty, the client sends another continue request. During the query, any datanode that completes its query successfully returns data to the client at once, without waiting for all datanodes to finish.
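The polling and continue-request loop can be sketched without any web container. This is a simplified model under stated assumptions: the result directory is an ordinary local directory, each result file is complete once it appears, and the HTTP exchange is replaced by a plain function call. All function names are hypothetical, and the multithreaded reading on the server side is not modeled.

```python
import os
import time


def read_new_results(result_dir, seen):
    """Server side: return the contents of result files not yet sent."""
    batch = []
    for name in sorted(os.listdir(result_dir)):
        if name in seen:
            continue
        with open(os.path.join(result_dir, name), encoding="utf-8") as f:
            batch.append(f.read())
        seen.add(name)
    return batch


def poll_for_batch(result_dir, seen, query_done, interval=0.05):
    """Server side: poll until new results exist or the query has finished."""
    while True:
        # Sample the flag before reading so a file written just before
        # completion is still picked up by this pass.
        done = query_done()
        batch = read_new_results(result_dir, seen)
        if batch or done:
            return batch
        time.sleep(interval)


def client_fetch_all(send_request):
    """Client side: keep sending continue requests until a batch is empty."""
    results = []
    while True:
        batch = send_request()
        if not batch:
            return results
        results.extend(batch)
```

An empty batch is returned only when the query has finished and nothing remains unsent, which matches the termination condition described in the text.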
Fig. 2 is a flowchart of the jetty query process, and Fig. 3 shows the complete flow of data query and transmission. With jetty as the web container, the client first sends a GET request to the server. The server parses the JSON (a lightweight data interchange format) string, instantiates a job object from the query conditions in the JSON string, submits the job to run the distributed query, and finally returns the results. While the query runs on HDFS, jetty polls the query result directory. If it is not empty, the server reads the files and returns them to the client; the client then sends a continue request to the server, and the server starts multiple threads to read query results and returns the data read to the client. If the query result is empty and the query on HDFS has finished, an empty result is returned and the flow ends.
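Parsing the JSON string and instantiating a job object from its query conditions might look like the sketch below. The patent does not specify the JSON layout, so the field names `start_time` and `end_time`, the ISO 8601 time format, and the `QueryJob` class are all assumptions. The only requirement taken from the text is that a start time and an end time must be present.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class QueryJob:
    """Hypothetical job object built from the client's query conditions."""
    start: datetime
    end: datetime
    filters: dict = field(default_factory=dict)


def parse_query(json_text):
    params = json.loads(json_text)
    # The time condition is mandatory: both a start and an end time.
    missing = [k for k in ("start_time", "end_time") if k not in params]
    if missing:
        raise ValueError(f"missing required time condition: {missing}")
    start = datetime.fromisoformat(params.pop("start_time"))
    end = datetime.fromisoformat(params.pop("end_time"))
    if start > end:
        raise ValueError("start_time must not be later than end_time")
    # Whatever remains is treated as field-level query conditions.
    return QueryJob(start=start, end=end, filters=params)
```

Rejecting a request without a time range up front means the index filtering step always has a range to match against the index file names.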
Claims (1)
1. A method for real-time processing of big data, characterized in that data ingestion, querying, and transmission are all concurrent and real-time:
(1) Real-time ingestion: building on existing HDFS, multiple threads are started on each datanode to create index files in parallel; the indexes are built as B+ trees;
(2) Real-time query: using a distributed computing system, a job is created and submitted on the server side to run the query, which proceeds in three steps:
A. Index filtering is performed on the namenode: because index files are named by time, the time in the query condition is matched against the index file names to select the index files that satisfy the condition;
B. Tasks are distributed to each datanode, where a B+ tree lookup, based on the selected index files and the query condition, yields the positions of the data that satisfy the condition;
C. Tasks are distributed again: each machine reads the data at the positions obtained in the previous step and returns the query results;
(3) Real-time result transmission: jetty serves as the web container. While the query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the server reads the query result files and returns them to the client, and the client then sends a continue request to the server. The server starts multiple threads to read query results and returns the data read to the client. If the returned data is empty, the flow ends; if it is not empty, the client sends another continue request. During the query, any datanode that completes its query successfully returns data to the client at once, without waiting for all datanodes to finish.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012103064203A CN102841944A (en) | 2012-08-27 | 2012-08-27 | Method achieving real-time processing of big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102841944A true CN102841944A (en) | 2012-12-26 |
Family
ID=47369307
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012103064203A Pending CN102841944A (en) | 2012-08-27 | 2012-08-27 | Method achieving real-time processing of big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102841944A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104424326A (en) * | 2013-09-09 | 2015-03-18 | 华为技术有限公司 | Data processing method and device |
CN104572865A (en) * | 2014-12-18 | 2015-04-29 | 泸州医学院 | Method for batch query of HBase (Hadoop Database) data for Servlet in TTLB (time to last byte) based on Lua |
CN106649847A (en) * | 2016-12-30 | 2017-05-10 | 南威软件股份有限公司 | A large data real-time processing system based on Hadoop |
CN109992575A (en) * | 2019-02-12 | 2019-07-09 | 哈尔滨学院 | The distributed memory system of big data |
CN110781001A (en) * | 2019-10-23 | 2020-02-11 | 广东浪潮大数据研究有限公司 | Kubernetes-based container environment variable checking method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101950297A (en) * | 2010-09-10 | 2011-01-19 | 北京大学 | Method and device for storing and inquiring mass semantic data |
CN102375853A (en) * | 2010-08-24 | 2012-03-14 | 中国移动通信集团公司 | Distributed database system, method for building index therein and query method |
CN102521307A (en) * | 2011-12-01 | 2012-06-27 | 北京人大金仓信息技术股份有限公司 | Parallel query processing method for share-nothing database cluster in cloud computing environment |
CN102521406A (en) * | 2011-12-26 | 2012-06-27 | 中国科学院计算技术研究所 | Distributed query method and system for complex task of querying massive structured data |
-
2012
- 2012-08-27 CN CN2012103064203A patent/CN102841944A/en active Pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104424326A (en) * | 2013-09-09 | 2015-03-18 | 华为技术有限公司 | Data processing method and device |
CN104424326B (en) * | 2013-09-09 | 2018-06-15 | 华为技术有限公司 | A kind of data processing method and device |
CN104572865A (en) * | 2014-12-18 | 2015-04-29 | 泸州医学院 | Method for batch query of HBase (Hadoop Database) data for Servlet in TTLB (time to last byte) based on Lua |
CN104572865B (en) * | 2014-12-18 | 2018-03-20 | 泸州医学院 | Servlet batch queries HBase data methods in TTLB based on Lua |
CN106649847A (en) * | 2016-12-30 | 2017-05-10 | 南威软件股份有限公司 | A large data real-time processing system based on Hadoop |
CN109992575A (en) * | 2019-02-12 | 2019-07-09 | 哈尔滨学院 | The distributed memory system of big data |
CN109992575B (en) * | 2019-02-12 | 2020-02-14 | 哈尔滨学院 | Distributed storage system for big data |
CN110781001A (en) * | 2019-10-23 | 2020-02-11 | 广东浪潮大数据研究有限公司 | Kubernetes-based container environment variable checking method |
CN110781001B (en) * | 2019-10-23 | 2023-03-28 | 广东浪潮大数据研究有限公司 | Kubernetes-based container environment variable checking method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101031907B (en) | Index processing | |
CN103246749B (en) | The matrix database system and its querying method that Based on Distributed calculates | |
CN104111996A (en) | Health insurance outpatient clinic big data extraction system and method based on hadoop platform | |
CN103927331B (en) | Data querying method, data querying device and data querying system | |
US20110154339A1 (en) | Incremental mapreduce-based distributed parallel processing system and method for processing stream data | |
CN102841944A (en) | Method achieving real-time processing of big data | |
CN105138661A (en) | Hadoop-based k-means clustering analysis system and method of network security log | |
CN111400326A (en) | Smart city data management system and method thereof | |
CN102932415A (en) | Method and device for storing mirror image document | |
WO2015058578A1 (en) | Method, apparatus and system for optimizing distributed computation framework parameters | |
CN111258978B (en) | Data storage method | |
CN107203532B (en) | Index system construction method, search realization method and device | |
US20200320037A1 (en) | Persistent indexing and free space management for flat directory | |
CN105930417B (en) | A kind of big data ETL interactive process platform based on cloud computing | |
CN104111958A (en) | Data query method and device | |
CN106294814A (en) | HBase secondary index based on memory database builds and the device and method of inquiry | |
CN103778251A (en) | SPARQL parallel query method facing large-scale RDF graph data | |
CN101562664A (en) | Ticket processing method and system | |
CN103678550A (en) | Mass data real-time query method based on dynamic index structure | |
CN103810272A (en) | Data processing method and system | |
CN106156319A (en) | Telescopic distributed resource description framework data storage method and device | |
CN110083600A (en) | A kind of method, apparatus, calculating equipment and the storage medium of log collection processing | |
CN102207935A (en) | Method and system for establishing index | |
CN111125213A (en) | Data acquisition method, device and system | |
CN103577614A (en) | Data acquisition method and system oriented to SAP PI application integration platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20121226 |