CN102841944A - Method achieving real-time processing of big data - Google Patents

Info

Publication number
CN102841944A
Authority
CN
China
Prior art keywords
data
query
time
real
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012103064203A
Other languages
Chinese (zh)
Inventor
张真
王磊
陈伟
王胤然
杨震宇
周亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING INNOVATIVE CLOUD STORAGE TECHNOLOGY Co Ltd
Original Assignee
NANJING INNOVATIVE CLOUD STORAGE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING INNOVATIVE CLOUD STORAGE TECHNOLOGY Co Ltd
Priority to CN2012103064203A
Publication of CN102841944A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for real-time processing of big data and relates to the field of computer application systems. In the method, data ingestion (loading into storage), query, and result transmission are all concurrent and real-time. Index files are filtered while the query task is being created; while filtering proceeds, index files that have already been selected are distributed to the data nodes, each of which queries its local files and returns data to the client. As soon as any data node finishes its query, it returns its results to the user. Because every stage runs concurrently, the method makes full use of the computer hardware. The efficient B+ tree structure combined with parallel query execution lets queries complete in real time and greatly improves query efficiency, so the user obtains results while the query operation is still executing.

Description

A method for real-time processing of big data
Technical field
The present invention relates to the field of computer application systems, and in particular to a method for processing massive data in real time.
Background art
With the development of informatization, the data that enterprises must process has grown explosively, with volumes reaching the TB and even PB level, which brings a series of problems. As data volume increases, system load grows, and ingestion and query performance decline accordingly. How to get maximum performance from the system, so that ingestion and queries run as fast as possible without increasing hardware cost, is a difficult problem facing many enterprises.
The emergence of cloud computing provides an effective approach to massive data processing. In a typical cloud computing solution, HDFS (the distributed file system of Hadoop, a distributed system architecture) makes it easy to store massive data while effectively preventing single points of failure and avoiding unnecessary losses. However, the usual way to retrieve data on HDFS is to launch a global-search MapReduce job (MapReduce being a model for parallel computation over large data sets), which must scan all the data stored on HDFS. In cloud computing, and especially with massive data, this wastes system resources heavily and takes a long time, so it is clearly unsuited to a real production environment.
Summary of the invention
The object of the invention is to overcome the shortcomings of the data processing methods commonly used in existing cloud computing solutions, namely wasted system resources and long processing times, by providing an effective method for real-time processing of massive data.
In the big data real-time processing method of the invention, data ingestion, query, and transmission are all concurrent and real-time:
(1) Real-time ingestion: building on existing HDFS, multiple threads are started on each datanode (data node) to create index files in parallel; each index is built as a B+ tree;
(2) Real-time query: using a distributed computing system, a job (task) is created and submitted on the server side to run the query, which proceeds in three steps:
a. Index filtering is performed on the namenode (control node): because index files are named according to time, the time in the query condition is matched against the index file names to select the index files that satisfy the condition;
b. Tasks are distributed to each datanode, where a B+ tree lookup using the selected index files and the query condition yields the positions of the data that satisfy the condition;
c. Tasks are distributed again; each machine reads the data at the positions obtained in the previous step and returns the query results;
(3) Real-time result transmission: jetty (an open-source servlet container) is used as the web container. While the query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the query result files are read and returned to the client, and the client then sends a continue (continuation) request to the server. The server starts multiple threads to read the query results and returns the data read to the client. If the returned data is empty, the flow ends; otherwise the client sends another continue request. During the query, each datanode returns data to the client as soon as its own query succeeds, without waiting for all datanodes to finish.
In the big data real-time processing method of the invention, data ingestion, query, and transmission are all concurrent and real-time. Index files are filtered while the task is being created; while filtering proceeds, index files that have already been selected are distributed to the datanodes, each of which queries its local files and returns data to the client. As soon as any datanode finishes its query, its results are returned to the user. Every stage of the method runs concurrently, making full use of the computer hardware. The efficient B+ tree structure and parallel query execution let queries complete in real time and greatly improve query efficiency, so the user obtains results while the query operation is still executing.
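The behaviour described here, where results are returned as soon as any datanode finishes, can be illustrated with a short Python sketch. This is only a single-process simulation under stated assumptions: the function `query_datanode`, the node names, and the delays are hypothetical stand-ins, and threads take the place of real datanodes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_datanode(name, delay, rows):
    """Hypothetical stand-in for a local index query on one datanode."""
    time.sleep(delay)  # simulate nodes that finish at different times
    return name, rows


def stream_results(nodes):
    """Yield each node's result as soon as that node finishes,
    without waiting for the slower nodes."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(query_datanode, *node) for node in nodes]
        for future in as_completed(futures):
            yield future.result()


# Made-up nodes: (name, simulated query time in seconds, matching rows)
nodes = [("dn1", 0.3, ["a"]), ("dn2", 0.05, ["b", "c"]), ("dn3", 0.15, [])]
arrival_order = [name for name, _rows in stream_results(nodes)]
```

In this sketch the fastest node is delivered first even though it was submitted second, which is the property the text relies on when it says the user need not wait for every datanode.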
Description of drawings
Fig. 1 is a schematic diagram of a field index based on a B+ tree;
Fig. 2 is the query flowchart of jetty;
Fig. 3 is the flowchart of data query and transmission.
Embodiment
The technical scheme of the invention is described in detail below with reference to the accompanying drawings:
All processing in the invention is executed concurrently, which makes full use of the computer hardware and greatly improves processing efficiency, so that the user obtains query results while the query operation is still executing.
The invention comprises real-time data ingestion, real-time query, and real-time result transmission; ingestion, query, and transmission are all concurrent and real-time.
1. Real-time ingestion
Building on existing HDFS, multiple threads are started on each datanode to create index files in parallel. The index is created as shown in Fig. 1:
Indexes are built on selected significant fields and organized as B+ trees, so each new record only needs to be inserted into the B+ tree. Insertion into a B+ tree takes place only at a leaf node. After each (key, pointer) index entry is inserted, the number of subtrees (entries) in the node is checked against the allowed range. If that number exceeds m (the order of the B+ tree) after the insertion, the leaf node is split into two nodes, and their parent node must then contain the maximum key and the address of each of the two nodes. From this point the problem becomes one of insertion into a non-leaf node. Inserting a key into a non-leaf node is similar to inserting into a leaf node: the number of subtrees in a non-leaf node is bounded above by m, and exceeding this bound likewise triggers a node split. When the root node is split, it has no parent, so a new parent node must be created to serve as the new root of the tree.
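The insertion and splitting rules above can be sketched in Python as follows. This is a minimal in-memory illustration, not the patented implementation: the class names, the choice of splitting a node in half, and the use of (ip, offset) tuples as pointers are assumptions, and the IP address and offsets in the usage lines are made up. As in the text, each parent entry stores the maximum key of the corresponding child.

```python
import bisect


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []  # leaf: record keys; non-leaf: max key of each child
        self.vals = []  # leaf: record pointers; non-leaf: child nodes


class BPlusTree:
    def __init__(self, order=4):
        self.m = order  # a node may hold at most m entries
        self.root = Node(leaf=True)

    def insert(self, key, pointer):
        split = self._insert(self.root, key, pointer)
        if split is not None:
            # The root has no parent, so a new root is created above it.
            left, right = split
            self.root = Node(leaf=False)
            self.root.keys = [left.keys[-1], right.keys[-1]]
            self.root.vals = [left, right]

    def _insert(self, node, key, pointer):
        if node.leaf:
            i = bisect.bisect_left(node.keys, key)
            node.keys.insert(i, key)
            node.vals.insert(i, pointer)
        else:
            # Descend into the first child whose max key covers the new key;
            # keys larger than every max go to the last child.
            i = min(bisect.bisect_left(node.keys, key), len(node.keys) - 1)
            child = node.vals[i]
            split = self._insert(child, key, pointer)
            if split is None:
                node.keys[i] = child.keys[-1]  # child's max may have grown
            else:
                left, right = split
                node.keys[i:i + 1] = [left.keys[-1], right.keys[-1]]
                node.vals[i:i + 1] = [left, right]
        if len(node.keys) > self.m:  # more than m entries: split the node
            return self._split(node)
        return None

    def _split(self, node):
        mid = len(node.keys) // 2
        right = Node(node.leaf)
        right.keys, right.vals = node.keys[mid:], node.vals[mid:]
        node.keys, node.vals = node.keys[:mid], node.vals[:mid]
        return node, right

    def search(self, key):
        node = self.root
        while not node.leaf:
            i = bisect.bisect_left(node.keys, key)
            if i == len(node.keys):
                return None  # key is larger than every key in the tree
            node = node.vals[i]
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.vals[i]
        return None


# Usage with made-up keys and pointers
tree = BPlusTree(order=3)
for ts in [50, 10, 40, 20, 30, 70, 60, 80, 90, 100]:
    tree.insert(ts, ("10.0.0.1", ts * 100))
```

With order 3, the fourth insertion already forces a leaf split and a new root, and a later insertion splits a non-leaf node, so the short usage loop exercises all three cases named in the text.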
2. Real-time query
Using a distributed computing system, the client first sends a GET request to the server, and the server creates and submits a job to run the query, which proceeds in three steps:
1) The namenode performs index filtering. Because index files are named according to time, the time in the query condition (the query condition must include a time condition, i.e. a start time and an end time) is matched against the index file names to select the index files that satisfy the condition.
2) Tasks are distributed to each datanode, where a B+ tree lookup using the selected index files and the query condition yields the positions of the data that satisfy the condition.
A data position records the IP address of the machine (datanode) that stores the data, together with an offset; the machine is located by its IP address, and the data is then found at the offset.
3) Tasks are distributed again; each machine reads the data at the positions obtained in the previous step, returns the query results, and writes the result set to a local file.
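Step 1 of the query (index filtering by file name) can be sketched as follows. The text says only that index files are named according to time; the concrete pattern `idx_YYYYMMDDHH.bpt` and the one-file-per-hour granularity used here are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical naming scheme: one index file per hour
NAME_FORMAT = "idx_%Y%m%d%H.bpt"


def filter_index_files(names, start, end):
    """Return the index files whose hour bucket overlaps [start, end]."""
    selected = []
    for name in names:
        try:
            bucket_start = datetime.strptime(name, NAME_FORMAT)
        except ValueError:
            continue  # not an index file under the assumed naming scheme
        bucket_end = bucket_start + timedelta(hours=1)
        if bucket_start <= end and bucket_end > start:
            selected.append(name)
    return selected


names = [
    "idx_2012082708.bpt",
    "idx_2012082709.bpt",
    "idx_2012082710.bpt",
    "idx_2012082711.bpt",
    "notes.txt",
]
selected = filter_index_files(
    names,
    start=datetime(2012, 8, 27, 9, 30),
    end=datetime(2012, 8, 27, 10, 15),
)
```

Only the file names are inspected, so this step needs no access to the index contents, which is presumably why it can run on the namenode before any task is distributed.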
The efficient B+ tree structure and parallel query execution let queries complete in real time.
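Steps 2 and 3 hand over data positions, each made up of a datanode IP address and an offset. A minimal sketch of how such positions might be grouped per machine and then read locally is shown below. The record layout (one newline-terminated record per offset), the IP addresses, and the sample records are all assumptions; the text does not specify a record format.

```python
import os
import tempfile
from collections import defaultdict


def group_by_node(positions):
    """Group (ip, offset) positions so each datanode receives one read task."""
    tasks = defaultdict(list)
    for ip, offset in positions:
        tasks[ip].append(offset)
    return {ip: sorted(offsets) for ip, offsets in tasks.items()}


def read_records(path, offsets):
    """Read one newline-terminated record starting at each byte offset."""
    records = []
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            records.append(f.readline().rstrip(b"\n").decode("utf-8"))
    return records


# Build a small made-up data file and remember where each record starts.
lines = [
    b"2012-08-27 09:31 login\n",
    b"2012-08-27 09:45 query\n",
    b"2012-08-27 10:02 logout\n",
]
offsets, pos = [], 0
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for line in lines:
        offsets.append(pos)
        tmp.write(line)
        pos += len(line)
    data_path = tmp.name

positions = [
    ("10.0.0.2", offsets[2]),
    ("10.0.0.1", offsets[1]),
    ("10.0.0.2", offsets[0]),
]
tasks = group_by_node(positions)
result = read_records(data_path, tasks["10.0.0.2"])
os.remove(data_path)
```

Sorting the offsets per node keeps the reads sequential within a file; that is a design choice of this sketch rather than something the text requires.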
3. Real-time result transmission
jetty is used as the web container. While the query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the query result files are read and returned to the client, and the client then sends a continue request to the server. The server starts multiple threads to read the query results and returns the data read to the client. If the returned data is empty, the flow ends; otherwise the client sends another continue request. During the query, each datanode returns data to the client as soon as its own query succeeds, without waiting for all datanodes to finish.
Fig. 2 is the query flowchart of jetty, and Fig. 3 is the complete flow of data query and transmission. With jetty as the web container, the client first sends a GET request to the server. The server parses the json (a lightweight data interchange format) string, instantiates a job object from the query condition in the json string, submits the job to run the distributed query, and finally returns the results. While the query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the files are read and returned to the client; the client then sends a continue request to the server, which starts multiple threads to read the query results and returns the data read to the client. If the query result is empty and the query on HDFS has finished, an empty result is returned and the flow ends.

Claims (1)

1. A big data real-time processing method, characterized in that data ingestion, query, and transmission are all concurrent and real-time:
(1) Real-time ingestion: building on existing HDFS, multiple threads are started on each datanode to create index files in parallel; each index is built as a B+ tree;
(2) Real-time query: using a distributed computing system, a job is created and submitted on the server side to run the query, which proceeds in three steps:
a. Index filtering is performed on the namenode: because index files are named according to time, the time in the query condition is matched against the index file names to select the index files that satisfy the condition;
b. Tasks are distributed to each datanode, where a B+ tree lookup using the selected index files and the query condition yields the positions of the data that satisfy the condition;
c. Tasks are distributed again; each machine reads the data at the positions obtained in the previous step and returns the query results;
(3) Real-time result transmission: jetty is used as the web container. While the query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the query result files are read and returned to the client, and the client then sends a continue request to the server. The server starts multiple threads to read the query results and returns the data read to the client. If the returned data is empty, the flow ends; otherwise the client sends another continue request. During the query, each datanode returns data to the client as soon as its own query succeeds, without waiting for all datanodes to finish.
CN2012103064203A 2012-08-27 2012-08-27 Method achieving real-time processing of big data Pending CN102841944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012103064203A CN102841944A (en) 2012-08-27 2012-08-27 Method achieving real-time processing of big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012103064203A CN102841944A (en) 2012-08-27 2012-08-27 Method achieving real-time processing of big data

Publications (1)

Publication Number Publication Date
CN102841944A true CN102841944A (en) 2012-12-26

Family

ID=47369307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012103064203A Pending CN102841944A (en) 2012-08-27 2012-08-27 Method achieving real-time processing of big data

Country Status (1)

Country Link
CN (1) CN102841944A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950297A (en) * 2010-09-10 2011-01-19 北京大学 Method and device for storing and inquiring mass semantic data
CN102375853A (en) * 2010-08-24 2012-03-14 中国移动通信集团公司 Distributed database system, method for building index therein and query method
CN102521307A (en) * 2011-12-01 2012-06-27 北京人大金仓信息技术股份有限公司 Parallel query processing method for share-nothing database cluster in cloud computing environment
CN102521406A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Distributed query method and system for complex task of querying massive structured data

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424326A (en) * 2013-09-09 2015-03-18 华为技术有限公司 Data processing method and device
CN104424326B (en) * 2013-09-09 2018-06-15 华为技术有限公司 A kind of data processing method and device
CN104572865A (en) * 2014-12-18 2015-04-29 泸州医学院 Method for batch query of HBase (Hadoop Database) data for Servlet in TTLB (time to last byte) based on Lua
CN104572865B (en) * 2014-12-18 2018-03-20 泸州医学院 Servlet batch queries HBase data methods in TTLB based on Lua
CN106649847A (en) * 2016-12-30 2017-05-10 南威软件股份有限公司 A large data real-time processing system based on Hadoop
CN109992575A (en) * 2019-02-12 2019-07-09 哈尔滨学院 The distributed memory system of big data
CN109992575B (en) * 2019-02-12 2020-02-14 哈尔滨学院 Distributed storage system for big data
CN110781001A (en) * 2019-10-23 2020-02-11 广东浪潮大数据研究有限公司 Kubernetes-based container environment variable checking method
CN110781001B (en) * 2019-10-23 2023-03-28 广东浪潮大数据研究有限公司 Kubernetes-based container environment variable checking method

Similar Documents

Publication Publication Date Title
CN101031907B (en) Index processing
CN103246749B (en) The matrix database system and its querying method that Based on Distributed calculates
CN104111996A (en) Health insurance outpatient clinic big data extraction system and method based on hadoop platform
CN103927331B (en) Data querying method, data querying device and data querying system
US20110154339A1 (en) Incremental mapreduce-based distributed parallel processing system and method for processing stream data
CN102841944A (en) Method achieving real-time processing of big data
CN105138661A (en) Hadoop-based k-means clustering analysis system and method of network security log
CN111400326A (en) Smart city data management system and method thereof
CN102932415A (en) Method and device for storing mirror image document
WO2015058578A1 (en) Method, apparatus and system for optimizing distributed computation framework parameters
CN111258978B (en) Data storage method
CN107203532B (en) Index system construction method, search realization method and device
US20200320037A1 (en) Persistent indexing and free space management for flat directory
CN105930417B (en) A kind of big data ETL interactive process platform based on cloud computing
CN104111958A (en) Data query method and device
CN106294814A (en) HBase secondary index based on memory database builds and the device and method of inquiry
CN103778251A (en) SPARQL parallel query method facing large-scale RDF graph data
CN101562664A (en) Ticket processing method and system
CN103678550A (en) Mass data real-time query method based on dynamic index structure
CN103810272A (en) Data processing method and system
CN106156319A (en) Telescopic distributed resource description framework data storage method and device
CN110083600A (en) A kind of method, apparatus, calculating equipment and the storage medium of log collection processing
CN102207935A (en) Method and system for establishing index
CN111125213A (en) Data acquisition method, device and system
CN103577614A (en) Data acquisition method and system oriented to SAP PI application integration platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20121226