CN107784093A - A distributed big data processing system

Info

Publication number: CN107784093A
Application number: CN201710954633.XA
Authority: CN
Inventors: 张炜刚
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2018-03-09

Abstract

The present invention provides a distributed big data processing system. A data input control module receives data transmitted to the system and passes the received data to a data management control unit. The data management control unit receives the data sent by each data input control module, processes it, and routes it to a data output module according to the processing result. The data output module receives the data sent by the data management control unit and forwards it to a data processing unit according to the data processing type. The data processing unit processes the data according to the type of data received. The system performs stream processing of big data over the TCP protocol or through a WEB interface. Users can quickly build and start a distributed data processing flow, the processing of every piece of data in the flow is recorded so that users can trace its origin, and the system can connect to a variety of big data components to complete the circulation of data.

Description

A distributed big data processing system

Technical field

The present invention relates to the field of big data processing, and in particular to a distributed big data processing system.

Background art

Big data computation falls into two modes: batch computing and stream computing. The two modes suit different scenarios. Batch computing must store the data first and compute afterwards, so its real-time performance is low. In stream computing, data is processed within a time window, so its real-time performance is stronger.

With the rapid development of emerging technologies such as the Internet of Things, the mobile Internet and social media, data is produced and propagated ever faster, while the value of data also declines sharply over time. Quickly extracting value from the massive data that is continuously being generated has become an urgent need.

The relatively mature big data stream-processing frameworks currently on the market are Spark, Storm and Samza. All three real-time computing systems are open-source distributed systems and offer advantages such as low latency, scalability and high fault tolerance. They also have certain shortcomings, for example: they cannot respond promptly to changing requirements, because the application must be repackaged and uploaded; the data processing procedure is not intuitive; and none of them provides a data tracing function.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides a distributed big data processing system, comprising: several data input control modules, a data management control unit, a data output module and a data processing unit;

Each data input control module receives data transmitted to the system and passes the received data to the data management control unit;

The data management control unit receives the data sent by each data input control module, processes the data, and routes the data to the data output module for output according to the processing result;

The data output module receives the data sent by the data management control unit and forwards it to the data processing unit according to the data processing type;

The data processing unit processes the data according to the type of data received;

The data management control unit includes a document management module;

The document management module stores the received data files in a hash map in JVM memory and records the metadata of the currently received data in a write-ahead log; the metadata includes all attributes of the data, a pointer to the data content, and the state of the data.

Preferably, the write-ahead log function of the document management module provides the ability to handle restarts and system exceptions;

The data files received by the document management module include: host power-failure data, kernel panic data, system upgrade data and periodic maintenance data.

Preferably, the data management control unit further includes a data content management module;

The data content management module manages data files using immutability and copy-on-write; the content of a data file is stored on disk and is read into JVM memory when the data file is read.

Preferably, the data management control unit further includes a source data management module;

The source data management module stores the history of data files so that every piece of received data can be traced to its origin. Any event that occurs on a data file at any time creates a new source event. A source event is a snapshot of the data file at that moment: it copies the attributes of the data file and the pointer to the data file's content and records all states of the data file, and these are stored in the source data management module.

Source events include: creation of a data file, copying of a data file, and modification of a data file.

Preferably, the data processing unit includes: an HDFS processing module, an HBASE processing module and a KAFKA processing module.

Preferably, the data input control module receives data transmitted to the system over the TCP protocol, through a socket, or through a WEB interface.

Preferably, data circulates between the data input control modules, the data management control unit, the data output module and the data processing unit in Avro format.

As can be seen from the above technical solutions, the present invention has the following advantages:

The distributed big data processing system performs stream processing of big data over the TCP protocol, through a socket, or through a WEB interface. Users can quickly build and start a distributed data processing flow, the processing of every piece of data in the flow is recorded so that users can trace its origin, and the system can connect to a variety of big data components to complete the circulation of data.

Brief description of the drawings

To illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is an overall schematic diagram of the distributed big data processing system;

Fig. 2 is a schematic diagram of an embodiment of the distributed big data processing system.

Detailed description of the embodiments

To make the objects, features and advantages of the present invention clearer and easier to understand, the technical solution protected by the present invention is described clearly and completely below with reference to specific embodiments and the accompanying drawings. The embodiments disclosed below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this patent without creative effort fall within the scope of protection of this patent.

This embodiment provides a distributed big data processing system which, as shown in Fig. 1 and Fig. 2, comprises: several data input control modules 1, a data management control unit 2, a data output module 3 and a data processing unit 4;

Each data input control module 1 receives data transmitted to the system and passes the received data to the data management control unit 2;

The data management control unit 2 receives the data sent by each data input control module, processes the data, and routes the data to the data output module 3 for output according to the processing result;

The data output module 3 receives the data sent by the data management control unit 2 and forwards it to the data processing unit 4 according to the data processing type; the data processing unit 4 processes the data according to the type of data received. The data processing types include: a data type for HDFS data processing, a data type for HBASE data processing, and a data type for KAFKA data processing.
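By way of illustration only, the type-based forwarding performed by the data output module 3 might be sketched as follows in Python. The names `Record`, `DataOutputModule`, `register` and `dispatch` are hypothetical and do not appear in the disclosure, and the three handlers are in-memory stand-ins rather than real HDFS, HBASE or KAFKA clients.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Record:
    """One unit of data moving through the system (hypothetical shape)."""
    data_type: str          # "HDFS", "HBASE" or "KAFKA"
    payload: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


class DataOutputModule:
    """Forwards each record to the processing module registered for its type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Record], None]] = {}

    def register(self, data_type: str, handler: Callable[[Record], None]) -> None:
        self._handlers[data_type] = handler

    def dispatch(self, record: Record) -> None:
        handler = self._handlers.get(record.data_type)
        if handler is None:
            raise ValueError(f"no processing module for type {record.data_type!r}")
        handler(record)


# Stand-ins for the HDFS, HBASE and KAFKA processing modules: each one only
# collects what it receives instead of talking to a real cluster.
received: Dict[str, List[bytes]] = {"HDFS": [], "HBASE": [], "KAFKA": []}

output = DataOutputModule()
for name in received:
    output.register(name, lambda rec, sink=received[name]: sink.append(rec.payload))

output.dispatch(Record("KAFKA", b"sensor-reading"))
output.dispatch(Record("HDFS", b"archive-block"))
```

Keeping the mapping from data type to handler in a table means that a further processing module could be added by registering one more handler, without changing the dispatch logic.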

The data management control unit 2 includes a document management module 21. The document management module 21 stores the received data files in a hash map in JVM memory, so that the data being processed can be retrieved very efficiently, and records the metadata of the currently received data in a write-ahead log. The metadata includes all attributes of the data, a pointer to the data content, and the state of the data. The write-ahead log function of the document management module provides the ability to handle restarts and system exceptions. The data files received by the document management module include: host power-failure data, kernel panic data, system upgrade data and periodic maintenance data.
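The combination of an in-memory hash map and a write-ahead log can be illustrated with the following minimal Python sketch. It is not the implementation of the embodiment, which runs on the JVM; the class `DocumentManager`, its methods, the JSON line format and the state names are all assumptions made for the example. Each metadata update is appended to the log and forced to disk before the map is changed, so that a new instance can rebuild the map after a restart.

```python
import json
import os
import tempfile
from typing import Dict


class DocumentManager:
    """In-memory metadata map backed by an append-only write-ahead log."""

    def __init__(self, wal_path: str) -> None:
        self._wal_path = wal_path
        self._files: Dict[str, dict] = {}   # the in-memory hash map
        self._replay()

    def _replay(self) -> None:
        """Rebuild the map from the log, for example after a restart."""
        if not os.path.exists(self._wal_path):
            return
        with open(self._wal_path, "r", encoding="utf-8") as wal:
            for line in wal:
                entry = json.loads(line)
                self._files[entry["id"]] = entry["meta"]

    def put(self, file_id: str, attributes: Dict[str, str],
            content_pointer: str, state: str) -> None:
        meta = {"attributes": attributes,
                "content_pointer": content_pointer,
                "state": state}
        # Write the log entry and force it to disk first, then update memory.
        with open(self._wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps({"id": file_id, "meta": meta}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self._files[file_id] = meta

    def get(self, file_id: str) -> dict:
        return self._files[file_id]


wal_file = os.path.join(tempfile.mkdtemp(), "metadata.wal")
manager = DocumentManager(wal_file)
manager.put("f1", {"source": "tcp"}, "content/0001", "QUEUED")
manager.put("f1", {"source": "tcp"}, "content/0001", "ROUTED")

# Simulate a restart: a fresh instance recovers the latest state from the log.
recovered = DocumentManager(wal_file)
```

A production log would also need to tolerate a partially written final line and to be compacted periodically; both are omitted here for brevity.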

In this embodiment, the data management control unit 2 further includes a data content management module 22 and a source data management module 23;

The data content management module 22 manages data files using immutability and copy-on-write, to ensure maximum speed and thread safety. The content of a data file is stored on disk and is read into JVM memory when the data file is read. In this way only the small amount of data actually needed is handled, and the full content does not have to be read into the JVM. Operations such as splitting, aggregating and transforming large objects are therefore easy to perform without exhausting memory.
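The following Python sketch illustrates immutability and copy-on-write for content kept on disk. The class `ContentRepository` and its methods are hypothetical names chosen for the example: stored content is never overwritten, a modification produces new content under a new pointer, and a reader may load only a slice of the content into memory.

```python
import os
import tempfile
import uuid


class ContentRepository:
    """Stores content on disk; stored content is never modified in place."""

    def __init__(self, root: str) -> None:
        self._root = root

    def write(self, data: bytes) -> str:
        """Store new content and return a pointer (here: a file name)."""
        pointer = uuid.uuid4().hex
        with open(os.path.join(self._root, pointer), "wb") as f:
            f.write(data)
        return pointer

    def read(self, pointer: str, offset: int = 0, length: int = -1) -> bytes:
        """Load only the requested slice of the content into memory."""
        with open(os.path.join(self._root, pointer), "rb") as f:
            f.seek(offset)
            return f.read(length)

    def modify(self, pointer: str, transform) -> str:
        """Copy-on-write: store the transformed bytes as new content."""
        return self.write(transform(self.read(pointer)))


repo = ContentRepository(tempfile.mkdtemp())
original = repo.write(b"temperature=21;humidity=40")
updated = repo.modify(original, lambda b: b.replace(b"21", b"22"))
```

Because the original content stays untouched, any reader still holding the old pointer continues to see consistent data without locking. For brevity `modify` loads the whole content; a streaming transform would avoid that for very large objects.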

The source data management module 23 stores the history of data files so that every piece of received data can be traced to its origin. Any event that occurs on a data file at any time creates a new source event. A source event is a snapshot of the data file at that moment: it copies the attributes of the data file and the pointer to the data file's content and records all states of the data file, and these are stored in the source data management module. Source events include: creation of a data file, copying of a data file, and modification of a data file.
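A minimal Python sketch of such snapshot-style source events is given below, purely as an illustration. `SourceEvent`, `SourceDataManager`, `record` and `trace` are hypothetical names, and the event types simply mirror the three kinds of event listed above. Each event copies the attributes, so later changes to the live data file do not alter the recorded history.

```python
import copy
import time
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SourceEvent:
    """Snapshot of a data file taken when an event occurs."""
    event_type: str               # "CREATE", "COPY" or "MODIFY"
    file_id: str
    attributes: Dict[str, str]    # copy of the attributes at that moment
    content_pointer: str
    state: str
    timestamp: float


class SourceDataManager:
    """Append-only history of source events, queryable per data file."""

    def __init__(self) -> None:
        self._events: List[SourceEvent] = []

    def record(self, event_type: str, file_id: str, attributes: Dict[str, str],
               content_pointer: str, state: str) -> None:
        self._events.append(SourceEvent(
            event_type, file_id, copy.deepcopy(attributes),
            content_pointer, state, time.time()))

    def trace(self, file_id: str) -> List[SourceEvent]:
        """Return the history of one data file, oldest event first."""
        return [e for e in self._events if e.file_id == file_id]


history = SourceDataManager()
attrs = {"source": "tcp"}
history.record("CREATE", "f1", attrs, "content/0001", "RECEIVED")
attrs["route"] = "KAFKA"          # later change to the live attributes
history.record("MODIFY", "f1", attrs, "content/0002", "ROUTED")
history.record("COPY", "f2", attrs, "content/0002", "ROUTED")

lineage = [(e.event_type, e.content_pointer) for e in history.trace("f1")]
```

Here `lineage` lists the events of data file `f1` in order, and the first snapshot still carries only the attributes that existed when the file was created.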

The data management control unit 2 stores the original data of the flow files currently in the data flow, the content repository stores the content of current and historical files, and the data management control unit 2 also stores the history of the files.

The distributed big data processing system is based on a workflow-style programming paradigm; it is easy to use, reliable and highly configurable, and it provides data backtracking. The WEB user interface allows users to understand and interact with the data flow intuitively and to iterate more quickly and safely. The data backtracking feature allows users to see how an object flows through the system, to replay it, and to visualize what happened before and after key steps.

With its stream-processing architecture, the distributed big data processing system can efficiently process the massive data generated by the Internet of Things, mobile terminals and similar sources. It supports rapid changes to the processing flow to meet continually changing requirements, and it includes data backtracking. It can be applied effectively to scenarios such as financial risk control and anti-fraud, and early warning for at-risk persons.

In this embodiment, the data processing unit includes: an HDFS processing module, an HBASE processing module and a KAFKA processing module. Data can thus be processed in different ways according to the usage scenario or operating environment.

In this embodiment, the data input control module receives data transmitted to the system over the TCP protocol, through a socket, or through a WEB interface. The system can therefore obtain data through multiple channels.

Each data input control module may use a single data receiving mode, and combining several data input control modules provides a variety of different data receiving modes.
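As an illustration of a single receiving mode, the Python sketch below reads newline-delimited records from a stream socket and places them in a shared queue, which stands in for the input of the data management control unit. The function `receive_records`, the newline framing and the queue are assumptions made for the example; the disclosure does not specify a wire format. The demonstration uses `socket.socketpair()` so that no port has to be opened, whereas a real TCP input module would accept connections on a listening socket.

```python
import queue
import socket
import threading


def receive_records(conn: socket.socket, inbox: "queue.Queue[bytes]") -> None:
    """Read newline-delimited records from a stream socket into a queue."""
    buffer = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:                 # peer closed the connection
            break
        buffer += chunk
        while b"\n" in buffer:
            record, buffer = buffer.split(b"\n", 1)
            inbox.put(record)
    if buffer:                        # trailing record without a newline
        inbox.put(buffer)
    conn.close()


# One shared inbox stands in for the data management control unit's input.
inbox: "queue.Queue[bytes]" = queue.Queue()

# socketpair() returns two connected stream sockets without opening a port.
sender, receiver = socket.socketpair()
worker = threading.Thread(target=receive_records, args=(receiver, inbox))
worker.start()

sender.sendall(b"reading-1\nreading-2\nread")
sender.sendall(b"ing-3")              # a record may arrive split across sends
sender.close()
worker.join()

records = [inbox.get_nowait() for _ in range(inbox.qsize())]
```

Several such receivers, each with its own receiving mode, could feed the same queue, which corresponds to combining several data input control modules.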

Data circulates between the data input control modules, the data management control unit, the data output module and the data processing unit in Avro format, which improves the efficiency of data processing and circulation inside the system.
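The disclosure does not give the Avro schema used between the modules. The Python sketch below shows what one such schema might look like; the record name `DataEnvelope`, the namespace and every field are hypothetical. Avro schemas are written in JSON, so the dictionary is simply serialized to JSON text; encoding actual records would additionally require an Avro library, which is not shown.

```python
import json

# Hypothetical Avro record schema for the envelope passed between modules.
ENVELOPE_SCHEMA = {
    "type": "record",
    "name": "DataEnvelope",
    "namespace": "example.pipeline",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "data_type",
         "type": {"type": "enum", "name": "DataType",
                  "symbols": ["HDFS", "HBASE", "KAFKA"]}},
        {"name": "attributes", "type": {"type": "map", "values": "string"}},
        {"name": "payload", "type": "bytes"},
    ],
}

# The JSON text could be saved as an .avsc file and given to an Avro
# implementation in any language.
schema_json = json.dumps(ENVELOPE_SCHEMA, indent=2)
```

Sharing one schema across all modules lets each module read and write the same compact binary records without per-module parsing code.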

The system provides a WEB-based user interface through which data flows can be designed, controlled and monitored, with feedback to the user. The system supports multiple data controllers for connecting to big data and Internet of Things components, and also supports user-defined controllers. The system ensures the reliability of data through a persistent write-ahead log (WAL) and the content repository. The system can trace the history of data. The system can be deployed in a distributed manner.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A distributed big data processing system, characterized by comprising: several data input control modules, a data management control unit, a data output module and a data processing unit;

Each data input control module receives data transmitted to the system and passes the received data to the data management control unit;

The data management control unit receives the data sent by each data input control module, processes the data, and routes the data to the data output module for output according to the processing result;

The data output module receives the data sent by the data management control unit and forwards it to the data processing unit according to the data processing type;

The data processing unit processes the data according to the type of data received;

The data management control unit includes a document management module;

The document management module stores the received data files in a hash map in JVM memory and records the metadata of the currently received data in a write-ahead log; the metadata includes all attributes of the data, a pointer to the data content, and the state of the data.

2. The distributed big data processing system according to claim 1, characterized in that

The write-ahead log function of the document management module provides the ability to handle restarts and system exceptions;

The data files received by the document management module include: host power-failure data, kernel panic data, system upgrade data and periodic maintenance data.

3. The distributed big data processing system according to claim 1 or 2, characterized in that

The data management control unit further includes a data content management module;

The data content management module manages data files using immutability and copy-on-write; the content of a data file is stored on disk and is read into JVM memory when the data file is read.

4. The distributed big data processing system according to claim 1 or 2, characterized in that

The data management control unit further includes a source data management module;

The source data management module stores the history of data files so that every piece of received data can be traced to its origin; any event that occurs on a data file at any time creates a new source event; a source event is a snapshot of the data file at that moment; the source event copies the attributes of the data file and the pointer to the data file's content and records all states of the data file, and these are stored in the source data management module.

Source events include: creation of a data file, copying of a data file, and modification of a data file.

5. The distributed big data processing system according to claim 1 or 2, characterized in that

The data processing unit includes: an HDFS processing module, an HBASE processing module and a KAFKA processing module.

6. The distributed big data processing system according to claim 1 or 2, characterized in that

The data input control module receives data transmitted to the system over the TCP protocol, through a socket, or through a WEB interface.

7. The distributed big data processing system according to claim 1 or 2, characterized in that

Data circulates between the data input control modules, the data management control unit, the data output module and the data processing unit in Avro format.