CN107506482A

CN107506482A - Large-scale data processing device and method based on a stream processing framework

Info

Publication number: CN107506482A
Application number: CN201710835168.8A
Authority: CN
Inventors: 王军; 黄丽仪
Original assignee: Hunan Xinghan Shuzhi Technology Co Ltd
Current assignee: Hunan Xinghan Shuzhi Technology Co Ltd
Priority date: 2017-06-26
Filing date: 2017-09-15
Publication date: 2017-12-22
Also published as: CN107229747A

Abstract

The invention discloses a large-scale data processing device and method based on a stream processing framework. The device includes: a topology construction module, for building a data processing topology from an XML configuration file; a data reading module, for reading tagged raw data from a data source and loading the logic configuration file that corresponds to the tag, to obtain data with the logic configuration attached; a data processing module, for receiving the data with the logic configuration attached, dynamically invoking the processing method, generating processing results and splitting them into streams; an aggregation module, for receiving the split processing results and aggregating them; and a storage module, for receiving the aggregated results and storing them in the specified storage medium. Because the invention is based on a stream processing framework, data is processed quickly and newly added data is handled promptly; newly added processing rules configured in Redis can be invoked dynamically; multiple data insertion modes are supported; and only simple configuration changes are needed to use the invention in different scenarios, which gives it a certain application prospect.

Description

A large-scale data processing device and method based on a stream processing framework

Technical field

The present invention relates to the field of computer technology, and in particular to a large-scale data processing device and method based on a stream processing framework.

Background technology

At present, large-scale data is typically processed by a single multithreaded instance. This approach usually runs on one server and is closely tailored to a specific business, but it offers little configurability. With the explosive growth of data, this traditional approach can no longer meet the speed and performance requirements of large-scale data processing. Its main shortcomings are as follows:

1. A single server suffers from performance bottlenecks such as poor network stability and excessive CPU usage, so data is not processed fast enough and newly added data cannot be handled promptly.

2. No configuration file intervenes in the processing. Processing rules cannot be configured dynamically, and the program must be restarted whenever a rule changes.

3. The insert statement (for example, SQL) is fixed for a given run and cannot be changed dynamically. Only a single insertion mode is supported.

4. A data processing program can be used in only one business scenario. It is tightly coupled to the specific business, has poor independence, and is hard to migrate.

Therefore, a large-scale data processing device and method based on a stream processing framework is urgently needed.

Summary of the invention

Purpose of the invention: to solve the technical problems described in the background, a large-scale data processing device and method based on a stream processing framework is provided.

To achieve this purpose, the technical solution adopted by the invention is a large-scale data processing device based on a stream processing framework, including:

A topology construction module, for building a data processing topology from an XML configuration file and establishing the connections between the data processing topology and the data source and storage medium;

A data reading module, for reading tagged raw data from the data source and loading the logic configuration file corresponding to the tag, to obtain data with the logic configuration attached; the logic configuration file includes processing logic, a processing method and storage logic;

A data processing module, for receiving the data with the logic configuration attached, dynamically invoking the corresponding processing method according to the processing logic in the logic configuration, generating processing results, and splitting them into streams according to the storage logic;

An aggregation module, for receiving the split processing results and aggregating them to obtain an aggregated result;

A storage module, for receiving the aggregated result and storing it in the specified storage medium according to the storage logic.

Further, the data source is message middleware or a persistent storage medium.

Further, the message middleware includes Kafka, for caching the raw data, and Redis, for caching the logic configuration files; the persistent storage medium includes the relational database MySQL and the index Solr.

Further, the storage medium also includes MongoDB.

Further, the storage logic includes the database instance, table name, insertion mode and insertion fields.
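As a rough illustration of how storage logic with these four parts could drive a write, the sketch below builds a parameterized statement from a hypothetical `StorageLogic` record. The record layout, the two insertion modes (`insert` and `replace`) and the use of SQLite as a stand-in for MySQL are assumptions made for the example; the patent does not specify them.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLogic:
    """Hypothetical container for the four parts of the storage logic."""
    instance: str   # database instance (a SQLite path in this sketch)
    table: str      # target table name
    mode: str       # insertion mode: "insert" or "replace" (assumed modes)
    fields: tuple   # names of the fields to insert


def build_statement(logic):
    """Turn storage logic into a parameterized SQL statement."""
    verbs = {"insert": "INSERT", "replace": "REPLACE"}
    if logic.mode not in verbs:
        raise ValueError(f"unsupported insertion mode: {logic.mode}")
    # Identifiers come from configuration, so check them before interpolating.
    for name in (logic.table, *logic.fields):
        if not name.isidentifier():
            raise ValueError(f"invalid identifier: {name}")
    columns = ", ".join(logic.fields)
    marks = ", ".join("?" for _ in logic.fields)
    return f"{verbs[logic.mode]} INTO {logic.table} ({columns}) VALUES ({marks})"


def store_batch(conn, logic, rows):
    """Write a batch of result dicts in one call, as the storage module would."""
    params = [tuple(row[f] for f in logic.fields) for row in rows]
    conn.executemany(build_statement(logic), params)
    conn.commit()
```

Because the statement is derived from configuration rather than hard-coded, changing the table, the fields or the insertion mode only requires a configuration change, which is the property the storage logic is meant to provide.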

The invention also provides a large-scale data processing method based on a stream processing framework, including the following steps:

Step 1: the topology construction module builds a data processing topology from an XML configuration file and establishes the connections between the data processing topology and the data source and storage medium;

Step 2: the data reading module reads tagged raw data from the data source, loads the logic configuration file corresponding to the tag, and sends the resulting data, with the logic configuration attached, to the data processing module; the logic configuration file includes processing logic, a processing method and storage logic;

Step 3: the data processing module receives the data with the logic configuration attached, dynamically invokes the corresponding processing method according to the processing logic in the logic configuration, generates processing results, and splits them into streams according to the storage logic;

Step 4: the aggregation module receives the split processing results, aggregates them, and sends the aggregated result to the storage module;

Step 5: the storage module receives the aggregated result and stores it in the specified storage medium according to the storage logic.

Further, the storage medium also includes MongoDB.

The beneficial effects of the invention are as follows. The invention is based on a stream processing framework such as Storm or Spark and can be deployed as a cluster across multiple servers, so data is processed quickly and newly added data is handled promptly. Newly added processing rules configured in Redis can be invoked dynamically without restarting the program. The data sources can be extended horizontally, and multiple data insertion modes are supported. In a different scenario, only simple configuration changes are needed to use the invention, which gives it a certain application prospect.

Brief description of the drawings

Fig. 1 is a structural block diagram of the large-scale data processing device based on a stream processing framework according to Embodiment 1 of the invention.

Fig. 2 is the main flowchart of the large-scale data processing method based on a stream processing framework according to Embodiment 2 of the invention.

Detailed description

To make the technical problem solved, the technical solution adopted and the technical effect achieved by the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

Embodiment 1

Environment preparation: the general-purpose large-scale data processing device of this embodiment depends on a stream processing framework, on the Kafka and Redis message middleware, and on the media used for data storage. This embodiment uses Storm as the underlying processing framework. Before the device is deployed, these environments must already be in place.

Configuration preparation: when the topology is built, the Storm-related configuration in the XML configuration file and in Redis is loaded to determine which modules to load and Storm's runtime parameters (for example, the number of Storm workers and tasks); when the connections between the topology and the data source and storage medium are established, the Properties configuration file is loaded; during data processing, the configuration in Redis is loaded dynamically so that runtime thresholds can be changed in real time (for example, the batch commit size, the wait timeout and the data processing rules).
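The patent does not give the schema of the XML configuration file. As a minimal sketch of the idea, the example below parses a hypothetical topology description into a worker count and a list of components. The element names (`topology`, `spout`, `bolt`), the attribute names (`workers`, `tasks`, `source`, `class`) and the class names are all assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Hypothetical configuration: which modules to load and how many tasks each gets.
EXAMPLE_XML = """
<topology name="demo" workers="2">
  <spout id="reader" class="KafkaReaderSpout" tasks="2"/>
  <bolt id="process" class="ProcessBolt" tasks="4" source="reader"/>
  <bolt id="aggregate" class="AggregateBolt" tasks="2" source="process"/>
  <bolt id="store" class="StoreBolt" tasks="2" source="aggregate"/>
</topology>
"""


@dataclass
class Component:
    id: str
    kind: str               # "spout" or "bolt"
    cls: str                # implementation class named in the configuration
    tasks: int              # parallelism of this component
    source: Optional[str]   # upstream component id (bolts only)


def load_topology(xml_text):
    """Parse the configuration into (worker count, ordered component list)."""
    root = ET.fromstring(xml_text)
    workers = int(root.get("workers", "1"))
    components = []
    for el in root:
        if el.tag not in ("spout", "bolt"):
            continue  # ignore elements this sketch does not know about
        components.append(Component(
            id=el.get("id"),
            kind=el.tag,
            cls=el.get("class"),
            tasks=int(el.get("tasks", "1")),
            source=el.get("source"),
        ))
    # Every bolt must read from a component that is actually declared.
    known = {c.id for c in components}
    for c in components:
        if c.kind == "bolt" and c.source not in known:
            raise ValueError(f"bolt {c.id!r} has unknown source {c.source!r}")
    return workers, components
```

With a description like this, dropping a `bolt` element from the file is enough to leave that module out of the topology, which matches the selective loading described for this device.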

Referring to Fig. 1, the large-scale data processing device based on a stream processing framework of this embodiment includes:

A topology construction module, for building a data processing topology on Storm from an XML file and establishing the connections between the data processing topology and the data source and all storage media; this module is the basis on which the subsequent modules run;

A data reading module, for reading tagged raw data from the data source and loading the logic configuration file corresponding to the tag, to obtain data with the logic configuration attached. The data source is message middleware or a persistent storage medium. The message middleware includes Kafka, for caching the raw data, and Redis, for caching the logic configuration files; the persistent storage medium includes the relational database MySQL and the index Solr. The tag is the field used to distinguish processing methods; it can be, for example, the table name of the source database or the address of the source website. The logic configuration file includes processing logic, a processing method and storage logic;

A data processing module, for receiving the data with the logic configuration attached, and dynamically invoking the corresponding processing method according to the processing logic in the logic configuration (for example, if the configuration file specifies that data from the Student table in MongoDB is handled by parseStudent(), the module dynamically invokes that method to process the data), generating processing results and splitting them into streams according to the storage logic (for example, data that must be inserted into the same MySQL table with the same insertion fields is sent to the same stream). Because the raw data types and parsing requirements are not known in advance, the specific processing logic must be written by the user, and the processing logic for each data type is specified in the logic configuration file. The storage logic includes the database instance, table name, insertion mode and insertion fields; the first three must be specified.

An aggregation module, for receiving the split processing results and aggregating results with the same storage logic onto the same thread; data carrying a priority flag is sent directly to the storage module, while other data is buffered and sent once the configured timeout or count is reached;

A storage module, for receiving aggregated results in batches and storing (inserting) them into the specified storage medium according to the storage logic; the storage medium is the relational database MySQL, the distributed file storage database MongoDB, or the index Solr.
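The tag lookup, the dynamic invocation and the stream splitting described for the data reading and data processing modules can be sketched as follows. This is only an illustration under stated assumptions: a plain dictionary stands in for Redis, the JSON layout of the logic configuration is invented, and `parse_student` is a hypothetical handler that mirrors the `parseStudent()` example in the text.

```python
import json

# Stand-in for Redis: tag -> logic configuration stored as JSON (assumed layout).
CONFIG_STORE = {
    "Student": json.dumps({
        "method": "parse_student",
        "storage": {
            "instance": "school",
            "table": "student",
            "mode": "insert",
            "fields": ["id", "name"],
        },
    }),
}


def parse_student(raw):
    """Hypothetical user-written handler for records tagged 'Student'."""
    student_id, name = raw.split(",", 1)
    return {"id": int(student_id), "name": name.strip()}


# Registry of user-written processing methods, looked up by configured name.
HANDLERS = {"parse_student": parse_student}


def attach_logic(record, store=CONFIG_STORE):
    """Data reading step: load the configuration that matches the record's tag.

    The store is read on every call, so a configuration added or changed at
    run time is picked up without restarting anything.
    """
    logic = json.loads(store[record["tag"]])
    return {**record, "logic": logic}


def process(record, handlers=HANDLERS):
    """Data processing step: invoke the configured method, then pick a stream.

    Results that share the same storage logic get the same stream key, so they
    can later be aggregated and written together.
    """
    logic = record["logic"]
    result = handlers[logic["method"]](record["raw"])
    s = logic["storage"]
    stream_key = (s["instance"], s["table"], s["mode"], tuple(s["fields"]))
    return stream_key, result
```

A short usage example: `process(attach_logic({"tag": "Student", "raw": "7, Li Lei"}))` looks up the `Student` configuration, calls `parse_student`, and returns the stream key together with the parsed record.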

The large-scale data processing device of this embodiment is divided into five main modules, which are chained into a complete processing flow through Storm's stream mechanism. A global configuration is loaded at startup; it specifies which modules to load, the number of threads for each module and the source of the raw data. The device has the following advantages:

(1) Fast data processing. The device is based on a stream processing framework such as Storm, Spark, Samza or JStorm and can be deployed as a cluster across multiple servers, which speeds up processing and makes full use of server performance. If higher throughput is needed, it is enough to add server resources horizontally.

(2) Data is not lost. Data is read from Kafka, which ensures that each record is processed at least once.

(3) Extensibility. The data source can be message middleware or a persistent storage database and can be extended horizontally. The same applies to storage media: storage to MySQL and Solr is already supported, and insertion into databases such as MongoDB can be added.

(4) Processing logic can be loaded dynamically. Newly added processing logic only needs to be configured in Redis to be invoked dynamically.

(5) The topology structure can be loaded dynamically. The bolts that process data are loaded through XML, so they can be loaded selectively for different data processing jobs, which reduces server load.

(6) A data aggregation mechanism is used so that data is inserted in batches, which reduces network overhead.

(7) The entire data processing flow is determined by configuration files. For a different business, modifying the configuration files is enough to meet the processing requirements.
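The aggregation behavior behind the aggregation module and advantage (6) can be sketched as a small buffer keyed by stream. This is a minimal single-threaded illustration, not the Storm bolt itself: the class name `Aggregator`, its parameters and the `tick` method for checking timeouts are assumptions introduced for the example.

```python
import time


class Aggregator:
    """Buffer results per stream key and flush them in batches.

    A batch is flushed when it reaches max_batch rows or when its oldest row
    has waited timeout_s seconds; priority rows bypass the buffer entirely.
    """

    def __init__(self, sink, max_batch=3, timeout_s=5.0, clock=time.monotonic):
        self.sink = sink            # callable(key, rows): the storage step
        self.max_batch = max_batch
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the timeout can be tested
        self.buffers = {}           # key -> (time of first row, list of rows)

    def add(self, key, row, priority=False):
        if priority:
            # Data with a priority flag goes straight to the storage step.
            self.sink(key, [row])
            return
        if key not in self.buffers:
            self.buffers[key] = (self.clock(), [])
        rows = self.buffers[key][1]
        rows.append(row)
        if len(rows) >= self.max_batch:
            self._flush(key)

    def tick(self):
        """Flush every buffer whose oldest row has exceeded the timeout."""
        now = self.clock()
        expired = [k for k, (started, _) in self.buffers.items()
                   if now - started >= self.timeout_s]
        for key in expired:
            self._flush(key)

    def _flush(self, key):
        _, rows = self.buffers.pop(key)
        self.sink(key, rows)
```

Sending one batch per key instead of one write per record is what reduces the number of round trips to the storage medium; the count and timeout thresholds correspond to the runtime values that the device reads from its configuration.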

Embodiment 2

Referring to Fig. 2, the large-scale data processing method based on a stream processing framework of this embodiment includes the following steps:

Step 1: the topology construction module builds a data processing topology on the stream processing framework from an XML configuration file and establishes the connections between the data processing topology and the data source and storage medium;

Preferably, the data source is message middleware or a persistent storage medium.

Preferably, the message middleware includes Kafka, for caching the raw data, and Redis, for caching the logic configuration files; the persistent storage medium includes the relational database MySQL, the distributed file storage database MongoDB and the index Solr.

Preferably, the storage logic includes the database instance, table name, insertion mode and insertion fields.

Note that the above is only a preferred embodiment of the invention. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the inventive concept; the scope of the invention is determined by the appended claims.

Claims

1. A large-scale data processing device based on a stream processing framework, characterized by including:

A topology construction module, for building a data processing topology from an XML configuration file and establishing the connections between the data processing topology and the data source and storage medium;

A data reading module, for reading tagged raw data from the data source and loading the logic configuration file corresponding to the tag, to obtain data with the logic configuration attached; the logic configuration file includes processing logic, a processing method and storage logic;

A data processing module, for receiving the data with the logic configuration attached, dynamically invoking the corresponding processing method according to the processing logic in the logic configuration, generating processing results, and splitting them into streams according to the storage logic;

An aggregation module, for receiving the split processing results and aggregating them to obtain an aggregated result;

A storage module, for receiving the aggregated result and storing it in the specified storage medium according to the storage logic.

2. The large-scale data processing device based on a stream processing framework according to claim 1, characterized in that the data source is message middleware or a persistent storage medium.

3. The large-scale data processing device based on a stream processing framework according to claim 2, characterized in that the message middleware includes Kafka, for caching the raw data, and Redis, for caching the logic configuration files, and the persistent storage medium includes the relational database MySQL and the index Solr.

4. The large-scale data processing device based on a stream processing framework according to claim 1, characterized in that the storage medium also includes MongoDB.

5. The large-scale data processing device based on a stream processing framework according to claim 1, characterized in that the storage logic includes the database instance, table name, insertion mode and insertion fields.
6. A large-scale data processing method based on a stream processing framework, characterized by including the following steps:

Step 1: the topology construction module builds a data processing topology from an XML configuration file and establishes the connections between the data processing topology and the data source and storage medium;

Step 2: the data reading module reads tagged raw data from the data source, loads the logic configuration file corresponding to the tag, and sends the resulting data, with the logic configuration attached, to the data processing module; the logic configuration file includes processing logic, a processing method and storage logic;

Step 3: the data processing module receives the data with the logic configuration attached, dynamically invokes the corresponding processing method according to the processing logic in the logic configuration, generates processing results, and splits them into streams according to the storage logic;

Step 4: the aggregation module receives the split processing results, aggregates them, and sends the aggregated result to the storage module;

Step 5: the storage module receives the aggregated result and stores it in the specified storage medium according to the storage logic.

7. The large-scale data processing method based on a stream processing framework according to claim 6, characterized in that the data source is message middleware or a persistent storage medium.

8. The large-scale data processing method based on a stream processing framework according to claim 6, characterized in that the message middleware includes Kafka, for caching the raw data, and Redis, for caching the logic configuration files, and the persistent storage medium includes the relational database MySQL and the index Solr.

9. The large-scale data processing method based on a stream processing framework according to claim 8, characterized in that the storage medium also includes MongoDB.

10. The large-scale data processing device based on a stream processing framework according to claim 6, characterized in that the storage logic includes the database instance, table name, insertion mode and insertion fields.