CN107301094A

CN107301094A - Dynamic adaptive data model for large-scale dynamic transaction queries

Info

Publication number: CN107301094A
Application number: CN201710325734.0A
Authority: CN
Inventors: 郭蒙雨; 康宏; 袁晓洁
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2017-05-10
Filing date: 2017-05-10
Publication date: 2017-10-27

Abstract

The present invention relates to a method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries, comprising the following steps: collecting data in real time from data sources such as the console, RPC, text, tail, log systems, and exec; under high throughput, regulating the speeds of data acquisition and data processing in the real-time scenario, reducing the latency with which the system handles large-scale dynamic workloads, and ensuring the stability of the system; processing each database query request in the workload, extracting the effective partition information, and obtaining a real-time data model; continuously processing the data in the workload, where the number of processing units can be adjusted dynamically according to the scale of the workload and multiple processing units can work in parallel; and writing the results to a distributed file system and storing them in a MySQL database. The invention uses a streaming framework, allocates resources reasonably in a distributed cluster, and improves robustness.

Description

Dynamic adaptive data model for large-scale dynamic transaction queries

Technical field

The present invention relates to a method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries, and more particularly to a system for constructing such a data model.

Background art

In cloud computing environments oriented to big data, massive volumes of data are generated rapidly, and interactions between users and applications, and between applications themselves, are becoming more and more frequent. User requirements are increasingly personalized and real-time. Large-scale OLAP (On-Line Analytical Processing) and OLTP (On-Line Transaction Processing) applications therefore need to process workloads immediately.

Summary of the invention

The technical problem to be solved by the invention is to provide a method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries, together with a system implementation based on the Storm streaming framework.

The technical solution by which the invention solves the above technical problem is as follows. A method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries comprises the following steps:

Step 1: Data is collected in real time from data sources such as the console, RPC, text, tail, log systems, and exec;

Step 2: Under high throughput, the speeds of data acquisition and data processing in the real-time scenario are regulated, reducing the latency with which the system handles large-scale dynamic workloads and ensuring the stability of the system;

Step 3: Each database query request in the workload is processed, the effective partition information is extracted, and a real-time data model is obtained;

Step 4: The data in the workload is processed continuously. The number of processing units can be adjusted dynamically according to the scale of the workload, and multiple processing units can work in parallel;

Step 5: The results are written to a distributed file system and stored in a MySQL database.

The beneficial effects of the invention are as follows. A method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries is proposed that is combined with a streaming framework. Partition information is mapped by constructing an incidence matrix, and the horizontal scaling mechanism of the streaming framework provides high scalability and adaptability to high throughput. Experimental results show that the algorithm is an effective means of real-time data partitioning for large-scale, dynamic workloads in a big data environment.

On the basis of the above technical solution, the invention can be further improved as follows.

Further, step 3 further comprises: using the parallel computation mechanism of the streaming framework to reduce the time complexity. When the degree of association between each attribute pair of the incidence matrix M is calculated, the calculation of each row is assigned to a different computing unit of the streaming framework and the rows are computed simultaneously; all intermediate results are then added together to obtain the final result.

The beneficial effect of this further scheme is that the time complexity is reduced to O(1), which improves the execution efficiency of the data partitioning algorithm.
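The row-wise parallel calculation described above can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the source: the degree of association between two attributes is taken to be the number of queries that access both, and a standard-library thread pool stands in for the computing units of the streaming framework. The names `compute_row` and `incidence_matrix` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_row(i, attrs, queries):
    """Row i of M: for each attribute j, count the queries touching both i and j.

    Assumed measure (not specified in the source): co-access count.
    Each query adds its contribution to the row, so the row is a sum of
    per-query intermediate results.
    """
    a = attrs[i]
    row = [0] * len(attrs)
    for q in queries:
        if a in q:
            for j, b in enumerate(attrs):
                if b in q:
                    row[j] += 1
    return row


def incidence_matrix(attrs, queries, workers=4):
    """Compute every row as a separate task, then assemble the rows into M."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: compute_row(i, attrs, queries),
                             range(len(attrs))))


attrs = ["id", "name", "price"]
queries = [{"id", "name"}, {"id", "price"}, {"id", "name"}]
M = incidence_matrix(attrs, queries)
```

The thread pool only shows the structure of the decomposition; in a real deployment each row task would run on a separate computing unit of the streaming framework.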

Further, a system for constructing a dynamic adaptive data model for large-scale dynamic transaction queries comprises a data access module, a throughput adjustment module, a data processing module, a horizontal scaling module, and a data storage module;

The data access module collects streaming data and adapts to high throughput. It collects data in real time from data sources such as the console, RPC, text, tail, log systems, and exec, and provides real-time data for further processing by the streaming framework;

The throughput adjustment module: in a big data streaming computation environment, the data acquisition speed and the data processing speed are not necessarily synchronized. Under high throughput, the throughput adjustment module regulates the speeds of data acquisition and data processing in the real-time scenario, reducing the latency with which the system handles large-scale dynamic workloads and ensuring the stability of the system;

The data processing module processes each database query request in the workload and obtains a real-time data model. It preprocesses the input workload and extracts the effective partition information. It has multiple processing units, which enables parallel processing and reduces time complexity;

The horizontal scaling module: with big data, the data scale exceeds the processing capacity of a single machine. Facing large-scale loads, the horizontal scaling module can flexibly scale out by adding processing units, increasing the parallelism of the algorithm and reducing algorithm complexity;

The data storage module persists the partitioning results. It writes the partitioning results to a distributed file system and stores them in a MySQL database, so that these real-time results are available for further research and computation.
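The rate mismatch handled by the throughput adjustment module described above can be illustrated with a bounded buffer between a producer and a consumer. This is a minimal Python sketch using only the standard library; it is not the mechanism of any particular middleware, and the function name `run_pipeline` and the buffer capacity are hypothetical.

```python
import queue
import threading


def run_pipeline(records, capacity=8):
    """Push records through a bounded buffer; return what the consumer processed."""
    buf = queue.Queue(maxsize=capacity)  # put() blocks while the buffer is full
    processed = []
    sentinel = object()                  # marks the end of the stream

    def producer():
        for r in records:
            buf.put(r)                   # acquisition slows down if processing lags
        buf.put(sentinel)

    def consumer():
        while True:
            item = buf.get()
            if item is sentinel:
                break
            processed.append(item.upper())  # placeholder for real processing

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


result = run_pipeline([f"select {i}" for i in range(100)])
```

Because the buffer is bounded, a fast producer is held back to the pace of the consumer instead of exhausting memory, which is one simple way to keep acquisition and processing speeds aligned.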

The beneficial effect of this further scheme is that it addresses the timeliness problem of data modeling for large-scale, dynamic, unknown workloads in a big data environment. Solving this problem requires combining data model construction technology with a streaming computation framework, and a data model construction scheme and related system based on a streaming framework are therefore proposed.

Further, the system for constructing a dynamic adaptive data model for large-scale dynamic transaction queries is characterized in that:

1) Dynamic adaptive data model construction: a partitioning strategy generation and dynamic update module updates the partitioning strategy dynamically after each round of data processing;

2) Fault-tolerance management: the fault-tolerance check mechanism of the streaming framework is used to implement fault-tolerance management. For example, Kafka is used to implement data replay: the stream data is retained in the system for a period of time, so that when an error occurs during data processing, transmission can be restarted from a given point;

3) Reliability: the data access module captures data dynamically and, through throughput adjustment, ensures the stability of system processing under high throughput. The throughput adjustment module handles unknown data through scheduling adaptation and load balancing, and the data model can be adjusted dynamically as the workload changes;

4) Horizontal scaling: when facing large-scale, dynamic loads, the horizontal scaling module adds data processing units, giving the system high scalability and high availability.
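The data replay idea in item 2) above can be illustrated with a small in-memory log. This is a minimal Python sketch, not the API of Kafka or of any streaming framework. The class name `ReplayLog` is hypothetical, and retention is counted in records here for simplicity, whereas the text describes retention for a period of time.

```python
class ReplayLog:
    """Append-only log that keeps the last `retention` records for replay."""

    def __init__(self, retention=1000):
        self.retention = retention
        self.records = []      # records still inside the retention window
        self.base_offset = 0   # offset of self.records[0]

    def append(self, record):
        """Store a record and return its offset."""
        self.records.append(record)
        if len(self.records) > self.retention:
            self.records.pop(0)          # drop the oldest retained record
            self.base_offset += 1
        return self.base_offset + len(self.records) - 1

    def replay(self, offset):
        """Return every retained record from `offset` onward."""
        if offset < self.base_offset:
            raise ValueError("offset is older than the retention window")
        return self.records[offset - self.base_offset:]


log = ReplayLog(retention=3)
offsets = [log.append(f"q{i}") for i in range(5)]
recovered = log.replay(3)   # reprocess after a failure at offset 3
```

A consumer that remembers the offset of the last record it processed successfully can call `replay` with the next offset after a failure, provided that offset is still inside the retention window.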

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the method of the invention;

Fig. 2 is a structural diagram of the system of the invention.

Description of reference numerals: 1 - data access module; 2 - throughput adjustment module; 3 - data processing module; 4 - horizontal scaling module; 5 - data storage module.

Detailed description of the embodiments

The principles and features of the invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

Fig. 1 is a flow chart of the steps of the method of the invention, and Fig. 2 is a structural diagram of the system of the invention.

Embodiment 1

A method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries comprises the following steps:

Step 1: Data collection is implemented with Flume. Flume, provided by Cloudera, is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive volumes of log data, and it can collect data continuously from different data sources. A data generator is built that produces log files in real time, and the log files are used as the data source for data acquisition;

Step 2: Kafka is used as middleware to handle high throughput in the real-time scenario. It regulates the throughput and adapts to dynamic changes in the load;

Step 3: The load is preprocessed and the partitioning algorithm is run to obtain a real-time partitioning scheme. For the data processing, Storm provides an API: only the Spout and Bolt functions need to be customized and the direction of data flow between the Bolts specified. Executing the resulting dataflow then performs the real-time computation over the streaming big data;

Step 3 further comprises: using the parallel computation mechanism of the streaming framework to reduce the time complexity. When the degree of association between each attribute pair of the incidence matrix M is calculated, the calculation of each row is assigned to a different computing unit of the streaming framework and the rows are computed simultaneously; all intermediate results are then added together to obtain the final result.

This stage extracts the partition information from the workload and performs the statistical calculations. The input to this stage is the large-scale, dynamic, unknown workload from step 1. The real-time processing of the streaming framework ensures that unknown stream data is processed in time, and load mapping yields an incidence matrix that contains the partition information.
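The dataflow of step 3 (a source emitting queries, followed by processing stages that extract attribute information and fold it into an incidence matrix) can be illustrated with plain Python generators. This sketch does not use the Storm API. The function names are hypothetical, the toy parser assumes queries of the form `SELECT cols FROM table`, and the degree of association is assumed to be a co-access count, which the source does not specify.

```python
import re
from collections import Counter
from itertools import combinations


def query_spout(lines):
    """Source stage: emit non-empty query strings one at a time."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def parse_bolt(stream):
    """Processing stage 1: extract the set of column names from each query."""
    pattern = re.compile(r"select\s+(.+?)\s+from\s+\w+", re.IGNORECASE)
    for q in stream:
        m = pattern.search(q)
        if m:
            yield {c.strip().lower() for c in m.group(1).split(",")}


def matrix_bolt(stream):
    """Processing stage 2: fold each query into a sparse co-access matrix."""
    matrix = Counter()
    for attrs in stream:
        for a in attrs:
            matrix[(a, a)] += 1                      # diagonal: access count
        for a, b in combinations(sorted(attrs), 2):
            matrix[(a, b)] += 1                      # keys kept in sorted order
    return matrix


log_lines = [
    "SELECT id, name FROM users",
    "",
    "select id, price from users",
    "SELECT name, id FROM users",
]
matrix = matrix_bolt(parse_bolt(query_spout(log_lines)))
```

Because every query updates the matrix incrementally, the matrix always reflects the workload seen so far, which is what allows the partitioning scheme to be revised as the workload changes.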

Step 4: Computing tasks in Storm can run in parallel across multiple threads, processes, and servers. In addition, Zookeeper provides a distributed coordination service, so horizontal scaling can be carried out flexibly by adding physical nodes.

When a large amount of data arrives, multiple processes can be started on one machine, or multiple physical nodes can be added, to increase the number of processing units. This increases the parallelism of system processing, achieves horizontal scaling, and reduces processing time;

Step 5: The data storage module is implemented with a MySQL database. A MySQL interface is implemented in Storm, and the partitioning results are saved to the MySQL database for storage.
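The persistence in step 5 can be illustrated with a minimal Python sketch. To keep the example runnable without a database server, the standard-library `sqlite3` module stands in for MySQL. The table name `partition_result`, its columns, and the function `save_partitions` are hypothetical and do not come from the source.

```python
import sqlite3


def save_partitions(conn, partitions):
    """Persist (table, column, partition id) rows, overwriting older assignments."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS partition_result ("
        "table_name TEXT, column_name TEXT, partition_id INTEGER, "
        "PRIMARY KEY (table_name, column_name))"
    )
    # SQLite upsert syntax; a stand-in for the equivalent MySQL statement.
    conn.executemany(
        "INSERT OR REPLACE INTO partition_result VALUES (?, ?, ?)",
        partitions,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
save_partitions(conn, [("users", "id", 0), ("users", "name", 0), ("users", "bio", 1)])
save_partitions(conn, [("users", "bio", 0)])  # a later result replaces the old one
rows = conn.execute(
    "SELECT column_name, partition_id FROM partition_result ORDER BY column_name"
).fetchall()
```

`INSERT OR REPLACE` is specific to SQLite; with MySQL the same effect is usually obtained with `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`.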

A system for constructing a dynamic adaptive data model for large-scale dynamic transaction queries comprises a data access module 1, a throughput adjustment module 2, a data processing module 3, a horizontal scaling module 4, and a data storage module 5;

The data access module (1) collects streaming data and adapts to high throughput. It collects data in real time from data sources such as the console, RPC, text, tail, log systems, and exec, and provides real-time data for further processing by the streaming framework;

The throughput adjustment module (2): in a big data streaming computation environment, the data acquisition speed and the data processing speed are not necessarily synchronized. Under high throughput, the throughput adjustment module regulates the speeds of data acquisition and data processing in the real-time scenario, reducing the latency with which the system handles large-scale dynamic workloads and ensuring the stability of the system;

The data processing module (3) processes each database query request in the workload and obtains a real-time data model. It preprocesses the input workload and extracts the effective partition information. It has multiple processing units, which enables parallel processing and reduces time complexity;

The horizontal scaling module (4): with big data, the data scale exceeds the processing capacity of a single machine. Facing large-scale loads, the horizontal scaling module can flexibly scale out by adding processing units, increasing the parallelism of the algorithm and reducing algorithm complexity;

The data storage module (5) persists the partitioning results. It writes the partitioning results to a distributed file system and stores them in a MySQL database, so that these real-time results are available for further research and computation.
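The behavior described for the horizontal scaling module (4), in which the number of processing units follows the scale of the workload, can be illustrated with a minimal Python sketch. The per-unit capacity of 1000 queries and the cap of 16 units are hypothetical values chosen for illustration, the function names are hypothetical, and a thread pool stands in for the processing units.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def units_needed(workload_size, per_unit_capacity=1000, max_units=16):
    """Processing units for a workload: at least 1, at most max_units."""
    return max(1, min(max_units, math.ceil(workload_size / per_unit_capacity)))


def process_in_parallel(queries, per_unit_capacity=1000):
    """Split the workload across the chosen number of units and sum the partials."""
    n = units_needed(len(queries), per_unit_capacity)
    chunks = [queries[i::n] for i in range(n)]   # round-robin assignment
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Placeholder work: each unit totals the length of its queries.
        partials = pool.map(lambda chunk: sum(len(q) for q in chunk), chunks)
        return n, sum(partials)


units, total = process_in_parallel(["ab"] * 2500)
```

In a real cluster the unit count would translate into additional processes or physical nodes rather than threads; the sketch only shows how the count can be derived from the workload size.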

The foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries, characterized in that it comprises the following steps:

Step 2: Under high throughput, the speeds of data acquisition and data processing in the real-time scenario are regulated, reducing the latency with which the system handles large-scale dynamic workloads and ensuring the stability of the system;

Step 3: Each database query request in the workload is processed, the effective partition information is extracted, and a real-time data model is obtained;

Step 4: The data in the workload is processed continuously. The number of processing units can be adjusted dynamically according to the scale of the workload, and multiple processing units can work in parallel;

2. The method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries according to claim 1, characterized in that step 3 further comprises: using the parallel computation mechanism of the streaming framework to reduce the time complexity. When the degree of association between each attribute pair of the incidence matrix M is calculated, the calculation of each row is assigned to a different computing unit of the streaming framework and the rows are computed simultaneously; all intermediate results are then added together to obtain the final result.

3. The method for constructing a dynamic adaptive data model for large-scale dynamic transaction queries according to any one of claims 1 to 2, characterized by: dynamic incremental updating; handling of unknown workloads; and real-time processing, in which the parallel computation mechanism of the streaming framework improves execution efficiency. Horizontal scaling and adaptability to high throughput: WSPA combines the algorithm processing with the streaming framework and uses the horizontal scaling mechanism of the framework, so that when large-scale, dynamic workloads are processed, horizontal scaling can be achieved flexibly by adding physical nodes. In addition, by combining with data access components such as Flume and Kafka, the algorithm still performs well when facing high-throughput workloads.

4. A system for constructing a dynamic adaptive data model for large-scale dynamic transaction queries, characterized in that it comprises a data access module (1), a throughput adjustment module (2), a data processing module (3), a horizontal scaling module (4), and a data storage module (5);

The data access module (1) collects streaming data and adapts to high throughput. It collects data in real time from data sources such as the console, RPC, text, tail, log systems, and exec, and provides real-time data for further processing by the streaming framework;

The throughput adjustment module (2): in a big data streaming computation environment, the data acquisition speed and the data processing speed are not necessarily synchronized. Under high throughput, the throughput adjustment module regulates the speeds of data acquisition and data processing in the real-time scenario, reducing the latency with which the system handles large-scale dynamic workloads and ensuring the stability of the system;

The data processing module (3) processes each database query request in the workload and obtains a real-time data model. It preprocesses the input workload and extracts the effective partition information. It has multiple processing units, which enables parallel processing and reduces time complexity;

The horizontal scaling module (4): with big data, the data scale exceeds the processing capacity of a single machine. Facing large-scale loads, the horizontal scaling module can flexibly scale out by adding processing units, increasing the parallelism of the algorithm and reducing algorithm complexity;

The data storage module (5) persists the partitioning results. It writes the partitioning results to a distributed file system and stores them in a MySQL database, so that these real-time results are available for further research and computation.

5. The system for constructing a dynamic adaptive data model for large-scale dynamic transaction queries according to claim 4, characterized in that:

1) Dynamic adaptive data model construction: a partitioning strategy generation and dynamic update module updates the partitioning strategy dynamically after each round of data processing;

2) Fault-tolerance management: the fault-tolerance check mechanism of the streaming framework is used to implement fault-tolerance management. For example, Kafka is used to implement data replay: the stream data is retained in the system for a period of time, so that when an error occurs during data processing, transmission can be restarted from a given point;

3) Reliability: the data access module captures data dynamically and, through throughput adjustment, ensures the stability of system processing under high throughput. The throughput adjustment module handles unknown data through scheduling adaptation and load balancing, and the data model can be adjusted dynamically as the workload changes;

4) Horizontal scaling: when facing large-scale, dynamic loads, the horizontal scaling module adds data processing units, giving the system high scalability and high availability.