CN108182213A - A data processing optimization device and method based on a distributed system - Google Patents

A data processing optimization device and method based on a distributed system

Info

Publication number
CN108182213A
CN108182213A (application CN201711382011.0A)
Authority
CN
China
Prior art keywords
local cache
data
caching
cluster
distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711382011.0A
Other languages
Chinese (zh)
Inventor
黄晓伟
肖万明
余涵
叶承坤
高建国
Current Assignee
FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Original Assignee
FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Priority date
Filing date
Publication date
Application filed by FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Priority to CN201711382011.0A
Publication of CN108182213A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24552 - Database cache management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/50 - Network services
    • H04L67/56 - Provisioning of proxy services
    • H04L67/568 - Storing data temporarily at an intermediate stage, e.g. caching

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a data processing optimization device based on a distributed system, comprising a distributed cache cluster, a computing cluster, and a local cache master controller. The distributed cache cluster stores the full set of cached information and is deployed separately from the computing cluster. The computing cluster comprises two or more compute nodes, each containing a local cache, a caching agent, and computing units. The local cache abstracts API operations, has built-in capacity-extension and quota-management functions, is split by business need into multiple in-memory shards whose capacities can be dynamically expanded and quota-managed individually, and is supplied to the computing units as a jar package. The caching agent module monitors the local cache's in-memory shards on its server and synchronizes cached data online. The local cache master controller centrally manages the caches of all server nodes. By merging a distributed cluster cache with local caches, the device makes microsecond-level processing of operations such as association matching and data filtering over massive data possible.

Description

A data processing optimization device and method based on a distributed system
Technical field
The present invention relates to the field of big data processing in computer information technology, and in particular to a data processing optimization device and method based on a distributed system.
Background technology
With the rapid development of the Internet era, people's lives have changed enormously. People work, study, and live on the Internet, and the speed at which data is generated and shared has grown exponentially, causing a sharp increase in data volume. Because the sources and types of this data have become complex and varied, and the volume is enormous, it differs greatly from what traditional data processing methods were designed for.
In traditional data processing, the volume of data to be stored, processed, and analyzed is relatively small, and a relational database can process it efficiently. For massive data, however, traditional techniques can no longer meet current processing demands, so the industry generally uses distributed computing technologies (such as Hadoop, Storm, or Spark) and performs data preprocessing, also called Data Preparation, before in-depth analysis and mining of the massive data.
Data Preparation typically exhibits the following characteristics: (1) The source data volume is large, consisting mainly of signaling (sensor signaling, network-element signaling, etc.) or log information (e-commerce access records, consumption records, etc.). (2) The computing cluster's data throughput is high: the average preprocessing time per record is usually required to reach the tens-of-microseconds level (more than 50 MB of data throughput per second per server). (3) In stream processing, the timeliness requirements on the whole pipeline of data processing and analysis are high; responses must usually be at the second or even millisecond level. This mainly applies to time-sensitive application fields such as real-time marketing, market analysis, and location tracking. (4) Data types are numerous and the information is incomplete, requiring preprocessing such as association-based completion and format standardization. Data for the same analysis subject (a user, user group, etc.) is generated sequentially, so the associated cache is used many times. (5) Some applications need to filter the data to obtain the records relevant to the analysis subjects; the subject information is on the order of millions to tens of millions of entries and requires a dedicated cache. (6) The cached data consists mainly of dimension tables and analysis-subject information; it is relatively stable, does not need real-time updates, and typically has a life cycle of a day or an hour.
Given the features above, Data Preparation over massive data inevitably combines a computing framework with caching technology to complete work such as information completion and transformation. The following schemes are generally used:
Scheme one: each computing unit loads the cache directly, as shown in Figure 1. In a distributed data-processing framework these computing units are individual workers or containers. In batch processing, the computing units reload the cached information for every batch of data, and the cache cannot be shared between computing units. In stream processing the cache only needs to be loaded once, but it still cannot be shared between computing units. This scheme is therefore mainly suitable for small caches; otherwise it wastes computing and memory resources.
Scheme two: a distributed cache is co-deployed with the computing framework, as shown in Figure 2. Introducing a distributed cache (such as Redis or Memcached) solves the cache-capacity limit, but the distributed cache framework consumes computing resources in addition to memory, competing with the computing framework and, in severe cases, degrading preprocessing performance. Moreover, part of the cache still requires cross-node access, which has the same performance problems as in scheme three. Some commercial distributed caches (such as Coherence) automatically migrate cached information to the compute node itself or a neighboring node, which alleviates the cross-node access problem to some extent, but the resource-competition problem remains. This scheme is therefore mainly suitable when the data being processed and the cached data can reside on the same node, i.e., when the processed data can be partitioned by modulus or by service area.
Scheme three: the distributed cache is deployed independently of the computing framework, as shown in Figure 3. This solves the cache-capacity and resource-competition problems, and when the throughput requirement is modest it is a fairly suitable approach. When throughput is high, however, a single query takes milliseconds under current hardware, which cannot meet the required per-second processing throughput. Concurrent multithreaded queries can reduce the average access time of a single query, but the overall CPU load becomes high. Meanwhile, a distributed cache (taking Redis as an example) uses a single-threaded, multiplexed I/O model, so queuing easily occurs when hot data is accessed, causing query performance to drop sharply; it is common for one logical CPU core to run at 100% load while the others sit relatively idle.
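To make the throughput argument above concrete, the following sketch computes the per-record time budget implied by the stated target (over 50 MB of data per second per server) and compares it with a millisecond-level remote query. The 500-byte record size and the 1 ms remote latency are illustrative assumptions, not figures from the patent.

```java
// Illustrative arithmetic only: per-record time budget vs. remote-lookup latency.
public class ThroughputBudget {
    /** Per-record time budget in microseconds for a given byte throughput and record size. */
    public static double budgetMicros(double bytesPerSecond, double recordBytes) {
        return recordBytes / bytesPerSecond * 1e6;
    }

    public static void main(String[] args) {
        // 50 MB/s target, assumed 500-byte records -> 10 us budget per record.
        double budget = budgetMicros(50e6, 500);
        double remoteQueryMicros = 1000.0; // assumed ~1 ms single networked cache query
        System.out.printf("budget: %.1f us/record%n", budget);
        System.out.printf("a 1 ms remote query is %.0fx over budget%n", remoteQueryMicros / budget);
    }
}
```

Even before hot-key queuing is considered, a millisecond-level remote lookup per record overshoots this budget by two orders of magnitude, which is why the invention moves the hot working set into a local cache.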
Invention content
The object of the present invention is to provide a high-throughput, highly real-time data processing optimization device based on a distributed system that solves the problem of large memory waste in existing approaches.
To achieve this goal, the present invention adopts the following technical solution:
A data processing optimization device based on a distributed system comprises a distributed cache cluster, a computing cluster, and a local cache master controller. The distributed cache cluster comprises two or more cache nodes, stores the full set of information, and is separated from the computing cluster;
The computing cluster comprises two or more compute nodes, each containing a local cache, a caching agent, and computing units. The local cache abstracts API operations, has built-in capacity-extension and quota-management functions, is split by business need into multiple in-memory shards with dynamic capacity expansion and quota management per shard, and is supplied to the computing units as a jar package. The caching agent module monitors the local cache's in-memory shards on its server and synchronizes cached data online;
The local cache master controller centrally manages the caches of all server nodes, provides a unified external interface for service operations and memory monitoring, and implements the life-cycle management of the local cache.
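As a rough illustration of the local-cache behavior just described (named in-memory shards created per business need, each with a quota that can be expanded at runtime), the sketch below models that interface in plain Java. All class and method names are invented for illustration and are not the patent's actual jar-package API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the sharded, quota-managed local cache.
public class LocalCacheSketch {
    static final class Shard {
        final Map<String, String> entries = new HashMap<>();
        int quota; // maximum number of entries allowed in this shard
        Shard(int quota) { this.quota = quota; }
    }

    private final Map<String, Shard> shards = new HashMap<>();

    /** Create a named shard with an initial quota (one shard per business domain). */
    public void createShard(String name, int quota) {
        shards.put(name, new Shard(quota));
    }

    /** Grow a shard's quota at runtime (the "dynamic expansion" in the text). */
    public void expandShard(String name, int newQuota) {
        Shard s = shards.get(name);
        s.quota = Math.max(s.quota, newQuota);
    }

    /** Insert a key; returns false when the shard's quota is exhausted. */
    public boolean put(String shard, String key, String value) {
        Shard s = shards.get(shard);
        if (s.entries.size() >= s.quota && !s.entries.containsKey(key)) return false;
        s.entries.put(key, value);
        return true;
    }

    /** Query; null means the caller must fall back to the distributed cache cluster. */
    public String get(String shard, String key) {
        Shard s = shards.get(shard);
        return s == null ? null : s.entries.get(key);
    }

    /** Drop a whole shard (used by the caching agent during cleanup). */
    public void deleteShard(String name) { shards.remove(name); }
}
```

A failed `put` is the signal that, in the method described later, triggers either eviction of old entries or expansion of the shard.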
Wherein, the distributed cache cluster is built on Redis or Memcached and can be scaled linearly.
Wherein, the local cache is Java off-heap memory holding kv data structures and supports cross-JVM access.
Wherein, the API operations include querying, creating caches, and adding, deleting, and modifying caches.
Wherein, the monitoring work includes cleaning and deletion.
Wherein, the life-cycle management of the local cache comprises the following steps:
S01: an external application periodically processes the cached information in the distributed cache cluster;
S02: after processing completes, the external application notifies the local cache master controller, which maps the processed cache information in the distributed cache cluster to the corresponding in-memory shard name in the local cache;
S03: the local cache master controller notifies the caching agent on each compute node to perform the same processing operation;
S04: each caching agent completes the shard cleaning, deletion, or online synchronization of cached data in its local cache and feeds the completion information back to the local cache master controller;
S05: the local cache master controller records the processed state and generates an alarm log if any anomaly occurred.
Wherein, the processing includes updating and status monitoring.
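The steps S01-S05 can be sketched as a small coordination loop: after an external refresh of the distributed cache, a master controller maps the cache name to a local shard name, fans the operation out to every node's caching agent, and logs an alarm for any agent that reports failure. Everything here, including the shard-naming convention, is an assumed minimal model, not the patent's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the master controller's fan-out in steps S02-S05.
public class MasterControlSketch {
    /** One caching agent per compute node; true means the operation succeeded. */
    interface CachingAgent {
        boolean apply(String shardName, String operation);
    }

    private final List<CachingAgent> agents = new ArrayList<>();
    private final List<String> alarmLog = new ArrayList<>();

    public void register(CachingAgent agent) { agents.add(agent); }

    /** S02: map the distributed-cache name to a shard name; S03-S05: fan out and record. */
    public void onDistributedCacheProcessed(String cacheName, String operation) {
        String shardName = "local_" + cacheName;            // assumed naming convention
        for (CachingAgent agent : agents) {                  // S03: notify every node
            boolean ok = agent.apply(shardName, operation);  // S04: agent does the work
            if (!ok) alarmLog.add("failed: " + shardName + " " + operation); // S05
        }
    }

    public List<String> alarms() { return alarmLog; }
}
```

In this model the agents do the actual cleaning, deletion, or synchronization; the controller only tracks outcomes, matching the division of labor in S04-S05.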
The invention additionally discloses a data processing optimization method based on a distributed system, comprising the following steps:
S100: read the raw data in parallel; the raw data comprises data streams or data files;
S200: according to a data identification code, a computing unit calls the API provided by the local cache to query whether the corresponding in-memory shard contains data for that identification code;
S300: when the local cache contains a pairing code matching the data identification code, complete the raw data with the data corresponding to that identification code, finishing the data preprocessing;
S400: when the local cache contains no pairing code, obtain the pairing code matching the data identification code from the distributed cache cluster;
S500: judge whether the pairing code was successfully written into the corresponding in-memory shard of the local cache; if so, perform step S300; if not, perform step S600;
S600: judge whether the in-memory shard space of the local cache has reached a preset threshold;
S700: if so, clean old cache information out of the shard according to preset parameters and write the pairing code into the local cache again;
S800: if not, expand the shard size according to preset parameters and re-execute step S500.
Wherein, the data includes users' mobile-phone position signaling, web logs, and consumption records.
Wherein, the data identification code is an international mobile subscriber identity.
Beneficial effects of the present invention are:
The device establishes a system combining a mature distributed cluster cache with local caches. The distributed cluster cache stores the full set of cached information, its storage space can be scaled linearly, and its cache nodes are deployed in a mutual-standby manner, ensuring high availability of the cluster while improving concurrency. The local cache stores the valid cache information needed by the data computed on its own machine, so that the average performance of preprocessing operations such as association completion reaches the tens-of-microseconds level per record, while solving problems such as cross-JVM sharing of the local cache, cache-capacity limits, computing-resource competition, and network delay, providing high-throughput, highly real-time data preparation for subsequent links such as real-time analysis and data mining.
By merging the distributed cluster cache with local caches, microsecond-level processing of operations such as association matching and data filtering over massive data becomes possible; combined with a distributed computing framework, the parallel computing power of the cluster is exploited to the greatest extent, providing high-throughput, highly real-time data preparation for subsequent real-time analysis and data mining. The duration of each query during preprocessing drops from the millisecond level to the microsecond level, cache management itself consumes essentially no computing resources, the average performance of preprocessing operations such as association completion reaches the tens-of-microseconds level per record, and the preprocessing potential of each compute node exceeds 50 MB/s.
Description of the drawings
Fig. 1 is a schematic diagram of the prior-art system in which each computing unit loads the cache directly;
Fig. 2 is a structural diagram of the prior-art scheme in which a distributed cache is co-deployed with the computing framework;
Fig. 3 is a structural diagram of the prior-art scheme in which a distributed cache is deployed independently of the computing framework;
Fig. 4 is a structural diagram of the data processing optimization device based on a distributed system according to the present invention;
Fig. 5 is a flow chart of the data processing optimization method based on a distributed system according to the present invention.
Specific embodiment
The present invention is described in detail below with reference to the specific embodiments shown in the drawings. These embodiments do not limit the invention; structural, methodological, or functional transformations made by those of ordinary skill in the art according to these embodiments are all contained within the protection scope of the present invention.
As shown in Fig. 4, an embodiment of the present invention discloses a data processing optimization device based on a distributed system, comprising a distributed cache cluster, a computing cluster, and a local cache master controller;
The distributed cache cluster comprises two or more cache nodes, stores the full set of information, and is separated from the computing cluster. The distributed cache cluster can be scaled linearly according to storage requirements, is built on Redis or Memcached, and high-availability schemes such as pairwise mutual standby can be introduced as required.
The computing cluster comprises two or more compute nodes, each containing a local cache, a caching agent, and computing units. The local cache abstracts API operations, which include querying, creating caches, and adding, deleting, and modifying caches, and has built-in capacity-extension and quota-management functions; it is split by business need into multiple in-memory shards with dynamic capacity expansion and quota management per shard, and is supplied to the computing units as a jar package. The caching agent module monitors the local cache's in-memory shards on its server; the monitoring work includes cleaning, deletion, and online synchronization of cached data. The local cache uses Java off-heap memory, which has the following characteristics: it supports cross-JVM access, so once a piece of data is cached on a server, multiple computing units can access it; off-heap memory is a lightweight design that does not occupy excessive computing resources; it uses a kv data structure, consistent with the cache cluster; and it avoids the performance impact of garbage collection.
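A toy version of the off-heap kv idea described above can be written with a direct ByteBuffer: the values live outside the Java heap (so they add no garbage-collection pressure), while a small on-heap index records each value's offset and length. A real cross-JVM cache would instead back the region with a memory-mapped file so several processes can share it; that part is omitted here, and all names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy off-heap kv store: values in a direct buffer, index on heap.
public class OffHeapKv {
    private final ByteBuffer store;                             // off-heap region
    private final Map<String, long[]> index = new HashMap<>();  // key -> {offset, length}

    public OffHeapKv(int capacityBytes) {
        store = ByteBuffer.allocateDirect(capacityBytes);
    }

    /** Append the value off-heap; returns false when capacity is exhausted. */
    public boolean put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (store.remaining() < bytes.length) return false;
        int offset = store.position();
        store.put(bytes);
        index.put(key, new long[] { offset, bytes.length });
        return true;
    }

    /** Copy the value back out of the off-heap region; null on a miss. */
    public String get(String key) {
        long[] loc = index.get(key);
        if (loc == null) return null;
        byte[] out = new byte[(int) loc[1]];
        ByteBuffer view = store.duplicate(); // independent position, shared storage
        view.position((int) loc[0]);
        view.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }
}
```

The append-only layout also hints at why the patent's shards are cleaned or rebuilt wholesale during life-cycle operations rather than updated entry by entry.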
The local cache master controller centrally manages the caches of all server nodes, provides a unified external interface for service operations and memory monitoring, and implements the life-cycle management of the local cache.
The local cache master controller provides the life-cycle management of the local cache as well as the interface for service operations and memory monitoring. The life-cycle management comprises the following steps:
S01: an external application periodically processes the cached information in the distributed cache cluster; the period can be set at the day or hour level according to usage requirements;
S02: after processing completes, the external application notifies the local cache master controller, which maps the processed cache information in the distributed cache cluster to the corresponding in-memory shard name in the local cache;
S03: the local cache master controller notifies the caching agent on each compute node to perform the same processing operation;
S04: each caching agent completes the shard cleaning, deletion, or online synchronization of cached data in its local cache and feeds the completion information back to the local cache master controller;
S05: the local cache master controller records the processed state and generates an alarm log if any anomaly occurred.
Specifically, the local cache life-cycle management includes a local cache update operation, as follows:
S01: an external application periodically updates the cached information in the distributed cache cluster; the period can be set at the day or hour level according to usage requirements;
S02: after the update completes, the external application notifies the local cache master controller, which maps the updated cache information in the distributed cache cluster to the corresponding in-memory shard name in the local cache;
S03: the local cache master controller notifies the caching agent on each compute node to perform the same update operation;
S04: each caching agent completes the shard cleaning, deletion, or online synchronization of cached data in its local cache and feeds the completion information back to the local cache master controller;
S05: the local cache master controller records the updated state and generates an alarm log if any anomaly occurred.
Besides update operations, the local cache life-cycle management can also cover other operations such as status monitoring, which are not repeated here.
Fig. 4 shows a schematic diagram of the two-level-cache massive-data processing system of the data processing optimization device based on a distributed system described in the above embodiment. The whole system comprises the distributed cache cluster, the computing cluster, and the local cache master controller independent of both clusters. The distributed cache cluster and the computing cluster each consist of several nodes; each compute node or cache node on which a module is deployed is a PC server or similar equipment with computing or storage capability. A compute node consists of several computing units, a local cache, and a caching agent, and the local cache can be divided into several in-memory shards according to the service attributes of the derived data, each managing its cache space independently.
Fig. 5 shows a data processing optimization method based on a distributed system disclosed in an embodiment of the present invention, comprising the following steps:
S100: read the raw data in parallel; the raw data comprises data streams or data files;
S200: according to a data identification code, a computing unit calls the API provided by the local cache to query whether the corresponding in-memory shard contains data for that identification code;
S300: when the local cache contains a pairing code matching the data identification code, complete the raw data with the data corresponding to that identification code, finishing the data preprocessing;
S400: when the local cache contains no pairing code, obtain the pairing code matching the data identification code from the distributed cache cluster;
S500: judge whether the pairing code was successfully written into the corresponding in-memory shard of the local cache; if so, perform step S300; if not, perform step S600;
S600: judge whether the in-memory shard space of the local cache has reached a preset threshold;
S700: if so, clean old cache information out of the shard according to preset parameters and write the pairing code into the local cache again;
S800: if not, expand the shard size according to preset parameters and re-execute step S500.
As shown in Fig. 5, the following takes the position signaling of users' mobile phones as an example; source data such as web logs and consumption records also suit this embodiment. In mobile-phone position signaling, the raw position data contains only the user's IMSI number (International Mobile Subscriber Identification Number) and no phone number, which must be completed by association. The cache cluster stores the full set (tens of millions) of correspondences between IMSI numbers and phone numbers. Specifically, an embodiment discloses a data processing optimization method based on a distributed system, comprising the following steps:
S100: read the raw data in parallel; the raw data comprises data streams or data files;
S200: using the IMSI number as the key, a computing unit calls the API provided by the local cache to query whether the corresponding in-memory shard contains data for that IMSI number;
S300: when the local cache contains the key-value pair for the IMSI number, complete the raw data with the phone number corresponding to that IMSI number, finishing the data preprocessing;
S400: when the local cache contains no such pair, obtain the key-value pair for the IMSI number from the distributed cache cluster;
S500: judge whether the IMSI key-value pair was successfully written into the corresponding in-memory shard of the local cache; if so, perform step S300; if not, perform step S600;
S600: judge whether the in-memory shard space of the local cache has reached a preset threshold;
S700: if so, clean old cache information out of the shard according to preset parameters and write the IMSI key-value pair into the local cache again;
S800: if not, expand the shard size according to preset parameters and re-execute step S500.
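The IMSI lookup flow S200-S800 can be sketched as a two-level resolve function: consult the local shard first, fall back to the distributed cluster (modeled here as a plain function) on a miss, and evict when the shard reaches its threshold. For brevity this sketch implements only the eviction branch (S700) and skips the shard-expansion branch (S800); it is an assumed model, not the patent's code, and all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical two-level IMSI -> phone-number resolver.
public class TwoLevelLookup {
    private final Map<String, String> localShard = new HashMap<>();
    private final int threshold;                        // shard-size threshold (S600)
    private final Function<String, String> distributed; // distributed-cluster fetch (S400)

    public TwoLevelLookup(int threshold, Function<String, String> distributed) {
        this.threshold = threshold;
        this.distributed = distributed;
    }

    /** Returns the phone number for an IMSI, filling the local shard on a miss. */
    public String resolve(String imsi) {
        String phone = localShard.get(imsi);            // S200: query the local shard
        if (phone != null) return phone;                // S300: local hit, complete record
        phone = distributed.apply(imsi);                // S400: fetch from the cluster
        if (phone == null) return null;                 // unknown IMSI
        if (localShard.size() >= threshold) {           // S600: shard at threshold?
            localShard.clear();                         // S700: simplistic full eviction
        }
        localShard.put(imsi, phone);                    // retried write now succeeds
        return phone;
    }

    public int localSize() { return localShard.size(); }
}
```

A production version would evict selectively by age rather than clearing the shard, and would grow the shard (S800) before resorting to eviction; the control flow, not the policy, is the point here.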
The device establishes a system combining a mature distributed cluster cache with local caches. The distributed cluster cache stores the full set of cached information, its storage space can be scaled linearly, and its cache nodes are deployed in a mutual-standby manner, ensuring high availability of the cluster while improving concurrency. The local cache stores the valid cache information needed by the data computed on its own machine, so that the average performance of preprocessing operations such as association completion reaches the tens-of-microseconds level per record, while solving problems such as cross-JVM sharing of the local cache, cache-capacity limits, computing-resource competition, and network delay, providing high-throughput, highly real-time data preparation for subsequent links such as real-time analysis and data mining.
By merging the distributed cluster cache with local caches, microsecond-level processing of operations such as association matching and data filtering over massive data becomes possible; combined with a distributed computing framework, the parallel computing power of the cluster is exploited to the greatest extent, providing high-throughput, highly real-time data preparation for subsequent real-time analysis and data mining. The duration of each query during preprocessing drops from the millisecond level to the microsecond level, cache management itself consumes essentially no computing resources, the average performance of preprocessing operations such as association completion reaches the tens-of-microseconds level per record, and the preprocessing potential of each compute node exceeds 50 MB/s.
It should be appreciated that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is merely for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments understandable to those skilled in the art.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention and are not intended to limit its protection scope; all equivalent implementations or changes made without departing from the technical spirit of the invention shall be included in its protection scope.

Claims (10)

1. A data processing optimization device based on a distributed system, characterized in that: it comprises a distributed cache cluster, a computing cluster, and a local cache master controller; the distributed cache cluster comprises two or more cache nodes, stores the full set of information, and is separated from the computing cluster;
the computing cluster comprises two or more compute nodes, each comprising a local cache, a caching agent, and computing units; the local cache abstracts API operations, has built-in capacity-extension and quota-management functions, is split by business need into multiple in-memory shards with dynamic capacity expansion and quota management per shard, and is supplied to the computing units as a jar package; the caching agent module monitors the local cache's in-memory shards on its server and synchronizes cached data online;
the local cache master controller centrally manages the caches of all server nodes, provides a unified external interface for service operations and memory monitoring, and implements the life-cycle management of the local cache.
2. The data processing optimization device based on a distributed system according to claim 1, characterized in that: the distributed cache cluster is built on Redis or Memcached and can be scaled linearly.
3. The data processing optimization device based on a distributed system according to claim 1, characterized in that: the local cache is Java off-heap memory holding kv data structures and supports cross-JVM access.
4. The data processing optimization device based on a distributed system according to claim 1, characterized in that: the API operations include querying, creating caches, and adding, deleting, and modifying caches.
5. The data processing optimization device based on a distributed system according to claim 1, characterized in that: the monitoring work includes cleaning and deletion.
6. The data processing optimization device based on a distributed system according to claim 1, wherein the life-cycle management of the local cache comprises the following steps:
S01: the cache information in the distributed cache cluster is periodically processed by an external application;
S02: after the processing is finished, the local cache master controller is notified, and the master controller maps the processed cache information in the distributed cache cluster to the corresponding internal fragment name in the local cache;
S03: the local cache master controller notifies the caching agent on each compute node to perform the same processing operation;
S04: the caching agent completes internal-fragment cleaning, deletion or online synchronization of cached data in the corresponding local cache, and feeds the completion information back to the master controller;
S05: the local cache master controller records the processed state and generates an alarm log if any abnormality exists.
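The S01–S05 life-cycle flow above can be sketched as follows: the master controller maps a processed distributed-cache name to a local shard name, fans the operation out to the agent on every compute node, and records an alarm when a node reports an abnormal result. All names (MasterControl, CacheAgent, the "shard:" prefix) are illustrative assumptions, not from the patent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the local cache life-cycle notification flow.
public class CacheLifecycle {
    interface CacheAgent {
        // S04: clean/delete/sync the named internal fragment; true on success
        boolean process(String shardName);
    }

    public static final class MasterControl {
        private final List<CacheAgent> agents = new ArrayList<>();
        private final List<String> alarmLog = new ArrayList<>();

        void register(CacheAgent agent) { agents.add(agent); }

        // S02: map the processed distributed-cache entry to a shard name,
        // then S03: notify the agent on every compute node
        void onDistributedCacheProcessed(String cacheName) {
            String shardName = "shard:" + cacheName; // assumed naming scheme
            for (CacheAgent agent : agents) {
                boolean ok = agent.process(shardName);
                if (!ok) alarmLog.add("abnormal: " + shardName); // S05
            }
        }

        List<String> alarms() { return alarmLog; }
    }
}
```

One failed agent produces one alarm entry while the other nodes still complete, matching S05's per-node abnormality recording.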
7. The data processing optimization device based on a distributed system according to claim 6, wherein the processing includes updating and status monitoring.
8. A data processing optimization method based on a distributed system, comprising the following steps:
S100: raw data are read in parallel, the raw data comprising data streams or data files;
S200: according to a data identity code, the computing unit calls the API provided by the local cache to query whether data corresponding to the identity code exist in the corresponding internal fragment of the local cache;
S300: when a pairing code matching the identity code exists in the local cache, the data corresponding to the identity code are added to the raw data, completing the data preprocessing;
S400: when no pairing code exists in the local cache, the pairing code matching the identity code is obtained from the distributed cache cluster;
S500: whether the pairing code is successfully written into the corresponding internal fragment of the local cache is judged; if yes, step S300 is performed; if not, step S600 is performed;
S600: whether the internal-fragment space of the local cache has reached a preset threshold is judged;
S700: if yes, old cache information in the internal fragment is cleaned according to preset parameters, and the pairing code is written into the local cache again;
S800: if not, the internal-fragment size is expanded according to preset parameters, and step S500 is re-executed.
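The lookup path of steps S200–S800 can be sketched as follows: try the local shard first, fall back to the distributed cache cluster on a miss, then write the fetched pairing code back locally, cleaning the shard when it reaches a size threshold. This is a simplified single-shard sketch; the class name LookupFlow and the whole-shard `clear()` cleanup policy are illustrative assumptions, and `putRemote` merely stands in for data already loaded into the Redis/Memcached cluster.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the S200-S800 local/distributed lookup flow.
public class LookupFlow {
    private final Map<String, String> localShard = new HashMap<>();
    private final Map<String, String> distributedCache = new HashMap<>();
    private final int threshold;

    public LookupFlow(int threshold) { this.threshold = threshold; }

    // stand-in for data residing in the distributed cache cluster
    public void putRemote(String identCode, String pairingCode) {
        distributedCache.put(identCode, pairingCode);
    }

    public String lookup(String identCode) {
        String code = localShard.get(identCode);   // S200: query local shard
        if (code != null) return code;             // S300: local hit
        code = distributedCache.get(identCode);    // S400: fetch from cluster
        if (code == null) return null;             // no pairing code anywhere
        if (localShard.size() >= threshold) {      // S600: threshold reached?
            localShard.clear();                    // S700: simplified cleanup
        }
        localShard.put(identCode, code);           // S500: write back locally
        return code;
    }
}
```

The second lookup of the same identity code is served from the local shard without touching the distributed cluster, which is the latency saving the method is built around.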
9. The data processing optimization method based on a distributed system according to claim 8, wherein the data include the user's mobile-phone position signaling, web logs and consumption records.
10. The data processing optimization method based on a distributed system according to claim 7, 8 or 9, wherein the data identity code is an international mobile subscriber identity.
CN201711382011.0A 2017-12-20 2017-12-20 A kind of data processing optimization device and method based on distributed system Pending CN108182213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711382011.0A CN108182213A (en) 2017-12-20 2017-12-20 A kind of data processing optimization device and method based on distributed system

Publications (1)

Publication Number Publication Date
CN108182213A true CN108182213A (en) 2018-06-19

Family

ID=62546538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711382011.0A Pending CN108182213A (en) 2017-12-20 2017-12-20 A kind of data processing optimization device and method based on distributed system

Country Status (1)

Country Link
CN (1) CN108182213A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101146127A (en) * 2007-10-30 2008-03-19 金蝶软件(中国)有限公司 A client buffer update method and device in distributed system
US8195610B1 (en) * 2007-05-08 2012-06-05 IdeaBlade, Inc. Method and apparatus for cache management of distributed objects
CN103297485A (en) * 2012-03-05 2013-09-11 日电(中国)有限公司 Distributed cache automatic management system and distributed cache automatic management method
CN104361030A (en) * 2014-10-24 2015-02-18 西安未来国际信息股份有限公司 Distributed cache architecture with task distribution function and cache method
CN106021468A (en) * 2016-05-17 2016-10-12 上海携程商务有限公司 Updating method and system for distributed caches and local caches
CN106790705A (en) * 2017-02-27 2017-05-31 郑州云海信息技术有限公司 A kind of Distributed Application local cache realizes system and implementation method
CN107479829A (en) * 2017-08-03 2017-12-15 杭州铭师堂教育科技发展有限公司 A kind of Redis cluster mass datas based on message queue quickly clear up system and method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YOO R.M.: "Scalable MapReduce on a large-scale shared-memory system", IEEE *
宋杰 (Song Jie): "Research Progress of MapReduce Big-Data Processing Platforms and Algorithms", 《软件学报》 (Journal of Software) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110858199A (en) * 2018-08-08 2020-03-03 北京京东尚科信息技术有限公司 Document data distributed computing method and device
CN109189829A (en) * 2018-08-20 2019-01-11 广州知弘科技有限公司 Information safety system and method based on big data
CN109669737B (en) * 2018-12-19 2023-04-18 百度在线网络技术(北京)有限公司 Application processing method, device, equipment and medium
CN109669737A (en) * 2018-12-19 2019-04-23 百度在线网络技术(北京)有限公司 Application processing method, device, equipment and medium
CN112148202A (en) * 2019-06-26 2020-12-29 杭州海康威视数字技术股份有限公司 Training sample reading method and device
CN112148202B (en) * 2019-06-26 2023-05-26 杭州海康威视数字技术股份有限公司 Training sample reading method and device
CN112693502A (en) * 2019-10-23 2021-04-23 上海宝信软件股份有限公司 Urban rail transit monitoring system and method based on big data architecture
CN110795632A (en) * 2019-10-30 2020-02-14 北京达佳互联信息技术有限公司 State query method and device and electronic equipment
CN111324670A (en) * 2020-02-27 2020-06-23 中国邮政储蓄银行股份有限公司 Method and system for separate deployment of computing storage based on HDFS (Hadoop distributed File System) and Vertica
CN111538739A (en) * 2020-04-28 2020-08-14 北京思特奇信息技术股份有限公司 WSG-based automatic synchronization method and system for service gateway
CN111538739B (en) * 2020-04-28 2023-11-17 北京思特奇信息技术股份有限公司 Method and system for automatically synchronizing service gateway based on WSG
CN113032437A (en) * 2021-04-16 2021-06-25 建信金融科技有限责任公司 Caching method, caching device, caching medium and electronic equipment based on distributed database
CN113032437B (en) * 2021-04-16 2023-06-02 建信金融科技有限责任公司 Caching method and device based on distributed database, medium and electronic equipment
CN113220722A (en) * 2021-04-26 2021-08-06 深圳市云网万店科技有限公司 Data query method and device, computer equipment and storage medium
CN113127741A (en) * 2021-04-29 2021-07-16 杭州弧途科技有限公司 Cache method for reading and writing data of mass users and posts in part-time post recommendation system
CN113672583A (en) * 2021-08-20 2021-11-19 浩鲸云计算科技股份有限公司 Big data multi-data source analysis method and system based on storage and calculation separation
CN113934759A (en) * 2021-10-15 2022-01-14 东北大学 Data caching device and system for fusion calculation in Gaia system
CN113934759B (en) * 2021-10-15 2024-05-17 东北大学 Data caching device and system for fusion calculation in Gaia system
CN116028525A (en) * 2023-03-31 2023-04-28 成都四方伟业软件股份有限公司 Intelligent management method for data slicing

Similar Documents

Publication Publication Date Title
CN108182213A (en) A kind of data processing optimization device and method based on distributed system
Ju et al. iGraph: an incremental data processing system for dynamic graph
CN107315776A (en) A kind of data management system based on cloud computing
CN110727727B (en) Statistical method and device for database
CN110414771A (en) Update method, device, server and the client of enterprise organization structure data
Jeong et al. Anomaly teletraffic intrusion detection systems on hadoop-based platforms: A survey of some problems and solutions
Narkhede et al. HMR log analyzer: Analyze web application logs over Hadoop MapReduce
CN104317957B (en) A kind of open platform of report form processing, system and report processing method
CN110784498B (en) Personalized data disaster tolerance method and device
CN106815254A (en) A kind of data processing method and device
CN104850627A (en) Method and apparatus for performing paging display
CN104636286A (en) Data access method and equipment
CN105069151A (en) HBase secondary index construction apparatus and method
CN104270412A (en) Three-level caching method based on Hadoop distributed file system
CN112131305A (en) Account processing system
CN111104406A (en) Hierarchical service data storage method and device, computer equipment and storage medium
Elagib et al. Big data analysis solutions using MapReduce framework
Mukherjee Synthesis of non-replicated dynamic fragment allocation algorithm in distributed database systems
CN100485640C (en) Cache for an enterprise software system
Ravindra et al. Latency aware elastic switching-based stream processing over compressed data streams
US11609910B1 (en) Automatically refreshing materialized views according to performance benefit
CN110134698A (en) Data managing method and Related product
US20230094293A1 (en) Method and apparatus for constructing recommendation model and neural network model, electronic device, and storage medium
Mukherjee Non-replicated dynamic fragment allocation in distributed database systems
CN115587147A (en) Data processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180619