CN105843959A

CN105843959A - Bonus point calculation method and system based on processing of big data

Info

Publication number: CN105843959A
Application number: CN201610238150.5A
Authority: CN
Inventors: 张欣; 李卓; 黎育龙; 常涛
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2016-04-18
Filing date: 2016-04-18
Publication date: 2016-08-10

Abstract

The invention discloses a bonus point calculation method and system based on big data processing. The method comprises the following steps: transaction flow information is uploaded to a distributed file system through a big data synchronization tool; the uploaded transaction flow information is preprocessed to obtain effective chronological files; bonus point information in the effective chronological files is calculated to generate a bonus point result; the bonus point result is adjusted to generate adjusted effective bonus point chronological files; and the effective bonus point chronological files are read and aggregated according to different dimension categories to generate bonus point collection details. With the method and the system, mass data are stored in the distributed file system and the arithmetic logic of customer bonus points is implemented using the characteristics of distributed computing, which simplifies the calculation process and improves both the operation efficiency of customer bonus points and the timeliness of data management.

Description

Bonus point calculation method and system based on big data processing

Technical field

The present invention relates to the field of big data processing, and in particular to a bonus point calculation method and system based on big data processing.

Background technology

With the continuous expansion and deepening of banking business, the requirements that banking places on back-end systems keep rising, mainly in terms of high real-time requirements and large data volumes. This is especially true of end-of-day batch processing, and in particular of comprehensive bonus point calculation: against the current background of large data volumes, the timely completion of batch tasks, the timely updating of comprehensive bonus points, and the processing of large back-end data volumes pose new challenges for bank information technology.

Meanwhile, with the diversification of bank data sources, channels, and business services, the volume of data to be processed has changed qualitatively. At the same time, the development of Internet finance challenges banks to develop new service products according to customer demand, and this challenge creates a need for deeper mining of mass data.

In the prior art, batch calculation is generally performed as unified, centralized calculation by automated end-of-day tasks. This mode, however, has difficulty adapting to increasingly complex application requirements. In the information age, many fields need to process large amounts of information, with data volumes easily reaching the millions, and centralized computing can hardly meet these demands.

Summary of the invention

In view of the above drawbacks of the prior art, embodiments of the present invention provide a bonus point calculation method based on big data processing, which can solve the problem that, when banks currently process customer bonus points, the data volume is large, the timeliness requirements are high, and batch calculation cannot meet these requirements.

Specifically, an embodiment of the present invention provides a bonus point calculation method based on big data processing, comprising:

uploading transaction flow information to a distributed file system through a big data synchronization tool;

preprocessing the uploaded transaction flow information to obtain an effective chronological file;

calculating the bonus point information in the effective chronological file to generate a bonus point result;

adjusting the bonus point result to generate an adjusted effective bonus point chronological file; and

reading the effective bonus point chronological file, and aggregating the effective bonus point chronological file according to different dimension categories to generate bonus point collection details.

Correspondingly, an embodiment of the present invention further provides a bonus point calculation system based on big data processing, comprising:

an upload module, configured to upload transaction flow information to a distributed file system through a big data synchronization tool;

a preprocessing module, configured to preprocess the uploaded transaction flow information to obtain an effective chronological file;

a calculation module, configured to calculate the bonus point information in the effective chronological file to generate a bonus point result;

an adjustment module, configured to adjust the bonus point result to generate an adjusted effective bonus point chronological file; and

a summarizing module, configured to read the effective bonus point chronological file and aggregate the effective bonus point chronological file according to different dimension categories to generate bonus point collection details.

Embodiments of the present invention have the following beneficial effects:

Mass data are stored in a distributed file system (HDFS), and the arithmetic logic for calculating customer bonus points is implemented using the characteristics of distributed computing, which not only simplifies the calculation process but also improves the operation efficiency of customer bonus points and the timeliness of data management.

Brief description of the drawings

Fig. 1 is a schematic flow chart of a bonus point calculation method based on big data processing according to an embodiment of the present invention;

Fig. 2 is an architecture diagram of a bonus point calculation system based on big data processing according to an embodiment of the present invention;

Fig. 3 is a block diagram of the preprocessing module 200 shown in Fig. 2;

Fig. 4 is a block diagram of the calculation module 300 shown in Fig. 2;

Fig. 5 is a block diagram of the adjustment module 400 shown in Fig. 2.

Detailed description of the invention

To facilitate understanding of the various aspects, features, and advantages of the technical solution of the present invention, the invention is described in detail below with reference to the accompanying drawings. It should be understood that the following embodiments are illustrative only and are not intended to limit the scope of the invention.

First, the names and terms that may be used in the present invention are explained.

Big data: by the official definition, data sets whose volume is especially large and whose categories are especially complex, such that they cannot be stored, managed, and processed with a traditional database. The main characteristics of big data are large data volume (Volume), complex data categories (Variety), fast data processing (Velocity), and high data veracity (Veracity), collectively referred to as the 4Vs.

Centralized computing: the computing power of a single computer is strengthened by continually increasing the number of processors, thereby increasing the speed at which data is processed.

Distributed computing: a group of computers is interconnected over a network to form a distributed system; the large amount of data to be processed is then split into multiple parts, which are handed to the computers in the distributed system to be computed simultaneously, and the partial results are finally merged to obtain the final result. Although the computing power of any single computer in the distributed system is not strong, each computer processes only part of the data and many computers compute at the same time, so the distributed system as a whole can process data far faster than a single computer.

MapReduce: MapReduce is a programming model for parallel operations on large-scale data sets (larger than 1 TB). Its central concepts, "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers to run their own programs on a distributed system without knowing distributed parallel programming. In current software implementations, the programmer specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which ensures that all mapped key-value pairs sharing the same key are grouped together.

Hadoop: a distributed computing system. It uses the MapReduce distributed computing framework, provides the HDFS distributed file system modeled on GFS (Google File System), and provides the HBase data storage system modeled on BigTable. The core of Hadoop's framework design is HDFS and MapReduce: HDFS provides storage for mass data, and MapReduce provides computation over mass data.

HDFS: abbreviation of Hadoop Distributed File System, the file system implemented by Hadoop. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed in a streaming fashion.

Spark: Spark is a general-purpose parallel computing framework similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. Spark's distributed computing, which is based on the MapReduce algorithm, has the advantages of Hadoop MapReduce; unlike MapReduce, however, the intermediate output and results of a job can be kept in memory, so HDFS no longer needs to be read and written, and Spark is therefore better suited to algorithms that require iterative MapReduce, such as data mining and machine learning.

RDD: resilient distributed dataset, the main abstraction proposed by Spark. It is a fault-tolerant collection of elements that is partitioned across the nodes of a cluster and can be operated on in parallel.

Embodiment 1:

Fig. 1 is a schematic flow chart of a bonus point calculation method based on big data processing according to an embodiment of the present invention. Referring to Fig. 1, the method includes:

Process S1: uploading transaction flow information to a distributed file system through a big data synchronization tool;

Process S2: preprocessing the uploaded transaction flow information to obtain an effective chronological file;

Process S3: calculating the bonus point information in the effective chronological file to generate a bonus point result;

Process S4: adjusting the bonus point result to generate an adjusted effective bonus point chronological file;

Process S5: reading the effective bonus point chronological file, and aggregating the effective bonus point chronological file according to different dimension categories to generate bonus point collection details.

In process S1, uploading the transaction flow information to the distributed file system through the big data synchronization tool includes: uploading the transaction flow information to a NAS server through the big data synchronization tool, installing a distributed client on the NAS server, and then uploading the transaction flow information to the distributed file system through that client.
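The second hop of this upload, from the NAS server into the distributed file system, could be sketched as follows. This is only an illustration under stated assumptions: it builds (and does not run) the standard `hdfs dfs -put` shell command, whereas the patent does not name the client command actually used; the function name and both paths are hypothetical.

```python
import shlex
from typing import List


def build_hdfs_put_command(local_path: str, hdfs_dir: str) -> List[str]:
    """Build (without running) the command that copies one file into HDFS."""
    if not local_path or not hdfs_dir:
        raise ValueError("both a local path and an HDFS directory are required")
    # `hdfs dfs -put <localsrc> <dst>` is the standard HDFS shell upload command.
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


# Hypothetical paths: a flow file on the NAS mount and an HDFS input directory.
command = build_hdfs_put_command("/nas/flow/flow_20160418.txt", "/points/input/")
print(" ".join(shlex.quote(part) for part in command))
```

On a NAS server with the client installed, the resulting list could be passed to `subprocess.run(command, check=True)`.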

In addition, it should be noted that, in practical application, embodiments of the present invention may be based on the following technical content:

1) Hadoop MapReduce is a powerful tool for large-scale (TB-level) data computation. Map and Reduce are its main ideas, and the principle is as follows: Map is responsible for splitting up the data and Reduce is responsible for gathering it; the user only needs to implement the two interfaces map and reduce to complete computation over TB-level data. Common applications include data analysis applications such as log analysis and data mining. The implementation of MapReduce also uses a Master/Slave structure.

The core of the MapReduce framework consists of two parts: Map and Reduce. When a computing job is submitted to the MapReduce framework, the framework first splits the job into several Map tasks and assigns them to different nodes for execution. Each Map task processes part of the input data; when a Map task completes, it generates intermediate files, which serve as the input data of the Reduce tasks. The main goal of a Reduce task is to gather the outputs of the preceding Map tasks and output the result.
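The Map and Reduce flow described above can be illustrated with a single-process sketch. It is not Hadoop code: the record format `customer_id,amount`, the function names, and the in-memory shuffle step are assumptions made for illustration, and on a real cluster each input split would be handled by a separate Map task on a different node.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(record: str) -> List[Tuple[str, int]]:
    """Map: turn one 'customer_id,amount' record into a (key, value) pair."""
    customer_id, amount = record.split(",")
    return [(customer_id, int(amount))]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate pairs by key, standing in for the intermediate files."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: gather all values that share one key into a single result."""
    return key, sum(values)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    intermediate: List[Tuple[str, int]] = []
    for split in splits:  # one Map task per input split
        for record in split:
            intermediate.extend(map_phase(record))
    # Reduce starts only after every Map task has finished.
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


totals = run_job([["c1,100", "c2,50"], ["c1,30"]])
print(totals)
```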

To summarize the processing characteristics of Hadoop: (1) Hadoop is data-parallel and process-serial. Within a job, parallelism occurs only inside the map phase and inside the reduce phase; the two phases cannot run in parallel, and the reduce phase can start only after the map phase has completely finished. (2) All data accessed by the map process must be frozen (no modification may occur) until the whole job completes. This means that Hadoop processes data in a batch-oriented chain, which makes it unsuitable for stream-based processing, where the data stream is continuous and must be processed promptly as it arrives. (3) Communication between data is carried out through a distributed file system (HDFS). Latency arises from network I/O overhead. This latency is not a major problem in batch mode, where throughput is the overriding concern, but it means that Hadoop is not suitable for online real-time systems that have very strict latency requirements or cannot tolerate latency at all.

2) Spark is a general-purpose parallel computing framework similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. Spark's distributed computing, which is based on the MapReduce algorithm, has the advantages of Hadoop MapReduce; unlike MapReduce, however, the intermediate output and results of a job can be kept in memory, so HDFS no longer needs to be read and written, and Spark is therefore better suited to algorithms that require iterative MapReduce, such as data mining and machine learning. Spark supports three distributed deployment modes: Standalone mode, Mesos mode, and YARN mode. The first is similar to the mode used by MapReduce 1.0, with fault tolerance and resource management implemented internally. The latter two represent the trend of future development, in which part of the fault tolerance and the resource management are handed to a unified resource management system: Spark runs on a general-purpose resource management system and can thus share cluster resources with other computing frameworks such as MapReduce. The greatest benefits are lower operation and maintenance costs and higher resource utilization (resources are allocated on demand).

The bank's bonus point system currently calculates in batch mode, and Hadoop can meet the calculation demand for the time being. It is foreseeable, however, that as the data volume keeps growing, the network bandwidth consumed by continually writing files during calculation will become a major bottleneck. In addition, the importance that customers attach to bonus points creates a strong demand for real-time bonus points. In terms of implementation, Spark, unlike MapReduce, is not limited to writing the two methods map and reduce; it provides a more powerful in-memory computing model that lets users programmatically read data into the memory of the cluster and query it repeatedly and quickly, which makes it well suited to implementing machine learning algorithms.

In view of the above, Spark is introduced into bonus point calculation: by computing iteratively in memory, it processes large data volumes faster. Application development is also more convenient, and with the future build-out of its ecosystem, real-time support can be added through simple modification.

Bonus point calculation takes information such as the amount and the number of transactions in the flow records produced by customer transactions and, according to preset rule and activity information, calculates the bonus point value produced by each transaction. The cumulative bonus point totals are then calculated along each dimension from the per-transaction bonus point values. This solution is built on a complete and executable distributed cluster environment, which needs to be equipped with JDK 1.7, Scala 2.10.4, Hadoop 2.6.0, and Spark 1.1.0. Using the Spark distributed computing framework and the Java API provided by Spark, a bonus point calculation model is developed under the existing engineering framework. This effectively improves the timeliness with which customers obtain bonus points and reduces the network bandwidth bottleneck caused by reading and writing files.
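The two-stage calculation described above (a per-transaction bonus point value, then totals per dimension) might be sketched as follows. The patent does not disclose the actual rule formulas, so the rule shown here (one point per fixed amount spent), the field names, and the dimensions are hypothetical; plain Python stands in for the Spark Java API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Transaction:
    customer_id: str
    channel: str
    amount_cents: int


@dataclass(frozen=True)
class Rule:
    activity_id: str
    cents_per_point: int  # hypothetical rule: one point per this many cents


def points_for(tx: Transaction, rule: Rule) -> int:
    """Per-transaction bonus point value under the hypothetical rule."""
    return tx.amount_cents // rule.cents_per_point


def aggregate(txs: List[Transaction], rule: Rule, dimension: str) -> Dict[str, int]:
    """Sum per-transaction points along one dimension (a Transaction field name)."""
    totals: Dict[str, int] = defaultdict(int)
    for tx in txs:
        totals[getattr(tx, dimension)] += points_for(tx, rule)
    return dict(totals)


rule = Rule(activity_id="A1", cents_per_point=100)
txs = [
    Transaction("c1", "pos", 2500),
    Transaction("c1", "online", 1000),
    Transaction("c2", "pos", 999),
]
by_customer = aggregate(txs, rule, "customer_id")
by_channel = aggregate(txs, rule, "channel")
print(by_customer, by_channel)
```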

Through embodiments of the present invention, mass data can be stored in a distributed file system (HDFS), and the arithmetic logic for calculating customer bonus points is implemented using the characteristics of distributed computing, which not only simplifies the calculation process but also improves the operation efficiency of customer bonus points and the timeliness of data management.

Embodiment 2:

In another embodiment of the present invention, in addition to processes S1 to S5 above, preprocessing the uploaded transaction flow information in process S2 further includes: reading the transaction flow information from the distributed file system, and performing flow validity verification, blacklist filtering, and rule matching on the transaction flow information.

In practical application, the following technical steps may be used. First, the flow file on HDFS is read and converted into typed records according to the message configuration. The key to bonus point calculation is matching flow records against activity information; only flow records that match can take part in bonus point calculation. The Java API provided by Spark is used: the textFile method reads the transaction flow file records as an RDD<String>, which, together with the configuration file, is converted into an RDD<Map> by the map method. The Spark environment is initialized and a SparkContext object is created, specifying the cluster to access. The input data sources already imported into HDFS, including the flow file, the bonus point calculation rule file, and the bonus point calculation rule items, are read to generate the resilient distributed datasets (RDDs) that Spark supports. Data validation such as flow validity verification: this checks the type and length of the flow fields, and the filter method returns the valid flow records. Here DcompJavaRDD is a wrapper around Spark's JavaRDD, so that developers do not need to work directly with the underlying Java API during development. Blacklist filtering: flow records are filtered against the blacklists of specified merchants and customers, for whom no bonus points are calculated; this is again the filter method, which returns the flow information not on the blacklist. Rule matching: the rule engine is called to check whether a flow record matches any rule, and the rule engine returns the list of matched rules. All matched rules are added to the flow Map in the form Map<RuleIdList, [ruleid]>, and the saveTextFile API is called to save the effective chronological file.
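The three preprocessing steps above (validity verification, blacklist filtering, rule matching) can be illustrated with a small sketch. It uses plain Python lists in place of RDDs, with list comprehensions playing the role of Spark's map and filter. The pipe-delimited record layout, the field names, the validity checks, and the stand-in rule engine are all assumptions for illustration; only the `RuleIdList` key comes from the description.

```python
from typing import Any, Callable, Dict, List, Set

Flow = Dict[str, Any]
REQUIRED_FIELDS = ("customer_id", "merchant_id", "amount")  # hypothetical layout


def parse_line(line: str) -> Flow:
    """Convert one pipe-delimited text line into a field map (like map)."""
    return dict(zip(REQUIRED_FIELDS, line.split("|")))


def is_valid(flow: Flow) -> bool:
    """Validity check on field presence, type, and length (like filter)."""
    if any(name not in flow for name in REQUIRED_FIELDS):
        return False
    return flow["amount"].isdigit() and 0 < len(flow["customer_id"]) <= 16


def preprocess(
    lines: List[str],
    blacklist: Set[str],
    match_rules: Callable[[Flow], List[str]],
) -> List[Flow]:
    flows = [parse_line(line) for line in lines]
    flows = [f for f in flows if is_valid(f)]
    # Drop flows whose customer or merchant is blacklisted.
    flows = [
        f for f in flows
        if f["customer_id"] not in blacklist and f["merchant_id"] not in blacklist
    ]
    effective: List[Flow] = []
    for flow in flows:
        rule_ids = match_rules(flow)  # stand-in for the rule engine call
        if rule_ids:  # only matched flows take part in the calculation
            effective.append({**flow, "RuleIdList": rule_ids})
    return effective


def demo_rule_engine(flow: Flow) -> List[str]:
    """Hypothetical rule: transactions of at least 100 match rule R001."""
    return ["R001"] if int(flow["amount"]) >= 100 else []


lines = ["c1|m1|1200", "c2|m9|300", "c3|m1|abc", "c4|m1|50"]
effective = preprocess(lines, blacklist={"m9"}, match_rules=demo_rule_engine)
print(effective)
```

In this sample only the first record survives: the second is blacklisted by merchant, the third fails validation, and the fourth matches no rule.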

Embodiment 3:

In another embodiment of the present invention, in addition to processes S1 to S5 above, the calculation in process S3 includes one or more of the following:

Directly calculable bonus point calculation: performing basic bonus point calculation, single-transaction reward activity bonus point calculation, and multi-transaction reward activity bonus point calculation on the effective chronological file;

Associated bonus point calculation: performing associated single-transaction reward activity bonus point calculation and associated multi-transaction reward activity bonus point calculation on the effective chronological file;

Accumulation without calculation for now: only calling a database interface to update data for the effective chronological file.

Embodiment 4:

In another embodiment of the present invention, in addition to processes S1 to S5 above, adjusting the bonus point result in process S4 includes:

Adjusting the bonus point result according to a user-defined upper limit rule, and merging the bonus point result according to a user-defined category.

The user-defined upper limit rule may be, for example: for reward-type activities, bonus points must be capped according to the upper limit type configured in the rule, which includes a daily upper limit, a monthly upper limit, and a cumulative upper limit over the activity period. When adjusting bonus points: monthly upper limit value > daily upper limit value; per-activity upper limit value > daily upper limit value. Because bonus point upper limits apply along the customer + activity dimension, the trial-calculation bonus point details produced by the calculation need to be merged by customer + rule. Using Spark's mapToPair method, a user-defined function func is implemented that takes customer + activity number as the key and the flow record and bonus point value as the value, generating a JavaPairRDD of key-value pairs. The bonus point values produced by all of a customer's flow records under the activity are then summed, and the bonus points are adjusted, with the flow information returned as a List. Finally, using the filter method with the adjusted flow records as the source, the flow records with a bonus point value > 0 are retained, generating the adjusted effective bonus point flow detail file.
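The merge-then-cap adjustment could be sketched as follows, again with plain Python standing in for mapToPair and filter on a JavaPairRDD. The description does not say how an upper limit is distributed across a customer's individual flow records, so this sketch assumes points are granted in arrival order until the limit is used up; the tuple layout, the single per-activity limit, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (customer_id, activity_id, flow_id, trial_points)
Trial = Tuple[str, str, str, int]


def apply_upper_limits(trials: List[Trial], limits: Dict[str, int]) -> List[Trial]:
    """Group by (customer, activity), cap the group total, drop zero-point flows."""
    grouped: Dict[Tuple[str, str], List[Tuple[str, int]]] = defaultdict(list)
    for customer, activity, flow_id, points in trials:
        grouped[(customer, activity)].append((flow_id, points))  # key-value pairing

    adjusted: List[Trial] = []
    for (customer, activity), flows in grouped.items():
        remaining: Optional[int] = limits.get(activity)  # None means no limit
        for flow_id, points in flows:
            # Assumption: grant points in arrival order until the limit is reached.
            granted = points if remaining is None else min(points, remaining)
            if remaining is not None:
                remaining -= granted
            adjusted.append((customer, activity, flow_id, granted))

    # Keep only flows that still carry a positive bonus point value.
    return [trial for trial in adjusted if trial[3] > 0]


trials = [
    ("c1", "A1", "f1", 60),
    ("c1", "A1", "f2", 60),
    ("c1", "A1", "f3", 10),
    ("c2", "A1", "f4", 30),
]
effective_points = apply_upper_limits(trials, limits={"A1": 100})
print(effective_points)
```

With a limit of 100 on activity A1, customer c1 keeps 60 points on f1 and 40 on f2, and f3 is filtered out because nothing remains for it.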

Embodiment 5:

In another embodiment of the present invention, in addition to processes S1 to S5 above, the method further includes:

Reading the bonus point collection details, and updating the customer bonus point details in real time by calling a bonus point update component.

In practical application, the streaming computing framework Storm may be used to read the generated bonus point summary result details and call the bonus point update component to update the bonus point values in the customer bonus point detail table in real time. After bonus point calculation completes, the cumulative bonus point details need to be batch-updated into the Oracle database: the bonus point results in HDFS format are exported to the Oracle database by Sqoop, and the data is then batch-updated by a stored procedure.
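The batch update of cumulative bonus points can be illustrated with a small stand-in. This sketch deliberately swaps in Python's built-in sqlite3 module for the Oracle database, and two plain SQL statements for the Sqoop export and stored procedure, so it shows only the shape of the update; the table name, column names, and function name are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def batch_update(conn: sqlite3.Connection, summaries: List[Tuple[str, int]]) -> None:
    """Add each customer's newly earned points to the stored balance in one transaction."""
    with conn:  # commits on success, rolls back on error
        # Make sure every customer has a row before adding points.
        conn.executemany(
            "INSERT OR IGNORE INTO customer_points (customer_id, points) VALUES (?, 0)",
            [(customer_id,) for customer_id, _ in summaries],
        )
        conn.executemany(
            "UPDATE customer_points SET points = points + ? WHERE customer_id = ?",
            [(points, customer_id) for customer_id, points in summaries],
        )


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the relational database
conn.execute(
    "CREATE TABLE customer_points (customer_id TEXT PRIMARY KEY, points INTEGER NOT NULL)"
)
conn.execute("INSERT INTO customer_points VALUES ('c1', 100)")
batch_update(conn, [("c1", 35), ("c2", 9)])
balances = conn.execute(
    "SELECT customer_id, points FROM customer_points ORDER BY customer_id"
).fetchall()
print(balances)
```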

Against the business background of individual comprehensive bonus point calculation, the present invention uses a big data solution to calculate customer bonus points with higher efficiency. In-memory iterative computation reduces the network bandwidth bottleneck caused by file reads and writes. The big data synchronization tool not only solves the interaction between distributed storage and the relational database, but also allows the customer information and bonus point calculation rule information held in Oracle to be stored in HDFS. The Java API provided by Spark is wrapped to provide common bonus point calculation methods, and the business logic of bonus point calculation is implemented using RDD operations such as filtering, mapping, merging, and reduction. In addition, non-functional testing showed that Spark used only 1/10 of the computing resources used by Hadoop and took only 1/3 of the time taken by Hadoop. Spark Streaming in the Spark ecosystem can be used for later bonus point updates to deliver real-time result updates, which provides ample extensibility.

Fig. 2 is an architecture diagram of a bonus point calculation system based on big data processing according to an embodiment of the present invention. Referring to Fig. 2, the system includes:

an upload module 100, configured to upload transaction flow information to a distributed file system through a big data synchronization tool;

a preprocessing module 200, configured to preprocess the uploaded transaction flow information to obtain an effective chronological file;

a calculation module 300, configured to calculate the bonus point information in the effective chronological file to generate a bonus point result;

an adjustment module 400, configured to adjust the bonus point result to generate an adjusted effective bonus point chronological file;

a summarizing module 500, configured to read the effective bonus point chronological file and aggregate the effective bonus point chronological file according to different dimension categories to generate bonus point collection details.

In the upload module 100, uploading the transaction flow information to the distributed file system through the big data synchronization tool includes: uploading the transaction flow information to a NAS server through the big data synchronization tool, installing a distributed client on the NAS server, and then uploading the transaction flow information to the distributed file system through that client.

Fig. 3 is a block diagram of the preprocessing module shown in Fig. 2. Referring to Fig. 3, in another embodiment of the present invention, in addition to the modules described in the embodiment above, the preprocessing module of the system may further include:

a flow validity verification unit 210, configured to perform flow validity verification preprocessing on the flow information;

a blacklist filtering unit 220, configured to perform blacklist filtering preprocessing on the flow information;

a rule matching unit 230, configured to call a rule engine to perform rule matching preprocessing on the flow information.

Fig. 4 is a block diagram of the calculation module shown in Fig. 2. Referring to Fig. 4, in another embodiment of the present invention, in addition to the modules described in the embodiment above, the calculation module of the system may further include one or more of the following units:

a direct calculation unit 310, configured to perform basic bonus point calculation, single-transaction reward activity bonus point calculation, and multi-transaction reward activity bonus point calculation on the effective chronological file;

an associated calculation unit 320, configured to perform associated single-transaction reward activity bonus point calculation and associated multi-transaction reward activity bonus point calculation on the effective chronological file;

an accumulate-without-calculating unit 330, configured to only call a database interface to update data for the effective chronological file.

Fig. 5 is a block diagram of the adjustment module shown in Fig. 2. Referring to Fig. 5, in another embodiment of the present invention, in addition to the modules described in the embodiment above, the adjustment module of the system may further include:

an upper limit adjustment unit 410, configured to adjust the bonus point result according to a user-defined upper limit rule;

a merging unit 420, configured to merge the bonus point result according to a user-defined category.

In addition, in another embodiment of the present invention, in addition to the modules described in the embodiment above, the system may further include:

an update module, configured to read the bonus point collection details and update the customer bonus point details in real time by calling a bonus point update component.

It should be noted that the specific implementation of the bonus point calculation system based on big data processing is entirely consistent in content and effect with the corresponding embodiments of the bonus point calculation method described above, and the duplicated content is not repeated here.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software in combination with a hardware platform. Based on this understanding, all or part of the contribution that the technical solution of the present invention makes over the background art may be embodied in the form of a software product. The computer software product may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in parts of the embodiments.

Those skilled in the art should understand that the above disclosure presents only embodiments of the present invention and certainly cannot be used to limit the scope of rights of the invention; equivalent changes made according to the embodiments of the present invention still fall within the scope covered by the claims of the present invention.

Claims

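1. A bonus point calculation method based on big data processing, characterized in that the method includes:

uploading transaction flow information to a distributed file system through a big data synchronization tool;

preprocessing the uploaded transaction flow information to obtain an effective chronological file;

calculating the bonus point information in the effective chronological file to generate a bonus point result;

adjusting the bonus point result to generate an adjusted effective bonus point chronological file;

reading the effective bonus point chronological file, and aggregating the effective bonus point chronological file according to different dimension categories to generate bonus point collection details.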

2. The method as claimed in claim 1, characterized in that uploading the transaction flow information to the distributed file system through the big data synchronization tool includes:

uploading the transaction flow information to a NAS server through the big data synchronization tool, installing a distributed client on the NAS server, and then uploading the transaction flow information to the distributed file system through the client.

3. The method as claimed in claim 2, characterized in that preprocessing the uploaded transaction flow information includes:

reading the transaction flow information from the distributed file system, and performing flow validity verification, blacklist filtering, and rule matching on the transaction flow information.

4. The method as claimed in claim 3, characterized in that the calculation includes one or more of the following:

directly calculable bonus point calculation, for performing basic bonus point calculation, single-transaction reward activity bonus point calculation, and multi-transaction reward activity bonus point calculation on the effective chronological file;

associated bonus point calculation, for performing associated single-transaction reward activity bonus point calculation and associated multi-transaction reward activity bonus point calculation on the effective chronological file;

accumulation without calculation for now, for only calling a database interface to update data for the effective chronological file.

5. The method as claimed in claim 4, characterized in that adjusting the bonus point result includes:

adjusting the bonus point result according to a user-defined upper limit rule, and merging the bonus point result according to a user-defined category.

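6. The method as claimed in any one of claims 1 to 5, characterized in that the method further includes:

reading the bonus point collection details, and updating the customer bonus point details in real time by calling a bonus point update component.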

7. A bonus point calculation system based on big data processing, characterized in that the system includes:

an upload module, configured to upload transaction flow information to a distributed file system through a big data synchronization tool;

a preprocessing module, configured to preprocess the uploaded transaction flow information to obtain an effective chronological file;

a calculation module, configured to calculate the bonus point information in the effective chronological file to generate a bonus point result;

an adjustment module, configured to adjust the bonus point result to generate an adjusted effective bonus point chronological file;

a summarizing module, configured to read the effective bonus point chronological file and aggregate the effective bonus point chronological file according to different dimension categories to generate bonus point collection details.

8. The system as claimed in claim 7, characterized in that uploading the transaction flow information to the distributed file system through the big data synchronization tool includes:

uploading the transaction flow information to a NAS server through the big data synchronization tool, installing a distributed client on the NAS server, and then uploading the transaction flow information to the distributed file system through the client.

9. The system as claimed in claim 8, characterized in that the preprocessing module includes:

a flow validity verification unit, configured to perform flow validity verification preprocessing on the flow information;

a blacklist filtering unit, configured to perform blacklist filtering preprocessing on the flow information;

a rule matching unit, configured to call a rule engine to perform rule matching preprocessing on the flow information.

10. The system as claimed in claim 9, characterized in that the calculation module includes one or more of the following units:

a direct calculation unit, configured to perform basic bonus point calculation, single-transaction reward activity bonus point calculation, and multi-transaction reward activity bonus point calculation on the effective chronological file;

an associated calculation unit, configured to perform associated single-transaction reward activity bonus point calculation and associated multi-transaction reward activity bonus point calculation on the effective chronological file;

an accumulate-without-calculating unit, configured to only call a database interface to update data for the effective chronological file.

11. The system as claimed in claim 10, characterized in that the adjustment module includes:

an upper limit adjustment unit, configured to adjust the bonus point result according to a user-defined upper limit rule;

a merging unit, configured to merge the bonus point result according to a user-defined category.

12. The system as claimed in any one of claims 7 to 11, characterized in that the system further includes:

an update module, configured to read the bonus point collection details and update the customer bonus point details in real time by calling a bonus point update component.