CN112131302A

CN112131302A - Business data analysis method and platform

Info

Publication number: CN112131302A
Application number: CN202010936462.XA
Authority: CN
Inventors: 张俊; 熊招; 单鲁军
Original assignee: Yinsheng Payment Service Co Ltd
Current assignee: Yinsheng Payment Service Co Ltd
Priority date: 2020-09-08
Filing date: 2020-09-08
Publication date: 2020-12-25
Anticipated expiration: 2040-09-08
Also published as: CN112131302B

Abstract

The invention discloses a business data analysis method that generates the base data required for each dimension by batch-preprocessing the most basic business flow data. It uses the sharding technology of the Elastic Job open-source middleware and formulates an optimal sharding strategy so that each task can process data in batches of millions to tens of millions of records: the mass data is split into configurable small blocks that are batch-processed independently of one another, and multithreading is combined to complete the daily analysis of the mass data. The business data analysis method uses distributed task technology, combined with the high-performance read and write characteristics of MongoDB big data storage, to clean and analyze big data.

Description

Business data analysis method and platform

Technical Field

The invention relates to the field of big data analysis, in particular to a business data analysis method.

Background

As the company's business has grown, its data has reached the scale of hundreds of millions of records. Data of different dimensions must be cleaned, analyzed, and extracted from this mass of data in a targeted way to support macro-level analysis, data mining, business growth, data services for each line of business, and the like.

How to effectively clean, filter, analyze, process and store mass data is a problem which needs to be solved urgently.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a business data analysis method that uses distributed task technology, combined with the high-performance read and write characteristics of MongoDB big data storage, to clean and analyze big data.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A business data analysis method, comprising:

S1, data collection: the data comes from the main business flow tables of each business group;

S2, data preprocessing: screening and integrating the business data sets collected from one or more data sources to ensure the validity and value of the data to be analyzed;

S3, data processing and analysis: analyzing and processing the mass data in parallel, mining the data correlations in the big data set, forming a profile that describes the patterns or attribute rules of an object, and constructing a data model and massive training data to improve the accuracy of data analysis and prediction;

S4, data storage: storing the processed data in a MongoDB database;

S5, data visualization and application: the data is presented to the user visually as computer graphics or images and can be processed interactively with the user, and the data results are provided externally as an API service to meet application scenarios.

Further, in step S1: the main business flow table records the earliest and most original data flow of each business function.

Further, in step S1: the main business flow tables comprise an order flow table, a transaction flow table, and a user behavior log table.

Further, in step S2: the data sources comprise homogeneous or heterogeneous databases, file systems, and service interfaces.

Further, in step S3: using distributed processing technology and storage, the mass data is analyzed and processed in parallel, distributed statistical analysis is performed on various structured and unstructured data, and distributed mining is performed on unknown data.

Further, in step S3: the parallel analysis and processing of the mass data comprises sorting, statistics, processing, clustering and classification, and correlation analysis.

Further, in step S4: after the data is stored, an efficient query service can be provided.

The invention also discloses a business data analysis platform, comprising:

a data collection module, used for collecting data from the main business flow tables of each business group;

a data preprocessing module, used for screening and integrating the business data sets collected from one or more data sources to ensure the validity and value of the data to be analyzed;

a data processing and analysis module, used for analyzing and processing the mass data in parallel, mining the data correlations in the big data set, forming a profile that describes the patterns or attribute rules of an object, and constructing a data model and massive training data to improve the accuracy of data analysis and prediction;

a data storage module, used for storing the processed data in a MongoDB database;

a data visualization and application module, used for presenting the data to the user visually as computer graphics or images, processing it interactively with the user, and providing the data results externally as an API service to meet application scenarios.

Further, the main business flow table records the earliest and most original data flow of each business function.

Further, the main business flow tables comprise an order flow table, a transaction flow table, and a user behavior log table.

Further, the data sources comprise homogeneous or heterogeneous databases, file systems, and service interfaces.

Further, using distributed processing technology and storage, the mass data is analyzed and processed in parallel, distributed statistical analysis is performed on various structured and unstructured data, and distributed mining is performed on unknown data.

Further, the parallel analysis and processing of the mass data comprises sorting, statistics, processing, clustering and classification, and correlation analysis.

Further, after the data is stored, an efficient query service can be provided.

The invention has the beneficial effects that:

1. The technical scheme uses distributed, highly concurrent execution and processes the data in multiple batches, which markedly improves overall data generation performance and ensures that the shutdown of a single server does not affect the cluster as a whole. A real production case: before the scheme was adopted, producing one day's base data for a given service could take 3-5 hours; with the scheme, production completes within 30 minutes. Each batch processes flow data on the order of millions of records, and analyzing hundreds of thousands of base records through the model and storing them takes only about 20 minutes. This ensures efficient data production as well as reliable and accurate data.

2. The scheme produces the required business scenario data in advance and provides the corresponding data query and operation functions through a well-designed combination of MongoDB, MySQL, and Redis storage. Because the data is preprocessed, the system can respond immediately whenever the business side needs data and serve the produced data in real time. This achieves millisecond-level data response for business scenarios, improves front-end processing capacity and user experience, and strengthens the product's brand.

Drawings

The invention is further illustrated with reference to the following figures and examples.

FIG. 1 is a schematic diagram of a big data processing flow structure of a business data analysis method of the present invention;

FIG. 2 is a schematic diagram of the weekly report board in an embodiment of a business data analysis method of the present invention;

FIG. 3 is a schematic diagram of the maximum value in an embodiment of a business data analysis method of the present invention;

FIG. 4 is a schematic diagram of the maximum value time period in an embodiment of a business data analysis method of the present invention;

FIG. 5 is a schematic diagram of the comparison with the previous week in an embodiment of a business data analysis method of the present invention;

FIG. 6 is a schematic diagram of the one-week transaction summary in an embodiment of a business data analysis method of the present invention;

FIG. 7 is a schematic diagram of the collection methods in an embodiment of a business data analysis method of the present invention;

FIG. 8 is a flow chart of the management of batch tasks for a business data analysis method of the present invention;

FIG. 9 is a task-specific data production flow diagram of a business data analysis method of the present invention;

FIG. 10 is an Elastic Job task sharding diagram of a business data analysis method of the present invention.

Detailed Description

The concept, specific structure, and technical effects of the present invention are described clearly and completely below with reference to the embodiments and the accompanying drawings, so that its objects, features, and effects can be fully understood. The described embodiments are only some of the embodiments of the present invention, not all of them; other embodiments that those skilled in the art can obtain from them without inventive effort all fall within the protection scope of the present invention. In addition, the coupling/connection relations referred to in this patent do not mean that the components are directly connected; rather, a better connection structure can be formed by adding or removing auxiliary connecting components according to the specific implementation. The technical features of the invention can be combined with one another provided they do not conflict.

Abbreviations and key term definitions:

BDA: Business Data Analysis, the abbreviation of the business data analysis platform system;

Elastic Job: a third-party distributed task technology;

MongoDB: big data storage database software;

Redis: a cache database;

Spring: a third-party open-source Java framework.

The invention discloses a business data analysis method that produces and processes targeted analysis data from the mass of original data. As shown in FIG. 1, it comprises the following steps:

1. Data collection

The data comes from the main business flow tables of each business group, such as the order flow table, the transaction flow table, and the user behavior log table, which record the earliest and most original data flow of each business function.

2. Data pre-processing

Big data collection usually draws on one or more data sources, including homogeneous or heterogeneous databases, file systems, and service interfaces, and is affected by factors such as business controls and data conflicts. The collected big data set therefore needs to be preprocessed first to ensure the accuracy and value of the big data analysis and prediction results.

3. Data processing and analysis

The choice of distributed processing technology depends on the storage form, the business data type, and so on. The mass data is analyzed and processed in parallel: distributed statistical analysis is performed on various structured and unstructured data, and distributed mining is performed on unknown data. The processing comprises sorting, statistics, processing, clustering and classification, correlation analysis, and the like. It mines the data correlations in the big data set, forms a profile that describes the patterns or attribute rules of an object, and constructs a data model and massive training data to improve the accuracy of data analysis and prediction.

4. Data storage

A MongoDB database is used to store the processed data and provide an efficient query service.

5. Data visualization and application

5.1 Data visualization presents the results to the user intuitively as computer graphics or images and allows interactive processing with the user. Data visualization helps uncover the regularities hidden in large volumes of business data to support management decisions, and it greatly improves the intuitiveness of big data analysis results, making them easier for users to understand and use.

5.2 The data results are provided externally as an API service to meet various application scenarios.
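The five steps above can be read as a simple pipeline. The following is a minimal single-process sketch, not the patented implementation: the record fields (`merchant_id`, `amount`, `status`), the filtering rule, and the use of a plain dictionary in place of a MongoDB collection are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw flow records; the field names are assumptions.
RAW_FLOW = [
    {"merchant_id": "M1", "amount": 120.0, "status": "SUCCESS"},
    {"merchant_id": "M1", "amount": 80.0, "status": "SUCCESS"},
    {"merchant_id": "M2", "amount": 50.0, "status": "FAILED"},
    {"merchant_id": "M2", "amount": None, "status": "SUCCESS"},
]


def collect():
    """Step 1: read records from the main business flow tables."""
    return list(RAW_FLOW)


def preprocess(records):
    """Step 2: keep only successful records that carry an amount."""
    return [
        r for r in records
        if r["status"] == "SUCCESS" and r["amount"] is not None
    ]


def analyze(records):
    """Step 3: aggregate total amount and transaction count per merchant."""
    totals = defaultdict(lambda: {"amount": 0.0, "count": 0})
    for r in records:
        totals[r["merchant_id"]]["amount"] += r["amount"]
        totals[r["merchant_id"]]["count"] += 1
    return dict(totals)


def store(results, database):
    """Step 4: persist results; the dict stands in for a MongoDB collection."""
    database.update(results)


def query(database, merchant_id):
    """Step 5: serve a stored result, as an API endpoint might."""
    return database.get(merchant_id)


database = {}
store(analyze(preprocess(collect())), database)
print(query(database, "M1"))
```

In a real deployment each stage would be distributed and the store would be a database; the sketch only shows how the output of each step feeds the next.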

Because the business scenarios served are numerous and varied and the data is large and complex, the invention cannot be presented here as a complete product. A real business scenario is instead taken as an example, with the following product case:

Product case requirement: to serve merchants better, the system launches a merchant weekly report function that gives each merchant a summary of the previous calendar week's data and offers business tips based on big data analysis.

A. Weekly report board, as shown in FIG. 2:

Requirement analysis:

First-level data: transaction amount, number of transactions.

Calculated data: average amount per transaction.

Analysis data: percentile rank within the same city.

Implementation principle:

1. The original transaction flow contains the most basic transaction information, including merchant number, transaction amount, and transaction time.

2. All transaction flow records of each merchant for the week are summarized to obtain the first-level data (the week's total transaction amount and number of transactions), from which the average is calculated.

3. Each merchant has its own UnionPay region code. Once the first-level data (generated statistically according to the business model rules) has been summarized for the merchants sharing a region code, all merchants with that region code are sorted and screened according to the business rule model to obtain an analysis value (second-level analysis data produced from the first-level data). Finally, a rule calculation on the analysis data yields the product result: "ahead of xx% of merchants in city xx".
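The weekly report board calculation can be sketched as follows. This is an illustrative sketch only: the field names, the sample region codes, and the ranking rule (share of other merchants in the same region with a strictly lower weekly amount) are assumptions, since the source does not spell out the business rule model.

```python
from collections import defaultdict


def weekly_board(transactions, merchant_regions):
    """Build per-merchant weekly totals, averages, and same-region rank."""
    # First-level data: total amount and transaction count per merchant.
    totals = defaultdict(lambda: {"amount": 0.0, "count": 0})
    for t in transactions:
        totals[t["merchant_id"]]["amount"] += t["amount"]
        totals[t["merchant_id"]]["count"] += 1

    # Group merchants that share a region code.
    by_region = defaultdict(list)
    for merchant_id in totals:
        by_region[merchant_regions[merchant_id]].append(merchant_id)

    board = {}
    for region, merchants in by_region.items():
        for m in merchants:
            amount = totals[m]["amount"]
            count = totals[m]["count"]
            others = [o for o in merchants if o != m]
            beaten = sum(1 for o in others if totals[o]["amount"] < amount)
            # Second-level data: percent of same-region merchants exceeded.
            percent = 100.0 * beaten / len(others) if others else 0.0
            board[m] = {
                "amount": amount,
                "count": count,
                "average": amount / count,  # calculated data
                "region": region,
                "beats_percent": percent,
            }
    return board


# Hypothetical sample data for one week.
transactions = [
    {"merchant_id": "M1", "amount": 100.0},
    {"merchant_id": "M1", "amount": 300.0},
    {"merchant_id": "M2", "amount": 50.0},
    {"merchant_id": "M3", "amount": 200.0},
    {"merchant_id": "M4", "amount": 10.0},
]
merchant_regions = {"M1": "R01", "M2": "R01", "M3": "R01", "M4": "R02"}

board = weekly_board(transactions, merchant_regions)
print(board["M1"])
```

The comparison loop is quadratic in the number of merchants per region, which keeps the sketch readable; at scale, sorting each region once and deriving ranks from positions would be the more natural choice.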

B. Maximum value, as shown in FIG. 3:

Requirement analysis:

First-level data: the flow record of the largest single transaction of the week.

Implementation principle: same as A; the largest transaction seen so far is recorded and updated.

C. Maximum value time period, as shown in FIG. 4:

Requirement analysis:

Second-level data: transaction summary data for each of the 24 hourly time periods over the week.

Implementation principle: 1. compute the data for each time period of each day; 2. summarize the value of each time period; 3. record and update the time period with the maximum value.
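The three steps for the maximum value time period can be sketched as below. The sketch assumes that each record carries an ISO-8601 `time` string and an `amount`, and that "maximum" means the highest summed amount; these are illustrative assumptions, not details given in the source.

```python
from collections import defaultdict
from datetime import datetime


def peak_hour(transactions):
    """Return (hour, total) for the hour of day with the highest weekly total."""
    per_hour = defaultdict(float)
    for t in transactions:
        # Steps 1 and 2: bucket each record by hour of day and sum the amounts.
        hour = datetime.fromisoformat(t["time"]).hour
        per_hour[hour] += t["amount"]
    # Step 3: pick the time period with the maximum summed value.
    return max(per_hour.items(), key=lambda item: item[1])


# Hypothetical records spread over several days of one week.
transactions = [
    {"time": "2020-08-31T09:15:00", "amount": 100.0},
    {"time": "2020-09-01T09:40:00", "amount": 50.0},
    {"time": "2020-09-02T18:05:00", "amount": 120.0},
]

hour, total = peak_hour(transactions)
print(hour, total)
```

Note that `max` raises `ValueError` on an empty input, so a caller would need to handle a week with no transactions separately.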

D. Comparison with the previous week, as shown in FIG. 5:

Requirement analysis:

Historical data: the percentage difference compared with the previous week's transactions.

Implementation principle: only the historical production data needs to be queried.
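Given the two weekly totals already produced by earlier batches, the comparison reduces to a single formula. The sketch below is illustrative; returning `None` when the previous week has no transactions is an assumption about how that case should be handled.

```python
def week_over_week_percent(current_total, previous_total):
    """Percentage change of this week's total relative to last week's."""
    if previous_total == 0:
        # No baseline to compare against; the handling is an assumption.
        return None
    return (current_total - previous_total) / previous_total * 100.0


change = week_over_week_percent(1200.0, 1000.0)
print(change)
```

A positive value means the merchant traded more than in the previous week, and a negative value means less.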

E. One-week transaction summary, as shown in FIG. 6:

Requirement analysis:

First-level data: transaction information for each day of the week.

Implementation principle: same as A. 1. Summarize the data for each day; 2. record and update the date with the largest transaction volume.

F. Collection methods, as shown in FIG. 7:

Requirement analysis:

Second-level calculated data: the share of each collection method.

Implementation principle: same as A. 1. Summarize each collection method; 2. calculate the share of each.
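The share calculation can be sketched as follows. Whether shares are taken by transaction count or by amount is not stated in the source; this sketch assumes transaction count, and the method names are hypothetical.

```python
from collections import Counter


def method_shares(transactions):
    """Return each collection method's share of the transaction count."""
    # Step 1: summarize each collection method.
    counts = Counter(t["method"] for t in transactions)
    total = sum(counts.values())
    # Step 2: calculate the share of each method.
    return {method: count / total for method, count in counts.items()}


# Hypothetical records; the method names are illustrative only.
transactions = [
    {"method": "qr_code"},
    {"method": "qr_code"},
    {"method": "card_swipe"},
    {"method": "online"},
]

shares = method_shares(transactions)
print(shares)
```

Switching to shares by amount would only require summing `amount` per method instead of counting records.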

The above shows the data produced by the technical scheme applied to one scenario of a product. Real products have many application scenarios that are very complex, and the logic that implements them is equally complex; only one simple product feature is used here, in the simplest way, as an example.

The technical scheme generates the base data required for each dimension by batch-preprocessing the most basic business flow data; the main implementation flow is shown in FIGS. 8 and 9.

The scheme uses the sharding technology of the Elastic Job open-source middleware and formulates an optimal sharding strategy to process the daily mass of data. Each task divides the data it must process into a configurable number of small pieces that are batch-processed independently of one another, and efficient, fast data processing and analysis is achieved by running them in parallel across multiple servers and multiple threads. See FIG. 10.
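The sharding idea can be illustrated in a single process. The sketch below does not use Elastic Job or its API; it only imitates the concept under stated assumptions: records are assigned to shards by `id` modulo the shard count (a hypothetical strategy), and threads stand in for the servers that would each run one shard.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_of(record_id, shard_count):
    """Assign a record to a shard; modulo assignment is an assumption."""
    return record_id % shard_count


def run_shard(shard_item, shard_count, records):
    """Process only the records that belong to this shard."""
    return sum(
        r["amount"] for r in records
        if shard_of(r["id"], shard_count) == shard_item
    )


def run_job(records, shard_count):
    """Run every shard independently and collect the partial results."""
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        return list(
            pool.map(
                lambda item: run_shard(item, shard_count, records),
                range(shard_count),
            )
        )


# Hypothetical flow records with integer amounts.
records = [{"id": i, "amount": i * 10} for i in range(10)]

partials = run_job(records, shard_count=3)
print(partials, sum(partials))
```

Because no shard reads another shard's records, a failed shard can be re-run on its own, and changing `shard_count` changes the block size without touching the processing logic.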

The system is highly decoupled and does not depend on other systems; tasks are independent of one another, and any predecessor-successor relations between them are handled as part of the processing.

Based on system monitoring, batch tasks are executed when server load is lowest, minimizing the pressure on the servers and the database.
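As a simple illustration of choosing the execution window from monitoring data, the sketch below picks the hour with the lowest observed load. The load samples and their scale (a 0 to 1 utilization figure per hour) are hypothetical; the source does not describe how the monitoring data is represented.

```python
def quietest_hour(load_by_hour):
    """Return the hour whose observed load is lowest."""
    return min(load_by_hour, key=load_by_hour.get)


# Hypothetical average utilization per hour of day from monitoring.
load_by_hour = {0: 0.35, 3: 0.12, 12: 0.80, 21: 0.55}

batch_hour = quietest_hour(load_by_hour)
print(batch_hour)
```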

Through multiple scheduled tasks, combined with fault-tolerance mechanisms such as manually triggered tasks and re-runs, each task can accurately produce its specified base data.
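A re-run mechanism of this kind can be sketched as a retry wrapper. The attempt limit, the status names, and the shape of the returned record are assumptions for illustration; the source only states that failed tasks can be re-run.

```python
def run_with_rerun(task, max_attempts=3):
    """Run a batch task, re-running it on failure up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "SUCCESS", "result": task(), "attempts": attempt}
        except Exception as error:  # record the failure and try again
            last_error = error
    return {
        "status": "FAILED",
        "error": str(last_error),
        "attempts": max_attempts,
    }


# Hypothetical task that fails twice before succeeding.
calls = {"n": 0}


def flaky_batch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "produced"


outcome = run_with_rerun(flaky_batch, max_attempts=3)
print(outcome)
```

A task that still fails after the last attempt is returned with status `FAILED`, which is the point at which a manually triggered re-run would take over.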

By combining the fast reads and writes of MongoDB, the relational characteristics of MySQL, and the caching technology of Redis, the system can batch-produce data using the design best suited to each business analysis model dimension.
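One common way to combine a cache with a document store is a cache-aside read, sketched below. The source does not say which caching pattern is used, so the pattern itself is an assumption; plain dictionaries stand in for Redis and a MongoDB collection, and the key format is hypothetical.

```python
class ReportReader:
    """Cache-aside reader: check the cache first, then fall back to the store."""

    def __init__(self, cache, store):
        self.cache = cache  # stand-in for Redis
        self.store = store  # stand-in for a MongoDB collection
        self.store_reads = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read the pre-produced result and remember it.
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value


# Hypothetical pre-produced weekly report keyed by merchant and week.
store = {"M1:2020-W36": {"amount": 400.0, "count": 2}}
reader = ReportReader(cache={}, store=store)

first = reader.get("M1:2020-W36")
second = reader.get("M1:2020-W36")
print(first, reader.store_reads)
```

Because the reports are produced in advance by the batch tasks, a read never triggers computation; the second call is served from the cache without touching the store.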

The scheme establishes maintainability and extensibility of the data and traceability of historical data. For example, when the business model analyzes statistics for a business dimension over a time interval, the work can be divided into daily batches, monthly batches, yearly batches, and so on. If a data problem is found or a batch task fails, the scheme's fault-tolerant re-run processing regenerates the affected data in a targeted way, so the system offers high traceability in data production.

As for performance during data production, the batch task load can be distributed evenly across different servers according to the sharding strategy and the emphasis of each task, making full use of the servers' CPU and memory.

The invention uses distributed task technology to split a task horizontally into multiple subtasks, which greatly improves efficiency and makes full use of system resources. The subtasks are independent and do not affect one another. In addition, the distributed mode provides disaster recovery against server downtime and keeps the service stable while analysis data is being produced.

FIG. 8 is a flow chart of batch task management, which includes:

1. To produce a piece of business data, the total task for that job is generated first and its state is recorded;

2. According to the business rule model, the detail tasks to be executed are then generated, one detail task per valid data item, and their states are recorded.
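The relation between a total task and its detail tasks can be sketched with two small record types. The state names (`PENDING`, `RUNNING`, `SUCCESS`, `FAILED`), the class names, and the rule that the total task succeeds only if every detail task succeeds are assumptions for illustration; the source only states that both levels are generated and their states recorded.

```python
from dataclasses import dataclass, field


@dataclass
class DetailTask:
    key: str  # the valid data item this detail task covers
    state: str = "PENDING"


@dataclass
class TotalTask:
    name: str
    state: str = "PENDING"
    details: list = field(default_factory=list)


def create_total_task(name, valid_keys):
    """Generate the total task, then one detail task per valid data item."""
    total = TotalTask(name=name)
    total.details = [DetailTask(key=k) for k in valid_keys]
    return total


def execute(total, handler):
    """Run every detail task, recording each state and the overall state."""
    total.state = "RUNNING"
    for detail in total.details:
        try:
            handler(detail.key)
            detail.state = "SUCCESS"
        except Exception:
            detail.state = "FAILED"
    succeeded = all(d.state == "SUCCESS" for d in total.details)
    total.state = "SUCCESS" if succeeded else "FAILED"
    return total


def produce_report(merchant_id):
    # Hypothetical handler that fails for one merchant.
    if merchant_id == "M2":
        raise ValueError("missing flow data")


total = execute(create_total_task("weekly_report", ["M1", "M2", "M3"]), produce_report)
print(total.state, [d.state for d in total.details])
```

Recording the state per detail task is what makes a targeted re-run possible: only the detail tasks marked `FAILED` need to be executed again.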

The invention also discloses a business data analysis platform, comprising:

Furthermore, after the data is stored, efficient query service can be provided.

The beneficial effects of the technical scheme are as follows:

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A business data analysis method, comprising:

S3, data processing and analysis: using the sharding technology of the Elastic Job open-source middleware and formulating an optimal sharding strategy so that each task processes data in batches of millions to tens of millions of records, splitting the mass data into configurable small blocks that are batch-processed independently of one another, analyzing and processing the mass data in parallel, mining the data correlations in the big data set, forming a profile that describes the patterns or attribute rules of an object, and constructing a data model and massive training data to improve the accuracy of data analysis and prediction;

S4, data storage: storing the processed data in a MongoDB database;

2. The business data analysis method of claim 1, wherein in step S1: the main business flow table records the earliest and most original data flow of each business function.

3. The business data analysis method of claim 1, wherein in step S1: the main business flow tables comprise an order flow table, a transaction flow table, and a user behavior log table.

4. The business data analysis method of claim 1, wherein in step S2: the data sources comprise homogeneous or heterogeneous databases, file systems, and service interfaces.

5. The business data analysis method of claim 1, wherein in step S3: using distributed processing technology and storage, the mass data is analyzed and processed in parallel, distributed statistical analysis is performed on various structured and unstructured data, and distributed mining is performed on unknown data.

6. The business data analysis method of claim 1, wherein in step S3: the parallel analysis and processing of the mass data comprises sorting, statistics, processing, clustering and classification, and correlation analysis.

7. The business data analysis method of claim 1, wherein in step S4: after the data is stored, an efficient query service can be provided.

8. A business data analytics platform, comprising:

9. A business data analytics platform as claimed in claim 8, wherein: the main business flow table records the earliest and most original data flow of each business function.

10. A business data analytics platform as claimed in claim 8, wherein: the main business flow tables comprise an order flow table, a transaction flow table, and a user behavior log table.

11. A business data analytics platform as claimed in claim 8, wherein: the data sources comprise homogeneous or heterogeneous databases, file systems, and service interfaces.

12. A business data analytics platform as claimed in claim 8, wherein: using distributed processing technology and storage, the mass data is analyzed and processed in parallel, distributed statistical analysis is performed on various structured and unstructured data, and distributed mining is performed on unknown data.

13. A business data analytics platform as claimed in claim 8, wherein: the parallel analysis and processing of the mass data comprises sorting, statistics, processing, clustering and classification, and correlation analysis.

14. A business data analytics platform as claimed in claim 8, wherein: after the data is stored, an efficient query service can be provided.