CN112487068A - Data statistical analysis system and method

Info

Publication number: CN112487068A
Application number: CN201910856752.0A
Authority: CN
Inventors: 龚文文; 叶军; 陶海洋
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2021-03-12
Also published as: WO2021047506A1

Abstract

The embodiment of the invention discloses a data statistical analysis system and a method, wherein the system comprises: an ADMA application algorithm unit; the ADMA application algorithm unit includes: the table modeling module is used for creating a spark table; the algorithm modeling module is used for providing an sql algorithm; the first task modeling module is used for creating a spark task according to the sql algorithm; and the second task modeling module is used for creating the ETL task. Therefore, the ADMA application algorithm is standardized, the development workload is greatly reduced, and the development and maintenance cost of a statistical analysis system is reduced.

Description

Data statistical analysis system and method

Technical Field

The embodiment of the invention relates to the field of data statistical analysis, in particular to a data statistical analysis system and a data statistical analysis method.

Background

With the evolution of communication network technology from 3G (3rd generation) to 4G (4th generation) and from 4G to 5G (5th generation), the traffic consumed by users keeps increasing, and the demand for faster and more stable services is becoming more pronounced. Operators are striving to provide such services. As the number of users grows, the scale of the services continuously expands and more and more service data is generated, so operators need more and more data statistical analysis indexes to monitor and ensure the stable operation of the services.

For this reason, a large number of data statistical analysis projects, such as statistics servers, log servers and operation and maintenance operation servers, have appeared. They can satisfy the needs of the major operators, but the projects are significantly fragmented. For example, the log server calls a shell script for data analysis through a crontab scheduled task and outputs the result to an ES index, whereas the statistics server outputs its result to an Oracle or GBase data table. If a data analysis task needs to be executed again, the log server requires logging in to a Linux server to execute the shell script manually, whereas the statistics server requires logging in to a database to execute a stored procedure manually, and so on.

It can be seen that these statistical analysis projects involve a large number of components (ES, Oracle, GBase, etc.), and their underlying data processing mechanisms differ from each other. Delivering the projects requires too much manpower and leads to repeated development, so the development and maintenance cost of the statistical analysis system is very high.

Disclosure of Invention

In view of this, an embodiment of the present invention provides a data statistical analysis system, including: an ADMA application algorithm unit;

the ADMA application algorithm unit includes:

the table modeling module is used for creating a spark table;

the algorithm modeling module is used for providing an sql algorithm;

the first task modeling module is used for creating a spark task according to the sql algorithm;

and the second task modeling module is used for creating the ETL task.

The embodiment of the invention also provides a data statistical analysis method, which comprises the following steps:

the ADMA application algorithm unit creates a spark table, a spark task and an ETL task;

wherein the spark tables comprise: a user information spark table, a user information preprocessing spark table and a user total number index spark table; the spark tasks comprise: a user information data mapping task, a user information preprocessing task and a user total number index task; the ETL tasks comprise: a user information ETL task.

The technical scheme provided by the embodiment of the invention standardizes the ADMA application algorithm, greatly reduces the development workload and reduces the development and maintenance cost of a statistical analysis system.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the example serve to explain the principles of the invention and not to limit the invention.

Fig. 1 is a schematic structural diagram of a data statistical analysis system according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a data statistical analysis system according to another embodiment of the present invention;

Fig. 3 is a schematic flow chart of a data statistical analysis method according to an embodiment of the present invention;

Fig. 4 is a schematic flow chart of a data statistical analysis method according to another embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.

Fig. 1 is a schematic structural diagram of a data statistical analysis system according to an embodiment of the present invention. As shown in Fig. 1, the system includes: an ADMA application algorithm unit;

the ADMA application algorithm unit includes:

the table modeling module is used for creating a spark table;

the algorithm modeling module is used for providing an sql algorithm;

the first task modeling module is used for creating a spark task according to the sql algorithm;

and the second task modeling module is used for creating the ETL task.

Wherein the spark table comprises: a user information spark table, a user information preprocessing spark table and a user total number index spark table;

the spark tasks comprise: a user information data mapping task, a user information preprocessing task and a user total number index task;

the ETL tasks comprise: a user information ETL task.

The table modeling module is specifically used for creating the user information spark table, the user information preprocessing spark table and the user total number index spark table according to the xml file of the table and the xml file of the summary table;

the algorithm modeling module is specifically used for instantiating the sql algorithm in sql according to the configured algorithm .sql file, algorithm .xml file and algorithm .conf file;

the first task modeling module is specifically used for creating a user information data mapping task, a user information preprocessing task and a user total number index task by adopting the instantiated sql algorithm according to a task xml file and a virtual task xml file;

the second task modeling module is specifically used for creating the user information ETL task according to an ETL rule.

The ETL rule comprises the xml file of the table, the xml file of the summary table, the configured algorithm .sql file, the algorithm .xml file, the algorithm .conf file, the task xml file and the virtual task xml file, wherein the ETL rule adopts standardized versions.

The system further includes:

a data acquisition unit and a storage unit;

the data acquisition unit is used for introducing user information original data and outputting the user information original data to the ADMA application algorithm unit;

the ADMA application algorithm unit further includes: an ETL module and a computing module;

the ETL module is used for calling the user information ETL task to process the user information original data and then outputting the processed user information original data to the computing module;

the computing module is used for calling the user information data mapping task to map the processed data to the user information spark table, calling the user information preprocessing task to preprocess the processed data and write the preprocessed data to the user information preprocessing spark table, and calling the user total number index task to aggregate index data from the processed data and write the aggregated index data to the user total number index spark table; and then saving the user information spark table, the user information preprocessing spark table and the user total number index spark table in the storage unit.

The user information data mapping task, the user information preprocessing task and the user total number index task run in a data-driven mode, and the user information ETL task runs in a scheduled (timed) execution mode.
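The flow from the ETL output to the three tables can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 as a stand-in for Spark SQL, and the table names (user_info, user_info_pre, user_total), the columns and the sample records are hypothetical and not taken from the embodiment.

```python
import sqlite3

# Hypothetical ETL output: (user_id, region, status) records.
etl_output = [
    ("u1", " sichuan ", "active"),
    ("u2", "Sichuan", "active"),
    ("u3", "Beijing", "inactive"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_info (user_id TEXT, region TEXT, status TEXT);
    CREATE TABLE user_info_pre (user_id TEXT, region TEXT);
    CREATE TABLE user_total (region TEXT, total INTEGER);
""")

# Data mapping task: load the processed records into the user information table.
conn.executemany("INSERT INTO user_info VALUES (?, ?, ?)", etl_output)

# Preprocessing task: normalise the region and keep active users only
# (the filter itself is an assumed example of "preprocessing").
conn.execute("""
    INSERT INTO user_info_pre
    SELECT user_id, LOWER(TRIM(region)) FROM user_info WHERE status = 'active'
""")

# Index task: aggregate the user total per region.
conn.execute("""
    INSERT INTO user_total
    SELECT region, COUNT(*) FROM user_info_pre GROUP BY region
""")

totals = conn.execute(
    "SELECT region, total FROM user_total ORDER BY region"
).fetchall()
print(totals)
```

Each statement plays the role of one task, and each task reads only the table written by the previous one, which is what makes a data-driven trigger possible.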

The system further includes: a management portal;

the ADMA application algorithm unit is also used for synchronizing a user information spark table, a user information preprocessing spark table and a user total index spark table which are stored in the storage unit, and a user information data mapping task, a user information preprocessing task, a user total index task and a user information ETL task to the management portal;

and the management portal is used for displaying the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task in a classified mode.

The management portal is further used for performing lineage (blood relationship) analysis, task execution state monitoring and task re-execution on the user information spark table, the user information preprocessing spark table, the user total number index spark table, the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task.

The management portal is also used for supporting supplementary collection of the user information original data when the user information original data is not introduced in time.

Fig. 2 is a schematic structural diagram of a data statistical analysis system according to another embodiment of the present invention. As shown in Fig. 2, the system includes:

the system comprises an ADMA application algorithm unit, a management portal, a data acquisition unit and a storage unit;

the ADMA application algorithm unit is used for providing an ADMA application algorithm, and the ADMA application algorithm implementation comprises table modeling, algorithm modeling and task modeling;

The table modeling comprises an xml file of the table and an xml file of the summary table, wherein the xml file of the table corresponds to the table creation script used to generate the table, and the xml file of the summary table describes the table creation path and other information.
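As a rough sketch of how a table xml file and a summary table xml file could be turned into a table creation script, consider the following. The xml element and attribute names, the storage path and the generated DDL form are all assumptions made for illustration; the embodiment does not specify the file formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical table definition (one file per table in this sketch).
TABLE_XML = """
<table name="user_info">
  <column name="user_id" type="STRING"/>
  <column name="region" type="STRING"/>
  <column name="create_time" type="TIMESTAMP"/>
</table>
"""

# Hypothetical summary file that records where each table is created.
SUMMARY_XML = """
<tables>
  <entry table="user_info" path="/warehouse/user_info"/>
</tables>
"""

def build_create_table(table_xml: str, summary_xml: str) -> str:
    """Combine a table definition and its summary entry into one DDL statement."""
    table = ET.fromstring(table_xml.strip())
    summary = ET.fromstring(summary_xml.strip())
    name = table.get("name")
    # Look up the creation path of this table in the summary file.
    paths = {e.get("table"): e.get("path") for e in summary.findall("entry")}
    columns = ", ".join(
        f"{col.get('name')} {col.get('type')}" for col in table.findall("column")
    )
    return f"CREATE TABLE IF NOT EXISTS {name} ({columns}) LOCATION '{paths[name]}'"

ddl = build_create_table(TABLE_XML, SUMMARY_XML)
print(ddl)
```

Keeping the column list and the storage path in separate files mirrors the split described above between the table xml file and the summary table xml file.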

The algorithm modeling is implemented in sql. An algorithm .sql file, an algorithm .xml file and an algorithm .conf file need to be configured, and the three files are associated with each other to form an instantiated sql algorithm.
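One way to picture the association of the three files is template substitution: the .sql file holds a parameterised statement, the .xml file names the input and output tables, and the .conf file supplies parameter values. The sketch below follows that reading; the file contents, placeholder syntax and attribute names are hypothetical.

```python
import configparser
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical contents of the three algorithm files.
ALGO_SQL = (
    "INSERT INTO $output SELECT region, COUNT(*) FROM $input "
    "WHERE status = '$status' GROUP BY region"
)
ALGO_XML = '<algorithm name="user_total" input="user_info_pre" output="user_total"/>'
ALGO_CONF = "[params]\nstatus = active\n"

def instantiate(sql_text: str, xml_text: str, conf_text: str) -> str:
    """Fill the sql template with table names from the xml and values from the conf."""
    meta = ET.fromstring(xml_text)
    conf = configparser.ConfigParser()
    conf.read_string(conf_text)
    values = {"input": meta.get("input"), "output": meta.get("output")}
    values.update(conf["params"])
    # substitute() raises KeyError if any placeholder is left unfilled.
    return Template(sql_text).substitute(values)

sql = instantiate(ALGO_SQL, ALGO_XML, ALGO_CONF)
print(sql)
```

Using `substitute` rather than `safe_substitute` means a missing parameter fails at instantiation time instead of producing an invalid statement at run time.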

The task modeling comprises a task xml file and a virtual task xml file. Tasks are generated from these xml files and comprise data-driven tasks and scheduled (timed) tasks.
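The distinction between the two task kinds can be sketched as below. The task xml layout, the `trigger`, `cron` and `depends_on` attributes and the table names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Hypothetical task xml: one scheduled task and three data-driven tasks.
TASK_XML = """
<tasks>
  <task name="user_info_etl" trigger="timed" cron="0 2 * * *"/>
  <task name="user_info_mapping" trigger="data" depends_on="etl_output"/>
  <task name="user_info_preprocess" trigger="data" depends_on="user_info"/>
  <task name="user_total_index" trigger="data" depends_on="user_info_pre"/>
</tasks>
"""

@dataclass
class Task:
    name: str
    trigger: str                      # "timed" or "data"
    cron: Optional[str] = None        # only used by timed tasks
    depends_on: Optional[str] = None  # only used by data-driven tasks

def load_tasks(xml_text: str) -> list:
    """Instantiate Task objects from the task xml."""
    root = ET.fromstring(xml_text.strip())
    return [
        Task(t.get("name"), t.get("trigger"), t.get("cron"), t.get("depends_on"))
        for t in root.findall("task")
    ]

def ready_tasks(tasks: list, tables_with_new_data: set) -> list:
    """A data-driven task becomes ready once its input table has new data."""
    return [
        t.name
        for t in tasks
        if t.trigger == "data" and t.depends_on in tables_with_new_data
    ]

tasks = load_tasks(TASK_XML)
ready = ready_tasks(tasks, {"user_info"})
print(ready)
```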

Specifically, the ADMA application algorithm unit includes: the system comprises a table modeling module, an algorithm modeling module, a first task modeling module and a second task modeling module;

the table modeling module is used for creating a spark table;

the algorithm modeling module is used for providing an sql algorithm;

the first task modeling module is used for creating a spark task according to the sql algorithm;

and the second task modeling module is used for creating the ETL task.

The ETL tasks include: a user information ETL task.

Specifically, the table modeling module is configured to create the user information spark table, the user information preprocessing spark table and the user total number index spark table according to the xml file of the table and the xml file of the summary table.

Therefore, the ADMA application algorithm is standardized, and only the jar packages, configuration files, data table designs, algorithms and ETL rules need to be provided when an external project is integrated, which reduces the development cost. Here, the jar packages and configuration files form the bottom layer for instantiating and running tasks; the data table design holds the xml files of the tables and the xml file of the summary table; the algorithms hold the algorithm .xml files, algorithm shell scripts and the like; and the ETL rules hold the configuration files, scripts and the like of the ETL.

The ADMA application algorithm unit is further used for synchronizing, to the management portal, the user information spark table, the user information preprocessing spark table and the user total number index spark table stored in the storage unit, together with the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task. A table is synchronized to the portal after it is created; task synchronization means that all tasks are instantiated on a schedule every day and displayed on the management portal.

For example, the tasks are displayed on the management portal by category, such as site (e.g. Sichuan), format (e.g. vinsight), project (e.g. statistics server) and task type (e.g. data mapping).
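The classified display amounts to grouping task metadata by one of these category fields. A minimal sketch, in which the metadata records and field names are hypothetical:

```python
from collections import defaultdict

# Hypothetical task metadata as it might be synchronized to the portal.
TASKS = [
    {"name": "user_info_etl", "site": "Sichuan", "project": "statistics server", "type": "ETL"},
    {"name": "user_info_mapping", "site": "Sichuan", "project": "statistics server", "type": "data mapping"},
    {"name": "user_total_index", "site": "Sichuan", "project": "statistics server", "type": "index"},
]

def classify(tasks: list, key: str) -> dict:
    """Group task names by one category field, e.g. site or task type."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task[key]].append(task["name"])
    return dict(groups)

by_type = classify(TASKS, "type")
by_site = classify(TASKS, "site")
print(by_type)
print(by_site)
```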

The management portal is further used for performing lineage (blood relationship) analysis, task execution state monitoring and task re-execution on the user information spark table, the user information preprocessing spark table, the user total number index spark table, the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task. Lineage analysis means that the processing flow is displayed visually on the portal. For example, generating the user total number index spark table requires that the user information preprocessing spark table contains data; otherwise the task is not executed. This dependency is displayed visually on the portal. In addition, re-executing a task has no precondition, but if there is no data, the spark table is unchanged after the re-execution. The tasks that can be re-executed are the data mapping task, the data preprocessing task and the user total number index task, and the input of a re-executed task is not necessarily the original data.

For example, lineage analysis is performed by combining the instantiated tasks, and the relationships among the tasks, the original data and the spark tables are analyzed; the task execution state can also be monitored, such as whether a task has not been executed, has succeeded or has failed; and tasks can be re-executed, whether they have not been executed, have succeeded or have failed.
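The lineage relationships described above can be modeled as a small graph in which each task reads one input and writes one output. The sketch below assumes that shape; the task and table names are hypothetical, and a real lineage graph could have several inputs per task.

```python
# Hypothetical lineage: task -> (input it reads, output it writes).
LINEAGE = {
    "user_info_etl": ("raw_user_info", "etl_output"),
    "user_info_mapping": ("etl_output", "user_info"),
    "user_info_preprocess": ("user_info", "user_info_pre"),
    "user_total_index": ("user_info_pre", "user_total"),
}

def upstream(target: str) -> list:
    """Walk back from a table and list the (task, input) pairs that produce it."""
    producers = {out: (task, src) for task, (src, out) in LINEAGE.items()}
    chain = []
    while target in producers:
        task, src = producers[target]
        chain.append((task, src))
        target = src
    return chain

def can_run(task: str, row_counts: dict) -> bool:
    """A task is executed only if its input already contains data."""
    src, _ = LINEAGE[task]
    return row_counts.get(src, 0) > 0

chain = upstream("user_total")
print(chain)
print(can_run("user_total_index", {"user_info_pre": 0}))
```

Walking the chain from a result table back to the raw data is also how a failed result can be traced to the task that caused it.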

The management portal is also used for supporting supplementary collection of the user information original data when the user information original data is not introduced in time. For example, the user information original data is transmitted to the data acquisition unit at a scheduled time every day. If it cannot be transmitted because the network is disconnected or for another reason, then once the network recovers the original data can be transmitted to the data acquisition unit, and the late-arriving data is processed. Supplementary collection involves the ETL module as well as tasks such as data mapping.

For example, supplementary collection is supported for the abnormal situation in which the raw data is not introduced into the statistical analysis system in time.
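Supplementary collection can be thought of as finding the days for which no raw data arrived and re-running the pipeline for exactly those days. A minimal sketch under that assumption; the daily granularity, the function names and the `run_pipeline` callback are hypothetical:

```python
from datetime import date, timedelta

def missing_days(start: date, end: date, arrived: set) -> list:
    """Return the days in [start, end] for which no raw data has arrived."""
    span = (end - start).days + 1
    candidates = (start + timedelta(days=n) for n in range(span))
    return [day for day in candidates if day not in arrived]

def backfill(days: list, run_pipeline) -> list:
    """Re-run the ETL and downstream tasks once per late-arriving day."""
    return [run_pipeline(day) for day in days]

# Example: data for 2019-09-10 was delayed by a network outage.
arrived = {date(2019, 9, 9), date(2019, 9, 11)}
gaps = missing_days(date(2019, 9, 9), date(2019, 9, 11), arrived)
results = backfill(gaps, lambda day: f"processed {day.isoformat()}")
print(gaps, results)
```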

In addition, the management portal also supports analysis and management functions such as dynamic resource management, data quality and operation and maintenance KPI (key performance indicator) alarms, and realizes the data management capability of the big data platform.

The technical scheme provided by the embodiment of the invention provides a Spark and ADMA (address mapping association) based data statistical analysis system, realizes data management of a big data platform, provides a bottom-layer support platform for integrating other projects, provides different data statistical analysis indexes for the development of each project, and reduces the development and maintenance cost of the statistical analysis system.

Fig. 3 is a schematic flow chart of a data statistical analysis method according to an embodiment of the present invention. As shown in Fig. 3, the method includes:

step 301, an ADMA application algorithm unit creates a spark table, spark tasks and ETL tasks;

Wherein the creating a spark table comprises: creating the user information spark table, the user information preprocessing spark table and the user total number index spark table according to the xml file of the table and the xml file of the summary table;

creating the spark tasks comprises: instantiating the sql algorithm in sql according to the configured algorithm .sql file, algorithm .xml file and algorithm .conf file; and creating a user information data mapping task, a user information preprocessing task and a user total number index task by adopting the instantiated sql algorithm according to the task xml file and the virtual task xml file;

creating the ETL task comprises: creating a user information ETL task according to the ETL rule.

The method further includes:

introducing user information original data, and calling the user information ETL task to process the user information original data;

calling the user information data mapping task to map the processed data to the user information spark table, calling the user information preprocessing task to preprocess the processed data and write the preprocessed data to the user information preprocessing spark table, and calling the user total number index task to aggregate index data from the processed data and write the aggregated index data to the user total number index spark table; then saving the user information spark table, the user information preprocessing spark table and the user total number index spark table.

The method further includes:

displaying the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task by category.

The method further includes:

performing lineage (blood relationship) analysis, task execution state monitoring and task re-execution on the user information spark table, the user information preprocessing spark table, the user total number index spark table, the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task.

The method further includes:

supporting supplementary collection of the user information original data when the user information original data is not introduced in time.

Fig. 4 is a schematic flow chart of a data statistical analysis method according to another embodiment of the present invention. This embodiment is applied to the system shown in Fig. 2. As shown in Fig. 4, the method includes:

step 401, an ADMA application algorithm unit creates a spark table, spark tasks and ETL tasks;

wherein the spark tables comprise: a user information spark table, a user information preprocessing spark table and a user total number index spark table; the spark tasks comprise: a user information data mapping task, a user information preprocessing task and a user total number index task; the ETL tasks comprise: a user information ETL task;

specifically, the ADMA application algorithm includes jar packages, table designs, algorithms and the like. After the service is started, table modeling is performed to generate the user information spark table, the user information preprocessing spark table and the user total number index spark table; and task modeling is performed to generate the user information ETL task, the user information data mapping task, the user information preprocessing task and the user total number index task.

Step 402, a data acquisition unit introduces user information original data and outputs the user information original data to the ADMA application algorithm unit;

step 403, the ETL module of the ADMA application algorithm unit calls the user information ETL task to process the user information original data and outputs the processed data to the computing module;

specifically, the ETL module calls the user information ETL task instantiated in step 401 to process the user information original data; after data extraction, data accuracy verification and data conversion, non-conforming records are removed, conforming records are retained, and the retained records are output to the computing module;

step 404, the computing module calls the user information data mapping task to map the processed data to the user information spark table, calls the user information preprocessing task to preprocess the processed data and write the preprocessed data to the user information preprocessing spark table, and calls the user total number index task to aggregate index data from the processed data and write the aggregated index data to the user total number index spark table; then the user information spark table, the user information preprocessing spark table and the user total number index spark table are saved to the storage unit;

specifically, the computing module invokes the user information data mapping task instantiated in step 401 and maps the data into the user information spark table of the storage unit; the user information preprocessing task is then triggered and writes data into the user information preprocessing spark table; the user total number index task is then triggered and writes index data into the user total number index spark table, which is stored in file form in the storage unit, such as HDFS (Hadoop Distributed File System);

step 405, the ADMA application algorithm unit synchronizes the user information spark table, the user information preprocessing spark table, the user total number index spark table, the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task stored in the storage unit to the management portal;

step 406, the management portal displays the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task by category.

Specifically, the management portal shows the task execution time, the task execution mode, the original data used by each task (i.e., the lineage relationship), and the like. For example, the user information ETL task uses the scheduled execution mode, while the user information data mapping task, the user information preprocessing task and the user total number index task use the data-driven mode.
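The ETL processing of step 403 (extraction, accuracy verification, conversion, and removal of non-conforming records) can be sketched as follows. The comma-separated record layout, the three fields and the validation rules are assumptions chosen for illustration; the embodiment does not specify them.

```python
def etl(raw_lines: list) -> tuple:
    """Extract, verify and convert records; drop the ones that fail verification."""
    kept, rejected = [], []
    for line in raw_lines:
        # Extraction: split the raw line into fields.
        fields = [f.strip() for f in line.split(",")]
        # Verification: exactly three fields, non-empty user id, numeric age.
        if len(fields) != 3 or not fields[0] or not fields[2].isdigit():
            rejected.append(line)
            continue
        # Conversion: normalise the region and cast the age to an integer.
        user_id, region, age = fields
        kept.append({"user_id": user_id, "region": region.lower(), "age": int(age)})
    return kept, rejected

kept, rejected = etl(["u1, Sichuan, 30", "u2,,abc", ",Beijing,41", "u4,Beijing"])
print(kept)
print(rejected)
```

Returning the rejected lines alongside the retained records keeps the removed data available for inspection, which is a design choice of this sketch rather than something stated in the embodiment.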

Wherein, the method can also comprise:

the management portal performs lineage (blood relationship) analysis, task execution state monitoring and task re-execution on the user information spark table, the user information preprocessing spark table, the user total number index spark table, the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task;

specifically, lineage analysis is performed by combining the instantiated tasks to obtain the relationships among the tasks, the original data and the spark tables; the task execution state is monitored, such as whether a task has not been executed, has succeeded or has failed; and tasks can be re-executed, whether they have not been executed, have succeeded or have failed. For example, a user logs in to the big data platform management portal, checks the user total number index task, performs lineage analysis, finds the task whose execution failed, checks the reason for the failure and re-executes the task. The relationships between the tasks and the data (spark tables) are displayed automatically on the portal after the spark tables and tasks are instantiated. Here, lineage analysis means that after a task has a problem, the position of the problem can be located according to the relationships shown on the portal. Tasks can be re-executed without any precondition, whether or not their previous execution succeeded.
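State monitoring and unconditional re-execution can be sketched with a small monitor object. The class, the three state names and the way a failure reason is recorded are hypothetical; they only illustrate that a task in any state may be run again.

```python
from enum import Enum

class Status(Enum):
    NOT_EXECUTED = "not executed"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

class TaskMonitor:
    """Tracks the last known state of each task and the reason for any failure."""

    def __init__(self):
        self.status = {}
        self.errors = {}

    def run(self, name, action):
        """Run (or re-run) a task regardless of its current state."""
        try:
            action()
        except Exception as exc:
            self.status[name] = Status.FAILED
            self.errors[name] = str(exc)
        else:
            self.status[name] = Status.SUCCEEDED
            self.errors.pop(name, None)
        return self.status[name]

    def state(self, name):
        return self.status.get(name, Status.NOT_EXECUTED)

def failing_task():
    raise RuntimeError("input table is empty")

monitor = TaskMonitor()
first = monitor.run("user_total_index", failing_task)    # first attempt fails
reason = monitor.errors["user_total_index"]               # failure reason to inspect
second = monitor.run("user_total_index", lambda: None)   # re-execution succeeds
print(first, reason, second)
```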

In addition, supplementary data collection is supported for the abnormal situation in which the original data is not introduced into the statistical analysis system in time. Other fault analysis and resolution operations may also be performed through the management portal.

The technical scheme provided by the embodiment of the invention provides a data statistical analysis method based on Spark and ADMA. First, the ADMA application algorithm adopts a standardized version and unifies the data processing mechanism, which makes it convenient to integrate different statistical projects: code only needs to be developed under the directory corresponding to the ADMA application algorithm, and the ADMA service can generate scheduled tasks and data-driven tasks from the code under that directory. This resolves the fragmentation among projects and greatly reduces the development workload. Second, the big data platform management portal provides analysis and management functions such as unified metadata management, visual task monitoring, visual lineage analysis, dynamic resource management, data quality and operation and maintenance KPI (key performance indicator) alarms, realizes the data management capability of the big data platform, and reduces the development and maintenance cost of the statistical analysis system.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims

1. A system for statistical analysis of data, comprising: an ADMA application algorithm unit;

the ADMA application algorithm unit includes:

the table modeling module is used for creating a spark table;

the algorithm modeling module is used for providing an sql algorithm;

and the second task modeling module is used for creating the ETL task.

2. The system of claim 1,

the spark table comprises: a user information spark table, a user information preprocessing spark table and a user total number index spark table;

the ETL tasks include: user information ETL task.

3. The system of claim 2,

4. The system of claim 3,

the ETL rule comprises an xml file of the table, an xml file of the summary table, a configured algorithm .sql file, an algorithm .xml file, an algorithm .conf file, a task xml file and a virtual task xml file, wherein the ETL rule adopts standardized versions.

5. The system of claim 3, further comprising:

a data acquisition unit and a storage unit;

6. The system of claim 5,

7. The system of claim 5, further comprising: a management portal;

the ADMA application algorithm unit is also used for synchronizing a user information spark table, a user information preprocessing spark table and a user total number index spark table which are stored in the storage unit, and a user information data mapping task, a user information preprocessing task, a user total number index task and a user information ETL task to the management portal;

8. The system of claim 7,

the management portal is also used for performing lineage (blood relationship) analysis, task execution state monitoring and task re-execution on the user information spark table, the user information preprocessing spark table, the user total number index spark table, the user information data mapping task, the user information preprocessing task, the user total number index task and the user information ETL task.

9. The system of claim 7,

10. A method of statistical data analysis, comprising: