CN114936150A - Big data stream synchronization and monitoring test method, device and storage medium - Google Patents

Big data stream synchronization and monitoring test method, device and storage medium

Info

Publication number
CN114936150A
CN114936150A (application CN202210458474.5A)
Authority
CN
China
Prior art keywords
data
library
big data
consistency
hive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210458474.5A
Other languages
Chinese (zh)
Inventor
丛玉娟
陈勇
叶协彪
Current Assignee
Zhejiang Haohan Energy Technology Co ltd
Zhejiang Geely Holding Group Co Ltd
Original Assignee
Zhejiang Haohan Energy Technology Co ltd
Zhejiang Geely Holding Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Haohan Energy Technology Co ltd, Zhejiang Geely Holding Group Co Ltd filed Critical Zhejiang Haohan Energy Technology Co ltd
Priority to CN202210458474.5A
Publication of CN114936150A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/36 Preventing errors by testing or debugging software
    • G06F 11/3664 Environments for testing or debugging software
    • G06F 11/3668 Software testing
    • G06F 11/3672 Test management
    • G06F 11/3688 Test management for test execution, e.g. scheduling of test suites
    • G06F 11/3692 Test management for test results analysis
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a big data stream synchronization and monitoring test method, device and storage medium, wherein the method comprises the following steps: synchronizing the databases of a production environment to a test environment to complete initialization of the test environment databases, wherein the databases comprise a business MySQL library, a big data Hive library and an Elasticsearch library; in the testing process, executing consistency tests of the business MySQL library, the big data Hive library and the Elasticsearch library over the data circulation process based on preconfigured data rules, wherein the consistency tests comprise a field consistency check and data volume monitoring between the business MySQL library and the big data Hive library, a data consistency check between the business MySQL library and the Elasticsearch library, a data consistency check between the big data Hive library and the Elasticsearch library, and a field information check of the business MySQL library during testing; and generating a data quality report based on the consistency test results. Compared with the prior art, the method guarantees the consistency of data circulation during testing.

Description

Big data stream synchronization and monitoring test method, device and storage medium
Technical Field
The invention relates to the field of data monitoring, in particular to a method, a device and a storage medium for synchronizing and monitoring a big data stream.
Background
With the popularization of information interconnection and the rise of big data concepts, many enterprises have derived big data services from their original business services. Ensuring the correctness of business data, and the quality of the process by which business data circulates into big data streams, is a difficult problem in current data testing. During testing, production environment data must first be initialized into the test environment, and data quality must then be monitored as the data circulates through the test environment. Because business data flows in many directions, data quality verification is usually implemented by manually adding scripts, which is time-consuming, difficult and error-prone, and errors are hard to localize once made.
Therefore, how to guarantee the quality of test environment data and of data in circulation, while shortening the long test cycle and reducing its complexity, is a technical problem that currently needs to be solved.
Disclosure of Invention
The invention aims to provide a method, a device and a storage medium for synchronizing and monitoring a big data stream.
The purpose of the invention can be realized by the following technical scheme:
a big data stream synchronization and monitoring test method comprises the following steps:
synchronizing a database of a production environment to a test environment to complete initialization of a test environment database, wherein the database comprises a business mysql database, a big data hive database and an elastic search database;
in the testing process, consistency testing of the service mysql library, the big data hive library and the elastic search library in the data circulation process is executed based on pre-configured data rules, wherein the consistency testing comprises field consistency checking and data volume monitoring between the service mysql library and the big data hive library, data consistency checking between the service mysql library and the elastic search library, data consistency checking between the big data hive library and the elastic search library, and field information checking of the service mysql library in the testing process;
a data quality report is generated based on the conformance test result.
The field consistency check between the business MySQL library and the big data Hive library comprises the following steps:
acquiring the base tables of the business MySQL library, and scanning them to generate a first dictionary;
acquiring the base tables of the big data Hive library, and scanning them to generate a second dictionary;
and comparing the first dictionary with the second dictionary, and outputting the comparison result as the field consistency check result.
The data volume monitoring between the business MySQL library and the big data Hive library comprises the following steps:
calculating, based on an aggregation function, the total data volume and the newly added data volume of the business MySQL library and of the big data Hive library respectively;
and obtaining a data volume monitoring result based on the calculated total and newly added data volumes.
The data consistency check between the business MySQL library and the Elasticsearch library comprises the following steps:
acquiring the data items to be compared, and generating query conditions for the data items in the business MySQL library and in the Elasticsearch library respectively;
querying the results in the business MySQL library and in the Elasticsearch library based on the generated query conditions;
and comparing the total volume and single items based on the two query results, and outputting the comparison result as a first data consistency check.
The data consistency check between the big data Hive library and the Elasticsearch library comprises the following steps:
acquiring the data items to be compared, and generating query conditions for the data items in the big data Hive library and in the Elasticsearch library respectively;
obtaining the query results in the big data Hive library and in the Elasticsearch library based on the generated query conditions;
and comparing the total volume and single items based on the two query results, and outputting the comparison result as a second data consistency check.
The data items to be compared are selected randomly.
The method further comprises the following step:
sending the data quality report to a designated terminal.
Generating a data quality report based on the consistency test results comprises:
obtaining the consistency test results;
and generating a visualized data quality report from the consistency test result data based on a preconfigured report template.
A big data stream synchronization and monitoring test device comprises a memory, a processor and a program stored in the memory, wherein the processor executes the program to realize the method.
A storage medium having stored thereon a program which, when executed, implements the method as described above.
Compared with the prior art, the invention has the following beneficial effects:
1. The method completes initialization of the test environment from production environment data, and performs consistency tests based on preconfigured rules, comprising a field consistency check and data volume monitoring between the business MySQL library and the big data Hive library, a data consistency check between the business MySQL library and the Elasticsearch library, a data consistency check between the big data Hive library and the Elasticsearch library, and a field information check of the business MySQL library during testing, thereby guaranteeing the quality of data circulation in the testing process, improving the test effect and reducing the test cost.
2. The field consistency check between the business MySQL library and the big data Hive library is implemented through dictionaries, which improves the verification rate.
3. Data volume monitoring verifies both the total and the incremental volumes, which improves verification efficiency.
4. Verifying both the total volume and single query results through query conditions effectively guarantees consistency in the data circulation process.
Drawings
FIG. 1 is a schematic diagram of a system architecture in an embodiment of the present application;
FIG. 2 is a schematic diagram of a data relay service;
FIG. 3 is a schematic diagram of a data source configuration flow in an example;
FIG. 4 is a schematic diagram illustrating a data rule configuration in an example;
FIG. 5 is a schematic flow diagram of a quality reporting module;
FIG. 6 is a schematic flow chart of the operational data flow monitoring rules;
FIG. 7 is a schematic diagram of a timing task performing fault tolerance process;
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
As shown in fig. 1, the bottom layer is the data source connection, the second layer is the service layer, and the third layer is the user layer, which mainly carries the functions of the UI.
Data source connection is implemented through a data source configuration and connection module, and is the basis of everything else. Mysql, hive, es and redis are supported, as are extensions to other data sources. The bottom layer first implements the connection logic of each data source; this connection is the basis of both data synchronization and the data monitoring tests. As shown in fig. 3, when a data source is configured, whether it connects successfully must be tested: if the connection succeeds, its library tables can be added to the data source configuration; if the connection cannot be made, an error prompt is given. This guarantees the availability and correctness of the data source. The configuration of data sources underlies all operations, including data synchronization and the monitoring and testing of data. Specifically, the system sends a request to connect to the database and acquires the corresponding information, realizing the connection of the configured data source. A data source configuration should include: data source type, data source address, port, associated user name and password. This information is submitted to the connection module to test whether the connection succeeds, and a prompt is given on failure. Only after the connection test succeeds can the data source information be added to the configuration. Different data sources have different connection modes in python and are handled independently; if several connection modes exist, the stable one is chosen.
The business layer comprises data transfer service, data rule monitoring and testing, base table and field analysis, task configuration execution, result notification and data quality report.
To guarantee the safety of online data, the online environment and the test environment are generally isolated: the test environment cannot directly access the databases of the online environment, and the test system is generally built in the test environment. If data must be synchronized from the online environment, a dedicated data relay service is therefore required. As shown in fig. 2, in the data relay service, a set of web services is set up locally and connected to the VPN, so that the local service and the online environment sit in the same network segment. Once data synchronization is completed, the data initialization of the test environment is complete.
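The relay step itself — reading from the production side and writing into the test side — might look like the following sketch. The patent does not give this code; the cursors are assumed to follow the Python DB-API (fetchmany/executemany), and the batching is an illustrative choice to keep memory bounded during synchronization.

```python
# Hedged sketch of the data relay: copy one table from a production
# connection to a test connection in batches. Table and column names are
# supplied by the caller; SQL identifiers are interpolated for illustration.

def relay_table(src_cur, dst_cur, table, columns, batch=1000):
    """Copy `table` row by row from source cursor to destination cursor.
    Returns the number of rows copied."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))   # MySQL paramstyle
    src_cur.execute("SELECT %s FROM %s" % (col_list, table))
    copied = 0
    while True:
        rows = src_cur.fetchmany(batch)
        if not rows:
            break
        dst_cur.executemany(
            "INSERT INTO %s (%s) VALUES (%s)" % (table, col_list, placeholders),
            rows,
        )
        copied += len(rows)
    return copied
```

With the local web service on the VPN, `src_cur` would come from the online environment's connection and `dst_cur` from the test environment's, completing the initialization the text describes.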
The data rule monitoring test, as shown in fig. 4, is based on rule configuration: in the testing process, consistency tests of the business MySQL library, the big data Hive library and the Elasticsearch library over the data circulation process are executed based on preconfigured data rules, comprising a field consistency check and data volume monitoring between the business MySQL library and the big data Hive library, a data consistency check between the business MySQL library and the Elasticsearch library, a data consistency check between the big data Hive library and the Elasticsearch library, and a field information check of the business MySQL library during testing.
The monitoring test between the business MySQL library and the big data Hive library takes the form of field consistency monitoring and data volume monitoring; field consistency monitoring relies on the base table and field analysis module. The libraries and tables are obtained through show databases and show tables, and a table structure is obtained through desc <table name>; the field names, types and field counts in the table structure serve as the comparison basis during data stream circulation.
Suppose the data rule monitoring includes a field consistency comparison covering 1 mysql library and 1 hive library. The base tables to be monitored must be preconfigured in the rules. The implementation is then: acquire the libraries to be monitored in the business MySQL instance and the tables to be monitored within each library; obtain each table structure through desc <table name>, process the data, and store it in dictionary format keyed by library, table and field, for example {"database1": {"table1": ["column1", "column2", "column3"]}, "database2": {"table2": ["column1", "column2", "column3"]}}. All base tables are scanned into this dictionary format, and the table structures of the monitored tables in hive are obtained in the same way. The two dictionaries are then split and compared by key, and the comparisons are summarized by key. Since a monitored table may fail to appear in the hive library during circulation from mysql to hive, this abnormal case must also be handled by the program.
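The dictionary comparison step can be sketched as a pure function over the two {library: {table: [fields]}} dictionaries. This is an illustrative reconstruction, not the patent's code; it reports both the field-mismatch case and the missing-table case mentioned above.

```python
# Sketch of the field consistency check: compare the mysql-side dictionary
# against the hive-side dictionary key by key and collect the differences.

def compare_schemas(mysql_dict, hive_dict):
    """Each argument has the shape {library: {table: [field, ...]}}.
    Returns a list of (library, table, reason) tuples; empty means consistent."""
    diffs = []
    for db, tables in mysql_dict.items():
        hive_tables = hive_dict.get(db, {})
        for table, fields in tables.items():
            if table not in hive_tables:
                # table never circulated into hive: the abnormal case
                diffs.append((db, table, "missing in hive"))
            elif list(fields) != list(hive_tables[table]):
                diffs.append((db, table, "field mismatch"))
    return diffs
```

The summarized diff list is exactly what the rule's comparison result would feed into the quality report.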
Data volume monitoring counts the total data volume and the newly added data volume through the aggregation function count. The total volume is calculated up to a given time node, i.e. the volume of data less than or equal to that time node; the newly added volume is the volume of data between two time nodes. For the hive statistics, the latest partition should be selected.
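Under those definitions, the two count queries and the comparison can be sketched as below. The column name `dt` and the tolerance parameter are illustrative assumptions, not from the patent; the SQL text is the count-aggregation form the text describes.

```python
# Hypothetical sketch of the data volume rule: build the count(*) queries for
# total and incremental volume, then compare the mysql and hive counts.

def total_sql(table, node):
    """Total volume: rows at or before the given time node (column dt assumed)."""
    return "SELECT count(*) FROM %s WHERE dt <= '%s'" % (table, node)

def increment_sql(table, start, end):
    """Newly added volume: rows between two time nodes."""
    return "SELECT count(*) FROM %s WHERE dt > '%s' AND dt <= '%s'" % (table, start, end)

def volumes_match(mysql_count, hive_count, tolerance=0):
    """Compare the two counts; a nonzero tolerance allows for sync lag."""
    return abs(mysql_count - hive_count) <= tolerance
```

On the hive side the same count would be run against the latest partition, as noted above.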
The data consistency monitoring test between the business MySQL library and the Elasticsearch library is mainly counted through conditional queries, and the test between the big data Hive library and the Elasticsearch library likewise performs aggregation queries under conditions. In addition to aggregating the data volume with count, a data attribute check is performed on individual records: only by combining the accuracy check of single records with the comparison of data volumes under the whole condition can data consistency be well guaranteed. In this embodiment the data items to be compared are selected randomly, but in other embodiments a traversal may also be adopted.
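The single-record check can be sketched as follows. This is an assumed shape, not the patent's code: `fetch_mysql` and `fetch_es` stand for lookup callables that run the generated query condition against each store and return the record as a dict.

```python
import random

# Sketch of the random single-item consistency check: draw one data item,
# fetch it from both stores, and compare attribute by attribute.

def check_random_item(keys, fetch_mysql, fetch_es, rng=random):
    """Returns (chosen_key, mismatched_fields); an empty list means the
    randomly sampled record is consistent across the two stores."""
    key = rng.choice(keys)
    left, right = fetch_mysql(key), fetch_es(key)
    mismatched = [f for f in left if left[f] != right.get(f)]
    return key, mismatched
```

Replacing `rng.choice` with a loop over all keys gives the traversal variant mentioned above.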
In addition, for querying the Elasticsearch data, the present embodiment is implemented by the following script:
A connection to Elasticsearch is established through the python library elasticsearch; the http_auth argument may be omitted if there is no authentication information. After connecting, data can be queried through a query:
source_es.search(index=index, body=query, scroll='20m', request_timeout=20, size=size)
index is the index to be queried, and the query content is written in body, such as {"query": {"match": {"dt": "20211128"}}}.
After the query, the count statistics can be obtained from ['hits']['total'] in the result, and the content of single records from ['hits']['hits'].
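Extracting those two pieces from the response can be wrapped in a small helper. This is a sketch assuming the response shape of the elasticsearch-py 7.x client, where `['hits']['total']` may be either a plain number or a {"value": n, ...} object depending on version.

```python
# Helper for the result handling described above: pull the document count and
# the individual documents out of a search response dictionary.

def parse_search_result(result):
    """Returns (total_count, list_of_source_documents)."""
    total = result["hits"]["total"]
    if isinstance(total, dict):          # 7.x shape: {"value": n, "relation": ...}
        total = total["value"]
    docs = [hit["_source"] for hit in result["hits"]["hits"]]
    return total, docs
```

The returned total feeds the volume comparison against mysql or hive, and the documents feed the single-record attribute check.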
The test monitoring of data generated by the business during testing targets null values and abnormal values of configured fields. The fields to be monitored are added to the rules; for example, if the name field in some table may not be null, then once the monitoring rule is configured it must be added to a task.
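A null-value rule of that kind can be sketched as below. The dict-per-row modeling is an illustrative assumption; in the real system the rows would come from the configured mysql query.

```python
# Hypothetical sketch of a field information rule: count rows whose monitored
# field is null or empty, and mark the rule as failed when any are found.

def run_null_rule(rows, field):
    """`rows` is an iterable of dicts; returns a small result record."""
    bad = [r for r in rows if r.get(field) in (None, "")]
    return {"field": field, "violations": len(bad), "passed": not bad}
```

The result record is the kind of per-rule outcome that the task later aggregates into the quality report.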
A task can be given a schedule, executing the rules configured in it at regular times.
The data quality report summarizes and analyzes the results of the data monitoring tests, for example counting how many base tables were monitored and how many of them were abnormal or inconsistent; the summarized data is loaded into echarts and displayed as charts, generating the data quality report. The report is then sent to a designated group, @-mentioning the relevant personnel, to complete the result notification.
Figure 5 shows a quality reporting module.
The quality report module collects and counts the rule execution results, then calls the echarts component to generate visual charts, and integrates several charts into a quality report. The charts matter because, beyond seeing the quality report at a glance, abnormal data points can also be localized from it: in a field check rule, for example, a difference in quantity can be clearly analyzed, and the differing field of a particular table in a particular library can be pinpointed, so the problem can be corrected promptly and conveniently.
Fig. 6 is a flow of setting data flow rules to quality report transmission.
Data rules are configured and executed: if execution succeeds, the rule is added to the timed task; if it fails, the rule configuration is checked for problems.
Only after its configuration succeeds can a rule be added to the timed task.
After the timed task executes, the summarized results are turned into a quality report and sent to the designated group, so that the relevant personnel obtain a report on data stream quality in time.
FIG. 7 shows the fault tolerance processing of timed task execution. A task usually includes the execution of several rules; if one rule fails, the execution of the whole task is not affected: the failed rule undergoes fault tolerance processing and is presented in the quality report as a fault exception, and the execution exception can be traced through the logs, making the problem convenient to localize quickly.
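That fault tolerance can be sketched as a wrapper around the per-rule execution. This is an illustrative shape, not the patent's code: each rule is modeled as a zero-argument callable, and a failing rule is recorded as an error entry while the remaining rules still run.

```python
# Sketch of fault-tolerant task execution: one rule's failure does not abort
# the task; the exception is captured into the results for the quality report.

def run_task(rules):
    """`rules` maps rule name -> zero-argument callable returning a result."""
    results = {}
    for name, rule in rules.items():
        try:
            results[name] = {"status": "ok", "result": rule()}
        except Exception as exc:          # record the fault exception, continue
            results[name] = {"status": "error", "error": str(exc)}
    return results
```

The error entries are what surface in the report as fault exceptions, with the full traceback left to the logs for localization.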
The method and the device build big data synchronization and data flow circulation monitoring on web services. The data relay service completes the synchronization of data from the production environment to the test or staging environment, finishing the data initialization work. Consistency, integrity and correctness of the data in the circulation process are then verified by configuring connections to multiple data sources and setting verification rules, and finally the monitoring test results are fed back to the designated group. A tester only needs to set monitoring rules according to the business to verify data quality rapidly, which greatly improves test efficiency and shortens problem troubleshooting time.
The above functions, if implemented in the form of software functional units and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.

Claims (10)

1. A big data stream synchronization and monitoring test method, characterized by comprising the following steps:
synchronizing the databases of a production environment to a test environment to complete initialization of the test environment databases, wherein the databases comprise a business MySQL library, a big data Hive library and an Elasticsearch library;
in the testing process, executing consistency tests of the business MySQL library, the big data Hive library and the Elasticsearch library over the data circulation process based on preconfigured data rules, wherein the consistency tests comprise a field consistency check and data volume monitoring between the business MySQL library and the big data Hive library, a data consistency check between the business MySQL library and the Elasticsearch library, a data consistency check between the big data Hive library and the Elasticsearch library, and a field information check of the business MySQL library during testing; and
generating a data quality report based on the consistency test results.
2. The big data stream synchronization and monitoring test method according to claim 1, characterized in that the field consistency check between the business MySQL library and the big data Hive library comprises:
acquiring the base tables of the business MySQL library, and scanning them to generate a first dictionary;
acquiring the base tables of the big data Hive library, and scanning them to generate a second dictionary;
and comparing the first dictionary with the second dictionary, and outputting the comparison result as the field consistency check result.
3. The big data stream synchronization and monitoring test method according to claim 1, characterized in that the data volume monitoring between the business MySQL library and the big data Hive library comprises:
calculating, based on an aggregation function, the total data volume and the newly added data volume of the business MySQL library and of the big data Hive library respectively;
and obtaining a data volume monitoring result based on the calculated total and newly added data volumes.
4. The big data stream synchronization and monitoring test method according to claim 1, characterized in that the data consistency check between the business MySQL library and the Elasticsearch library comprises:
acquiring the data items to be compared, and generating query conditions for the data items in the business MySQL library and in the Elasticsearch library respectively;
obtaining the query results of the query conditions in the business MySQL library and in the Elasticsearch library respectively;
and comparing the total volume and single items based on the two query results, and outputting the comparison result as a first data consistency check.
5. The big data stream synchronization and monitoring test method according to claim 1, characterized in that the data consistency check between the big data Hive library and the Elasticsearch library comprises:
acquiring the data items to be compared, and generating query conditions for the data items in the big data Hive library and in the Elasticsearch library respectively;
obtaining the query results of the query conditions in the big data Hive library and in the Elasticsearch library respectively;
and comparing the total volume and single items based on the two query results, and outputting the comparison result as a second data consistency check.
6. The big data stream synchronization and monitoring test method according to claim 4 or 5, characterized in that the data items to be compared are selected randomly.
7. The big data stream synchronization and monitoring test method according to claim 1, characterized by further comprising:
sending the data quality report to a designated terminal.
8. The big data stream synchronization and monitoring test method according to claim 1, characterized in that generating a data quality report based on the consistency test results comprises:
obtaining the consistency test results;
and generating a visualized data quality report from the consistency test result data based on a preconfigured report template.
9. A big data stream synchronization and monitoring test device comprising a memory, a processor, and a program stored in the memory, wherein the processor executes the program to implement the method of any of claims 1-8.
10. A storage medium having a program stored thereon, wherein the program, when executed, implements the method of any of claims 1-8.
CN202210458474.5A 2022-04-27 2022-04-27 Big data stream synchronization and monitoring test method, device and storage medium Pending CN114936150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210458474.5A CN114936150A (en) 2022-04-27 2022-04-27 Big data stream synchronization and monitoring test method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210458474.5A CN114936150A (en) 2022-04-27 2022-04-27 Big data stream synchronization and monitoring test method, device and storage medium

Publications (1)

Publication Number Publication Date
CN114936150A (en) 2022-08-23

Family

ID=82862995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210458474.5A Pending CN114936150A (en) 2022-04-27 2022-04-27 Big data stream synchronization and monitoring test method, device and storage medium

Country Status (1)

Country Link
CN (1) CN114936150A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115639805A (en) * 2022-12-24 2023-01-24 中国航空工业集团公司西安飞机设计研究所 Airborne PHM system state monitoring function test method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination