CN114185883A

CN114185883A - Data monitoring method, device and equipment of data warehouse and storage medium

Info

Publication number: CN114185883A
Application number: CN202111550614.3A
Authority: CN
Inventors: 王堃; 刘莹浓; 周伟; 范子龙
Original assignee: Ping An Medical and Healthcare Management Co Ltd
Current assignee: Ping An Medical and Healthcare Management Co Ltd
Priority date: 2021-12-17
Filing date: 2021-12-17
Publication date: 2022-03-15

Abstract

The application relates to the technical field of big data, and discloses a data monitoring method, a device, equipment and a storage medium of a data warehouse, wherein the method comprises the following steps: responding to the data synchronization request, and updating the data volume statistical table according to the number of the incremental data pieces of the source table to be stored and the total data pieces of the source table to be stored, which are respectively corresponding to each data table to be uploaded in the target service system; responding to the data synchronization request, updating a data warehouse according to each data table to be uploaded, and updating a data volume statistical table and a data abnormal value check table according to the data updated to the data warehouse; judging the abnormality of the data volume according to the target time range and the data volume statistical table to obtain a data volume monitoring judgment result; and judging the abnormal value of the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result. The data volume and the abnormal value of the data synchronized to the data warehouse are automatically monitored in real time, and the timeliness of data monitoring is improved.

Description

Data monitoring method, device and equipment of data warehouse and storage medium

Technical Field

The present application relates to the field of big data technologies, and in particular, to a data monitoring method, apparatus, device, and storage medium for a data warehouse.

Background

At present, data monitoring systems of data warehouses of large data platforms of all projects are lack, the number of data synchronized every day and the size of data volume are not monitored, the number of data synchronized per day accords with standards and is not known, and the number of tables of data synchronized per day and the number of tasks performed integrally are not counted, so that the resource use condition of each project cannot be scientifically evaluated, and the data synchronization condition is in a fuzzy state.

The traditional data statistics method is to access a Hive (data warehouse tool) table, and Count the incremental data amount, the full data amount and the dirty data amount of the current day through a Count function (Count). However, both Hive SQL (SQL is a structured query language) and Spark SQL (Spark is a fast and general computing engine designed for large-scale data processing) have a slow speed of the counting function, because Hive SQL is computed based on MapReduce (data mining), and Spark SQL triggers an Action (which triggers the running of Spark jobs and actually triggers the computation of a conversion operator) operation. Each table is subjected to counting operation, so that the execution of thousands of counting function statements is slow and the timeliness of data monitoring is low.

Disclosure of Invention

The application mainly aims to provide a data monitoring method, a data monitoring device, data monitoring equipment and a storage medium of a data warehouse, and aims to solve the technical problems that in the prior art, a counting function is adopted to count the increment data volume, the full data volume and the dirty data volume of the day to monitor the abnormity of the data warehouse by accessing a Hive table, the speed of the counting function is low, and the timeliness of data monitoring is low.

In order to achieve the above object, the present application provides a data monitoring method for a data warehouse, the method including:

acquiring a data synchronization request of a target service system;

responding to the data synchronization request, and updating a data quantity statistical table according to the number of the incremental data pieces of the source table to be stored and the total data pieces of the source table to be stored, which are respectively corresponding to each data table to be uploaded in the target service system;

responding to the data synchronization request, updating a data warehouse according to each data table to be uploaded, and updating the data volume statistical table and the data abnormal value check table according to the data updated to the data warehouse;

acquiring an abnormality judgment request, wherein the abnormality judgment request carries a target time range;

performing data quantity abnormity judgment according to the target time range and the data quantity statistical table to obtain a data quantity monitoring judgment result;

and judging the abnormal value of the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result.

Further, the step of updating the data volume statistical table according to the number of the incremental data pieces of the source table to be stored and the number of the full data pieces of the source table to be stored, which correspond to each data table to be uploaded in the target service system, includes:

acquiring the latest historical synchronous data time corresponding to the target service system, and determining a target screening condition according to the latest historical synchronous data time and the generation time of the data synchronous request;

acquiring the number of incremental data corresponding to each data table to be uploaded as the number of incremental data of the source table to be stored according to the target screening condition;

acquiring the number of the total data corresponding to each data table to be uploaded as the number of the source table total data to be stored according to the generation time;

and adding the table name of the data table to be uploaded, the number of the incremental data of the source table to be stored and the number of the full data of the source table to be stored into the data volume statistical table as associated data.

Further, the step of updating the data volume statistics table and the data outlier check table according to the data updated to the data warehouse includes:

updating the data quantity statistical table according to Hive metadata corresponding to the data tables to be uploaded in the data warehouse;

and updating the data abnormal value check table according to a data cache layer base table, a data pasting layer base table and a data detail layer base table which correspond to each to-be-uploaded data table in the data warehouse.

Further, the step of updating the data volume statistics table according to the Hive metadata corresponding to each to-be-uploaded data table in the data warehouse includes:

obtaining table names corresponding to the data tables to be uploaded from Hive metadata in the data warehouse to obtain a table name set to be processed;

acquiring a table name in the table name set to be processed as a table name to be processed;

acquiring a table name of a data cache layer base to be stored, the number of incremental data of the data cache layer base table to be stored, the name of the data source layer base to be stored, the total number of data of the data source layer base table to be stored, the total number of data of the detailed layer base table of the data to be stored and the number of dirty data to be stored from a large data platform management data table according to the table name to be processed;

and updating the data quantity statistical table by taking the name of the data cache layer base table to be stored, the number of incremental data of the data cache layer base table to be stored, the name of the data pasting layer base table to be stored, the total number of data of the data detail layer base table to be stored and the number of dirty data to be stored as associated data.

Further, the step of updating the data abnormal value check table according to the data cache tier base table, the data pasting tier base table and the data detail tier base table corresponding to each to-be-uploaded data table in the data warehouse includes:

obtaining table names corresponding to the data tables to be uploaded from Hive metadata in the data warehouse to obtain a table name set to be analyzed;

acquiring any table name in the table name set to be analyzed as a table name to be analyzed;

searching a service primary key from the newly added data of the data cache layer base table corresponding to each data table to be uploaded on the same day according to the name of the table to be analyzed to obtain a target service primary key;

when the target service key is a null character string or a null object, determining that the key abnormity judgment result is abnormal, otherwise, determining that the key abnormity judgment result is normal;

searching for creation time from the newly added data of the data cache layer base table corresponding to each data table to be uploaded on the same day according to the name of the table to be analyzed to obtain first creation time;

when the first creation time is an empty character string or an empty object, determining that a time abnormity judgment result is abnormal, otherwise, determining that the time abnormity judgment result is normal;

judging whether the name of the table to be analyzed has a service primary key in the data pasting layer base table corresponding to each data table to be uploaded to obtain a service primary key searching result;

when the service main key searching result exists, searching for creating time from the newly added data of the data source layer base table corresponding to each data table to be uploaded on the same day according to the name of the table to be analyzed to obtain second creating time;

when the first creation time is the same as the second creation time, determining that the main key repetition judgment result is not repeated, otherwise, determining that the main key repetition judgment result is repeated;

and updating the data abnormal value check table according to the name of the table to be analyzed, the main key abnormal judgment result, the time abnormal judgment result and the main key repeated judgment result.

Further, the step of performing an abnormal judgment on the data volume according to the target time range and the data volume statistical table to obtain a data volume monitoring judgment result includes:

acquiring each associated data to be analyzed from the data volume statistical table according to the target time range;

taking any one of the associated data to be analyzed as target associated data;

when the number of the source table increment data of the target associated data is different from the number of the data cache layer base table increment data, determining that an increment judgment result corresponding to the target associated data is abnormal, otherwise, determining that the increment judgment result is normal;

when the sum of the total data number and the dirty data number of the data detail layer base table of the target associated data is different from the total data number of the data source layer base table of the target associated data, determining that the total judgment result corresponding to the target associated data is abnormal, otherwise, determining that the total judgment result is normal;

generating an incremental trend graph according to the number of the incremental data of the data cache layer base table of each piece of the associated data to be analyzed;

calculating the slope of the incremental trend graph to obtain a target slope;

when the target slope is out of the obtained preset slope range, determining that an increment trend judgment result is abnormal, otherwise, determining that the increment trend judgment result is normal;

and when the increment judgment result, the full judgment result and the increment trend judgment result are normal, determining that the data quantity judgment result of the data quantity monitoring judgment result is normal, otherwise, determining that the data quantity judgment result is abnormal, and determining the reason of the data quantity abnormality of the data quantity monitoring judgment result according to the abnormal increment judgment result, the abnormal full judgment result and the abnormal increment trend judgment result.

Further, the step of judging the abnormal value of the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result includes:

acquiring each to-be-processed associated data from the data abnormal value check table according to the target time range;

taking any one of the associated data to be processed as associated data to be judged;

and when the main key abnormity judgment result, the time abnormity judgment result and the main key repeated abnormity judgment result in the associated data to be judged are normal, determining that the data monitoring judgment result is normal, otherwise, determining that the data monitoring judgment result is abnormal.

The present application further provides a data monitoring device of a data warehouse, the device includes:

the first request acquisition module is used for acquiring a data synchronization request of a target service system;

the first updating module is used for responding to the data synchronization request and updating the data quantity statistical table according to the number of the incremental data pieces of the source table to be stored and the total data piece of the source table to be stored, which are respectively corresponding to each data table to be uploaded in the target service system;

the second updating module is used for responding to the data synchronization request, updating a data warehouse according to each data table to be uploaded, and updating the data volume statistical table and the data abnormal value checking table according to data updated to the data warehouse;

the second request acquisition module is used for acquiring an abnormity judgment request, and the abnormity judgment request carries a target time range;

the data volume monitoring judgment result determining module is used for carrying out abnormal judgment on the data volume according to the target time range and the data volume statistical table to obtain a data volume monitoring judgment result;

and the data monitoring judgment result determining module is used for judging the abnormal value of the data according to the target time range and the data abnormal value checking table to obtain a data monitoring judgment result.

The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.

The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.

The data monitoring method, the device, the equipment and the storage medium of the data warehouse are characterized in that firstly, by responding to the data synchronization request, updating a data quantity statistical table according to the number of the incremental data of the source table to be stored and the total data of the source table to be stored, which are respectively corresponding to each data table to be uploaded in the target service system, then updating a data warehouse according to each data table to be uploaded by responding to the data synchronization request, updating the data volume statistical table and the data abnormal value check table according to the data updated into the data warehouse, finally performing data volume abnormal judgment according to the target time range and the data volume statistical table to obtain a data volume monitoring judgment result, and judging the abnormal value of the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result. Therefore, the data volume and the abnormal value of the data synchronized to the data warehouse are automatically monitored in real time, the situation that the Hive table is accessed and the incremental data volume, the full data volume and the dirty data volume of the same day are counted by adopting the counting function so as to monitor the abnormal data of the data warehouse is avoided, and the timeliness of data monitoring is improved.

Drawings

Fig. 1 is a schematic flowchart of a data monitoring method for a data warehouse according to an embodiment of the present application;

fig. 2 is a schematic block diagram of a data monitoring apparatus of a data warehouse according to an embodiment of the present application;

fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.

The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Referring to fig. 1, an embodiment of the present application provides a data monitoring method for a data warehouse, where the method includes:

s1: acquiring a data synchronization request of a target service system;

s2: responding to the data synchronization request, and updating a data quantity statistical table according to the number of the incremental data pieces of the source table to be stored and the total data pieces of the source table to be stored, which are respectively corresponding to each data table to be uploaded in the target service system;

s3: responding to the data synchronization request, updating a data warehouse according to each data table to be uploaded, and updating the data volume statistical table and the data abnormal value check table according to the data updated to the data warehouse;

s4: acquiring an abnormality judgment request, wherein the abnormality judgment request carries a target time range;

s5: performing data quantity abnormity judgment according to the target time range and the data quantity statistical table to obtain a data quantity monitoring judgment result;

s6: and judging the abnormal value of the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result.

In this embodiment, first, a data volume statistical table is updated according to the number of incremental data pieces of a source table to be stored and the number of full data pieces of the source table to be stored, which correspond to each data table to be uploaded in the target service system, by responding to the data synchronization request, a data warehouse is updated according to each data table to be uploaded, the data volume statistical table and a data abnormal value check table are updated according to data updated to the data warehouse, finally, abnormal judgment of the data volume is performed according to the target time range and the data volume statistical table, a data volume monitoring judgment result is obtained, and abnormal value judgment of the data is performed according to the target time range and the data abnormal value check table, so that a data monitoring judgment result is obtained. Therefore, the data volume and the abnormal value of the data synchronized to the data warehouse are automatically monitored in real time, the situation that the Hive table is accessed and the incremental data volume, the full data volume and the dirty data volume of the same day are counted by adopting the counting function so as to monitor the abnormal data of the data warehouse is avoided, and the timeliness of data monitoring is improved.

For S1, the data synchronization request may be sent by the target service system according to a preset condition, or may be generated by the big data platform according to a preset condition. For example, the preset condition is a preset time interval, which is not limited in this example.

The data synchronization request is a request for synchronizing the data of each data table in the database of the target business system to the data warehouse of the large data platform.

The target business system is an application system for providing business services.

It will be appreciated that the target business system has a database storing data, the database including a plurality of data tables.

For S2, responding to the data synchronization request, accessing the database of the target service system, taking each data table in the database of the target service system as a data table to be uploaded, acquiring the number of incremental data of each data table to be uploaded (i.e., the number of incremental data of the source table) from the database of the target service system, taking each acquired incremental data number as the number of incremental data of a source table to be stored, acquiring the number of full data of each data table to be uploaded (i.e., the number of full data of the source table) from the database of the target service system, and taking each acquired full data number as the number of full data of the source table to be stored; and updating the number of the incremental data of the source table to be stored and the number of the full data of the source table to be stored into corresponding fields of the data volume statistical table.

The data volume statistical table comprises: the method comprises the following steps of self-increment key, source table name, source table increment data number, source table full data number, data cache layer base table name, data cache layer base table increment data number, data pasting layer base table name, data pasting layer base table full data number, data detail layer base table full data number, dirty data number, subsystem code, source base name, increment data number difference and full data number difference. Wherein, the self-adding key is a main key, and the field name of the self-adding key is id; the source table name refers to the table name of a data table in the service system; the number of the incremental data of the source table is the number of the incremental data of the source table (the data table in the service system); the total data number of the source table is the total data number of the source table (the data table in the service system); the database table name of the data cache layer is the name of the database table of the data cache layer; the number of the incremental data of the data cache layer base table is the number of the incremental data of the data cache layer base table; the name of the data pasting layer base table is the name of the data pasting layer base table; the total data number of the data pasting source layer is the total data number of the data pasting source layer base table; the total number of the data in the data detail layer base table is the total number of the data in the data detail layer base table; the dirty data number is the number of data which do not meet the data table specification, and the data can be obtained from a large data platform management data table of the data warehouse; subsystem code, which is a service system identifier; the source library name is a service system identifier; the difference value of the number of the incremental data is the difference value of the number of the incremental data of the source table and the number of the incremental data of the data cache layer base table; the difference value of the total data number is the difference value of the total data number of the source tables and the total data number of the data pasting layer base tables.

The data of the data warehouse of the big data platform is all sourced from a business system (such as a medicine purchasing system of a pharmacy). Each table of the business system establishes a table in a data cache layer (STG) library, a data pasting layer (ODS) library and a data detail layer (DWD) library of the big data platform respectively, and the naming rule of the 3 tables is source table name + library name. The data caching layer base table is mainly used for recording data which is changed or newly added every day in the service system. The data pasting layer base table mainly collects the data of the data caching layer base table every day and eliminates the duplication. And the data detail layer library table filters some irregular data on the basis of the data pasting layer library table. Since all the 3 library tables filter some data in the data processing process, in order to prevent the data loss in the process, the situation of data loss is found, so that the number of data in the 3 library tables is recorded for judging whether the data is lost by comparing the difference between the library tables.

All the information of the data table established on the big data platform is recorded in the big data platform management data table. Table (recording table name and table id), partition table (recording table id and partition id, partition name), and partition _ params table (recording partition id and number of partition data) in a database for managing metadata in the open source big data platform; when the data volume of the data cache layer base table is required to be inquired, only the table name id in the tables is found according to the name of the data cache layer base table, and as the partition names of the data cache layer base table are named according to the date, the partition id is found in the partitions table according to the date and the table name id, and then the data number (namely the incremental data number) of the partition is found according to the partition id; when the number of data pieces of the data pasting source layer base table needs to be inquired, only the table name id in the tables is found according to the name of the data pasting source layer base table, and as the data pasting source layer base table only needs to inquire the total number of data, all the partition id can be found out in a partions table according to the table name id of the data pasting source layer base table, the number of the data pieces is searched according to the partition id, and the number of the data pieces corresponding to all the partition id is accumulated, so that the total number of data pieces (namely, the number of full-quantity data pieces) of the data pasting source layer base table is obtained; when the total number of data in the detail layer library table needs to be queried, only the table name id in the tables is found according to the table name of the detail layer library table, and because the detail layer library table only needs to query the total number of data, all the partition ids can be found out in the partitions table according to the table name id of the detail layer library table, then the number of data is searched according to the partition id, and then the number of data corresponding to all the partition ids is accumulated, so that the total number of data (namely, the number of full-amount data) in the detail layer library table is obtained.

The number of the incremental data of the source tables and the number of the total data of the source tables are obtained by accessing a database of a service system, and the name of the table of the data cache layer table, the number of the incremental data of the table of the data cache layer table, the name of the table of the data pasting layer table, the number of the total data of the table of the data pasting layer table and the number of the total data of the table of the data detail layer table are obtained through Hive metadata of a data warehouse.

Hive metadata is description data. The Hive metadata includes: data synchronization time, table name, number of incremental data pieces, and number of full data pieces.

For S3, in response to the data synchronization request, updating the data in each to-be-uploaded data table to a data warehouse, and updating the data volume statistics table and the data abnormal value check table respectively according to the data updated to the data warehouse this time.

The data outlier look-up table includes: the method comprises the steps of adding keys, storing the database table name of a data cache layer, judging whether a main key is abnormal or not, judging whether time is abnormal or not and judging whether the main key is repeated or not. Whether the primary key is abnormal or not refers to whether the service primary key in the data cache layer base table is empty or not; whether the time is abnormal or not refers to whether the data creation time in the data cache layer base table is empty or not; whether the primary key is repeated or not, and whether the service primary key exists in the data pasting layer base table or not is consistent with the creation time of the service primary key in the data caching layer base table or not.

The service primary key is a primary key formed by service logic relations of the self table of the service system.

For S4, the big data platform generates an abnormality determination request according to the preset time, which may also be an abnormality determination request sent by a third-party application system.

The abnormality judgment request is a request for performing abnormality judgment on data in the data warehouse.

The target time range includes: is the time period between the start time and the end time.

For S5, the data amount monitoring determination result includes: the data amount judgment result and the reason for the data amount abnormality. The data volume judgment result has 2 values, one of which is abnormal, and the other is normal. When the data volume judgment result is abnormal, the reason of the abnormal data volume comprises data; when the data amount judgment result is normal, the reason for the abnormal data amount does not include data.

And when the data of the data volume statistical table within the target time range meets all the first evaluation indexes in the first evaluation index set, determining that the data volume judgment result of the data volume monitoring judgment result is normal, otherwise, determining that the data volume judgment result of the data volume monitoring judgment result is abnormal.

The first evaluation index set includes: one or more evaluation indices.

For S6, a second evaluation index set is used, and abnormal value judgment of data is performed according to the data in the target time range of the creation time in the data abnormal value check table, and when the data in the target time range of the creation time in the data abnormal value check table satisfies all second evaluation indexes in the second evaluation index set, the data monitoring judgment result is determined to be normal, otherwise, the data monitoring judgment result is determined to be abnormal.

In an embodiment, the step of updating the data volume statistical table according to the number of the incremental data pieces of the source table to be stored and the number of the full data pieces of the source table to be stored, which correspond to each data table to be uploaded in the target service system, includes:

s21: acquiring the latest historical synchronous data time corresponding to the target service system, and determining a target screening condition according to the latest historical synchronous data time and the generation time of the data synchronous request;

s22: acquiring the number of incremental data corresponding to each data table to be uploaded as the number of incremental data of the source table to be stored according to the target screening condition;

s23: acquiring the number of the total data corresponding to each data table to be uploaded as the number of the source table total data to be stored according to the generation time;

s24: and adding the table name of the data table to be uploaded, the number of the incremental data of the source table to be stored and the number of the full data of the source table to be stored into the data volume statistical table as associated data.

According to the embodiment, firstly, a target screening condition is determined according to the latest historical data synchronization time and the generation time of the data synchronization request at this time, then the number of incremental data of the source table to be stored is obtained according to the target screening condition, the number of the full data of the source table to be stored is obtained according to the generation time of the data synchronization request at this time, and finally the table name of the data table to be uploaded, the number of the incremental data of the source table to be stored and the number of the full data of the source table to be stored are added into the data quantity statistical table as associated data, so that the data quantity synchronized to a data warehouse by the target service system each time is added into the data quantity statistical table, and a basis is provided for the subsequent abnormal judgment of the data quantity.

For S21, the historical latest synchronization data time corresponding to the target business system may be obtained from the database. The historical latest synchronization data time is the time when the target business system last synchronizes data to the data warehouse.

And generating a screening condition according to the latest historical synchronous data time and the generation time of the data synchronous request by adopting a preset screening condition, and taking the generated screening condition as a target screening condition.

Optionally, the preset filtering condition is that a time period between the historical latest synchronization data time and the generation time of the data synchronization request is taken as a statistical time period, where the historical latest synchronization data time is not included, and the generation time of the data synchronization request is included.

For S22, acquiring, by using the target screening condition, the number of incremental data (that is, the number of data added in the data table) corresponding to each to-be-uploaded data table as the number of incremental data of the source table to be stored. That is to say, each to-be-uploaded data table corresponds to one number of the to-be-stored source table increment data.

For example, when the preset filtering condition is that a time period between the historical latest synchronization data time and the generation time of the data synchronization request is taken as a statistical time period, the number of incremental data of each to-be-uploaded data table in the time period is obtained as the number of incremental data of the to-be-stored source table, which is not specifically limited in this example.

For S23, acquiring the total number of pieces of data of each to-be-uploaded data table at the generation time, and taking the acquired total number of pieces of data (i.e. the total number of pieces of data in the data table) as the total number of pieces of data of the source table to be stored. That is, each to-be-uploaded data table corresponds to one full data number of the to-be-stored source tables.

For S24, the table name of the data table to be uploaded, the number of the incremental data of the source table to be stored, and the number of the full data of the source table to be stored are used as associated data, and the associated data is added to the data volume statistical table, so that the associated data is used as a record in the data volume statistical table.

Optionally, the generation time of the data synchronization request is used as the data synchronization time; and adding the data synchronization time, the table name of the data table to be uploaded, the number of the incremental data of the source table to be stored and the number of the full data of the source table to be stored into the data volume statistical table as associated data.

Inserting the table name of the data table to be uploaded into the source table name in the data volume statistical table, inserting the number of the incremental data of the source table to be stored into the number of the incremental data of the source table in the data volume statistical table, and inserting the number of the total data of the source table to be stored into the number of the total data of the source table in the data volume statistical table.

In one embodiment, the step of updating the data volume statistics table and the data outlier check table according to the data updated to the data warehouse includes:

s31: updating the data quantity statistical table according to Hive metadata corresponding to the data tables to be uploaded in the data warehouse;

s32: and updating the data abnormal value check table according to a data cache layer base table, a data pasting layer base table and a data detail layer base table which correspond to each to-be-uploaded data table in the data warehouse.

In this embodiment, the data volume statistical table is updated by the Hive metadata in the data warehouse, and the data abnormal value check table is updated by the data cache tier table, the data pasting tier table and the data detail tier table in the data warehouse, so that a basis is provided for subsequently performing abnormal judgment on the data volume, and a basis is provided for subsequently performing abnormal value judgment on the data.

For S31, updating the data cache layer library table name, the number of incremental data pieces of the data cache layer library table, the data posting layer library table name, the number of full data pieces of the data posting layer library table, the number of full data pieces of the data detail layer library table, and the number of dirty data pieces of the data volume statistics table according to the hive metadata corresponding to each to-be-uploaded data table in the data warehouse.

For step S32, according to the data cache layer table name, the data posting layer table, and the data detail layer table corresponding to each to-be-uploaded data table in the data warehouse, the data cache layer table name, whether the primary key is abnormal, whether the time is abnormal, whether the primary key is repeated, the data synchronization time, the data creation time, and the data update time of the data abnormal value check table are updated.

In an embodiment, the step of updating the data volume statistics table according to the Hive metadata corresponding to each to-be-uploaded data table in the data warehouse includes:

s311: obtaining table names corresponding to the data tables to be uploaded from Hive metadata in the data warehouse to obtain a table name set to be processed;

s312: acquiring a table name in the table name set to be processed as a table name to be processed;

s313: acquiring a table name of a data cache layer base to be stored, the number of incremental data of the data cache layer base table to be stored, the name of the data source layer base to be stored, the total number of data of the data source layer base table to be stored, the total number of data of the detailed layer base table of the data to be stored and the number of dirty data to be stored from a large data platform management data table according to the table name to be processed;

s314: and updating the data quantity statistical table by taking the name of the data cache layer base table to be stored, the number of incremental data of the data cache layer base table to be stored, the name of the data pasting layer base table to be stored, the total number of data of the data detail layer base table to be stored and the number of dirty data to be stored as associated data.

In this embodiment, first, a table name corresponding to each data table to be uploaded is obtained from Hive metadata in the data warehouse, and then, data is obtained from a big data platform management data table according to the table name to update the data volume statistical table, so that a basis is provided for subsequent abnormal judgment of the data volume.

For step S311, the generation time of the data synchronization request is searched in each data synchronization time in the Hive metadata of the data warehouse, and each table name (table name corresponding to each to-be-uploaded data table) corresponding to the data synchronization time searched in the Hive metadata is used as a to-be-processed table name set.

For step S312, one table name in the table name set to be processed is sequentially acquired as the table name to be processed.

For S313, the name of the table to be processed is looked up in the big data platform management data table as the name of the table of the data cache layer to be stored, the number of the incremental data pieces of the table of the data cache layer is looked up as the number of the incremental data pieces of the table of the data cache layer to be stored, the name of the table of the data source layer is looked up as the name of the table of the data source layer to be stored, the number of the full quantity data pieces of the table of the data source layer is looked up as the number of the full quantity data pieces of the table of the data source layer to be stored, the number of the full quantity data pieces of the table of the data detail layer is looked up as the number of the full quantity data pieces of the table of the data detail layer to be stored, and the number of the dirty data pieces is looked up as the number of the stored dirty data pieces.

For step S314, the name of the to-be-stored data cache layer library table, the number of incremental data of the to-be-stored data cache layer library table, the name of the to-be-stored data posting layer library table, the total number of data of the to-be-stored data detail layer library table, and the number of dirty data of the to-be-stored data detail layer library table are used as associated data, and the associated data is used as a record to update each field in the data volume statistical table.

Optionally, the data quantity statistical table is updated by using the data synchronization time (that is, the generation time of the data synchronization request), the name of the to-be-stored data cache layer base table, the number of incremental data of the to-be-stored data cache layer base table, the name of the to-be-stored data posting layer base table, the number of the to-be-stored full data of the data detail layer base table, and the number of the to-be-stored dirty data as associated data.

In an embodiment, the step of updating the data abnormal value check table according to the data cache tier table, the data posting source tier table, and the data detail tier table corresponding to each to-be-uploaded data table in the data warehouse includes:

s321: obtaining table names corresponding to the data tables to be uploaded from Hive metadata in the data warehouse to obtain a table name set to be analyzed;

s322: acquiring any table name in the table name set to be analyzed as a table name to be analyzed;

s323: searching a service primary key from the newly added data of the data cache layer base table corresponding to each data table to be uploaded on the same day according to the name of the table to be analyzed to obtain a target service primary key;

s324: when the target service key is a null character string or a null object, determining that the key abnormity judgment result is abnormal, otherwise, determining that the key abnormity judgment result is normal;

s325: searching for creation time from the newly added data of the data cache layer base table corresponding to each data table to be uploaded on the same day according to the name of the table to be analyzed to obtain first creation time;

s326: when the first creation time is an empty character string or an empty object, determining that a time abnormity judgment result is abnormal, otherwise, determining that the time abnormity judgment result is normal;

s327: judging whether the name of the table to be analyzed has a service primary key in the data pasting layer base table corresponding to each data table to be uploaded to obtain a service primary key searching result;

s328: when the service main key searching result exists, searching for creating time from the newly added data of the data source layer base table corresponding to each data table to be uploaded on the same day according to the name of the table to be analyzed to obtain second creating time;

s329: when the first creation time is the same as the second creation time, determining that the main key repetition judgment result is not repeated, otherwise, determining that the main key repetition judgment result is repeated;

s3210: and updating the data abnormal value check table according to the name of the table to be analyzed, the main key abnormal judgment result, the time abnormal judgment result and the main key repeated judgment result.

According to the embodiment, the data cache layer table name, the data pasting layer table and the data detail layer table corresponding to each to-be-uploaded data table in the data warehouse are updated, the data cache layer table name, the main key whether abnormal, the time whether abnormal, the main key repetition, the data synchronization time, the data creation time and the data updating time of the data abnormal value check table are updated, and a foundation is provided for subsequent abnormal value judgment of data.

For S321, the generation time of the data synchronization request is searched in each data synchronization time in the Hive metadata of the data warehouse, and each table name (table name corresponding to each to-be-uploaded data table) corresponding to the data synchronization time searched in the Hive metadata is used as a table name set to be analyzed.

For step S322, one table name in the table name set to be analyzed is sequentially acquired as the table name to be analyzed.

For S323, the name of the table to be analyzed is searched for a service primary key in the newly added data of the data cache layer base table corresponding to each data table to be uploaded on the current day, and the searched service primary key is used as a target service primary key.

For S324, when the target service primary key is a null character string or a null object, it means that the service primary key is abnormal, and therefore it may be determined that the primary key abnormality determination result is abnormal; when the target service key is not a null character string and is not a null object, it means that there is no anomaly in the service key, and thus it can be determined that the result of the determination of the anomaly in the key is normal.

For step S325, the table name to be analyzed is searched for creation time in the newly added data of the data cache layer table corresponding to each data table to be uploaded on the same day, and the searched time is used as a first creation time.

The creation time is the creation time in the data warehouse.

For S326, when the first creation time is an empty character string or an empty object, it means that the creation time update is abnormal, and therefore it may be determined that the time abnormality determination result is abnormal; when the first creation time is not an empty character string and is not an empty object, it means that there is no abnormality in the creation time update, and thus it can be determined that the time abnormality determination result is normal.

For S327, the table name to be analyzed is searched for a service primary key in the data source layer base table corresponding to each to-be-uploaded data table, and when the service primary key is found, it is determined that a service primary key search result is present, and when the service primary key is not found, it is determined that the service primary key search result is absent.

For step S328, when the service primary key search result is present, the table name to be analyzed is searched for creation time in the newly added data of the data source layer base table corresponding to each to-be-uploaded data table on the same day, and the searched creation time is used as second creation time.

For S329, when the first creation time and the second creation time are the same, determining that the primary key repetition determination result is not repeated; and when the first creation time is different from the second creation time, determining that the main key repetition judgment result is repetition.

For S3210, the table name to be analyzed, the primary key abnormality determination result, the time abnormality determination result, and the primary key repetition determination result are used as associated data, and the data abnormal value check table is updated with the associated data.

In an embodiment, the step of performing an abnormal judgment on the data volume according to the target time range and the data volume statistical table to obtain a data volume monitoring judgment result includes:

s51: acquiring each associated data to be analyzed from the data volume statistical table according to the target time range;

s52: taking any one of the associated data to be analyzed as target associated data;

s53: when the number of the source table increment data of the target associated data is different from the number of the data cache layer base table increment data, determining that an increment judgment result corresponding to the target associated data is abnormal, otherwise, determining that the increment judgment result is normal;

s54: when the sum of the total data number and the dirty data number of the data detail layer base table of the target associated data is different from the total data number of the data source layer base table of the target associated data, determining that the total judgment result corresponding to the target associated data is abnormal, otherwise, determining that the total judgment result is normal;

s55: generating an incremental trend graph according to the number of the incremental data of the data cache layer base table of each piece of the associated data to be analyzed;

s56: calculating the slope of the incremental trend graph to obtain a target slope;

s57: when the target slope is out of the obtained preset slope range, determining that an increment trend judgment result is abnormal, otherwise, determining that the increment trend judgment result is normal;

s58: and when the increment judgment result, the full judgment result and the increment trend judgment result are normal, determining that the data quantity judgment result of the data quantity monitoring judgment result is normal, otherwise, determining that the data quantity judgment result is abnormal, and determining the reason of the data quantity abnormality of the data quantity monitoring judgment result according to the abnormal increment judgment result, the abnormal full judgment result and the abnormal increment trend judgment result.

According to the embodiment, the increment judgment result, the full judgment result and the increment trend judgment result are used as the evaluation indexes to judge the data quantity abnormity, so that the data quantity of the data synchronized to the data warehouse is automatically monitored, the situation that the Hive table is accessed and the increment data quantity, the full data quantity and the dirty data quantity of the day are counted by adopting the counting function to monitor the abnormity of the data warehouse is avoided, and the timeliness of data monitoring is improved.

For S51, each related data in the target time range is obtained from the data amount statistical table, and each obtained related data is taken as one related data to be analyzed.

For S52, any one of the pieces of associated data to be analyzed is taken as target associated data.

For S53, when the number of the source table increment data of the target associated data is different from the number of the data cache layer base table increment data, determining that the increment judgment result corresponding to the target associated data is abnormal; and when the number of the source table increment data of the target associated data is the same as that of the data cache layer base table increment data, determining that the increment judgment result is normal.

For S54, when the sum of the total data number and the dirty data number of the data detail hierarchical library table of the target associated data is different from the total data number of the data source hierarchical library table of the target associated data, determining that the total judgment result corresponding to the target associated data is abnormal; and when the sum of the total data number and the dirty data number of the data detail layer base table of the target associated data is the same as the total data number of the data pasting layer base table of the target associated data, determining that the total judgment result is normal.

For S55, when generating the incremental trend graph, taking the number of the incremental data of the data cache layer library table of each piece of the associated data to be analyzed as a y-axis, and taking the generation time of the number of the incremental data of each data cache layer library table as an x-axis.

For S56, performing slope calculation corresponding to the number of incremental data of each data cache layer library table on the incremental trend graph, and taking each calculated slope as a target slope.

For S57, when the target slope is out of the obtained preset slope range, determining that the incremental trend judgment result is abnormal; and when the target slope is within the acquired preset slope range, determining that the incremental trend judgment result is normal.

For S58, the first set of evaluation metrics includes: an increment judgment result, a full quantity judgment result and an increment trend judgment result.

When the increment judgment result, the full judgment result and the increment trend judgment result are all normal, the data of which the updating time is within the target time range in the data volume statistical table meets all first evaluation indexes in a first evaluation index set, so that the data volume judgment result of the data volume monitoring judgment result is determined to be normal; when any one of the increment judgment result, the full-scale judgment result and the increment trend judgment result is abnormal, it means that the data of which the update time is within the target time range in the data volume statistical table does not satisfy all first evaluation indexes in a first evaluation index set, and thus it is determined that the data volume judgment result is abnormal; and when the data quantity judgment result is abnormal, taking the abnormal increment judgment result, the abnormal full quantity judgment result and the abnormal increment trend judgment result as the data quantity abnormal reasons of the data quantity monitoring judgment result. Thereby being beneficial to quickly finding out the cause of the abnormity.

In an embodiment, the step of determining an abnormal value of the data according to the target time range and the data abnormal value check table to obtain a data monitoring determination result includes:

s61: acquiring each to-be-processed associated data from the data abnormal value check table according to the target time range;

s62: taking any one of the associated data to be processed as associated data to be judged;

s63: and when the main key abnormity judgment result, the time abnormity judgment result and the main key repeated abnormity judgment result in the associated data to be judged are normal, determining that the data monitoring judgment result is normal, otherwise, determining that the data monitoring judgment result is abnormal.

The abnormal value judgment of the data is carried out by taking the main key abnormal judgment result, the time abnormal judgment result and the main key repeated abnormal judgment result as the evaluation indexes, so that the automatic monitoring of the abnormal value of the data synchronized to the data warehouse is realized, and the data monitoring timeliness is improved.

For S61, each related data in the target time range is obtained from the data outlier check table, and each obtained related data is taken as one to-be-processed related data.

For S62, any one of the pieces of to-be-processed associated data is taken as the to-be-determined associated data.

For step S63, when the primary key abnormality determination result, the time abnormality determination result, and the primary key repeated abnormality determination result in the associated data to be determined are all normal, determining that the data monitoring determination result is normal; and when any one of the main key abnormity judgment result, the time abnormity judgment result and the main key repeated abnormity judgment result in the associated data to be judged is abnormal, determining that the data monitoring judgment result is abnormal.

Referring to fig. 2, the present application further proposes a data monitoring apparatus of a data warehouse, the apparatus including:

a first request obtaining module 100, configured to obtain a data synchronization request of a target service system;

a first updating module 200, configured to respond to the data synchronization request, and update a data quantity statistics table according to the number of incremental data pieces of the source table to be stored and the number of full data pieces of the source table to be stored, which correspond to each data table to be uploaded in the target service system;

a second updating module 300, configured to respond to the data synchronization request, update a data warehouse according to each to-be-uploaded data table, and update the data volume statistics table and the data abnormal value check table according to data updated in the data warehouse;

a second request obtaining module 400, configured to obtain an exception judgment request, where the exception judgment request carries a target time range;

a data amount monitoring and judging result determining module 500, configured to perform abnormal judgment on the data amount according to the target time range and the data amount statistical table, so as to obtain a data amount monitoring and judging result;

and the data monitoring judgment result determining module 600 is configured to perform abnormal value judgment on the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result.

Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the computer designed processor is used to provide computational and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as data monitoring methods of a data warehouse. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of data monitoring of a data warehouse. The data monitoring method of the data warehouse comprises the following steps: acquiring a data synchronization request of a target service system; responding to the data synchronization request, and updating a data quantity statistical table according to the number of the incremental data pieces of the source table to be stored and the total data pieces of the source table to be stored, which are respectively corresponding to each data table to be uploaded in the target service system; responding to the data synchronization request, updating a data warehouse according to each data table to be uploaded, and updating the data volume statistical table and the data abnormal value check table according to the data updated to the data warehouse; acquiring an abnormality judgment request, wherein the abnormality judgment request carries a target time range; performing data quantity abnormity judgment according to the target time range and the data quantity statistical table to obtain a data quantity monitoring judgment result; and judging the abnormal value of the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result.

An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a data monitoring method for a data warehouse, including the steps of: acquiring a data synchronization request of a target service system; responding to the data synchronization request, and updating a data quantity statistical table according to the number of the incremental data pieces of the source table to be stored and the total data pieces of the source table to be stored, which are respectively corresponding to each data table to be uploaded in the target service system; responding to the data synchronization request, updating a data warehouse according to each data table to be uploaded, and updating the data volume statistical table and the data abnormal value check table according to the data updated to the data warehouse; acquiring an abnormality judgment request, wherein the abnormality judgment request carries a target time range; performing data quantity abnormity judgment according to the target time range and the data quantity statistical table to obtain a data quantity monitoring judgment result; and judging the abnormal value of the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result.

The data monitoring method of the executed data warehouse comprises the steps of firstly responding to the data synchronization request, updating a data quantity statistical table according to the number of incremental data pieces of the source tables to be stored and the number of full data pieces of the source tables to be stored, which are respectively corresponding to each data table to be uploaded in the target service system, then responding to the data synchronization request, updating the data warehouse according to each data table to be uploaded, updating the data quantity statistical table and a data abnormal value check table according to data updated in the data warehouse, finally performing abnormal judgment on the data quantity according to the target time range and the data abnormal value check table to obtain a data quantity monitoring judgment result, and performing abnormal value judgment on the data according to the target time range and the data abnormal value check table to obtain a data monitoring judgment result. Therefore, the data volume and the abnormal value of the data synchronized to the data warehouse are automatically monitored in real time, the situation that the Hive table is accessed and the incremental data volume, the full data volume and the dirty data volume of the same day are counted by adopting the counting function so as to monitor the abnormal data of the data warehouse is avoided, and the timeliness of data monitoring is improved.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double-rate SDRAM (SSRSDRAM), Enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and bus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims

1. A method of data monitoring of a data warehouse, the method comprising:

acquiring a data synchronization request of a target service system;

2. The data monitoring method of the data warehouse according to claim 1, wherein the step of updating the data volume statistical table according to the number of the incremental data items of the source table to be stored and the number of the full data items of the source table to be stored, which correspond to each data table to be uploaded in the target service system, includes:

3. The method for data monitoring of a data warehouse of claim 1, wherein the step of updating the data volume statistics and data outlier check tables based on the data updated to the data warehouse comprises:

4. The data monitoring method of the data warehouse according to claim 3, wherein the step of updating the data volume statistical table according to Hive metadata corresponding to each data table to be uploaded in the data warehouse comprises:

5. The data monitoring method of the data warehouse according to claim 3, wherein the step of updating the data abnormal value check table according to the data cache tier base table, the data pasting layer base table and the data detail tier base table corresponding to each to-be-uploaded data table in the data warehouse comprises:

6. The data monitoring method of the data warehouse according to claim 1, wherein the step of performing the abnormal judgment of the data volume according to the target time range and the data volume statistical table to obtain the data volume monitoring judgment result comprises:

taking any one of the associated data to be analyzed as target associated data;

calculating the slope of the incremental trend graph to obtain a target slope;

7. The method for monitoring data in a data warehouse according to claim 1, wherein the step of determining abnormal values of data according to the target time range and the data abnormal value check table to obtain a data monitoring determination result comprises:

8. A data monitoring apparatus for a data warehouse, the apparatus comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.