CN117271553A

CN117271553A - Method for generating and operating supervision report data quality rule

Info

Publication number: CN117271553A
Application number: CN202311165758.6A
Authority: CN
Inventors: 铁锦程; 李虎; 陈嘉; 王尧; 朱建兵; 何婷
Original assignee: Shanghai Pudong Development Bank Co Ltd
Current assignee: Shanghai Pudong Development Bank Co Ltd
Priority date: 2023-09-08
Filing date: 2023-09-08
Publication date: 2023-12-22

Abstract

The invention relates to a method for generating and running supervision report data quality rules, comprising the following steps: S1, acquiring supervision rule sample data and preprocessing the sample data; S2, constructing a neural network model and training it with the preprocessed sample data to obtain a rule classification model; S3, preprocessing the rule data to be processed, inputting it into the rule classification model, and outputting a classification result; S4, based on the classification result, looking up the corresponding SQL template data in a preset database and generating rule script information; S5, repeating steps S3-S4 until rule script information has been generated for all the rule data to be processed; and S6, running all the generated rule script information in batches and transmitting the results to the supervision reporting system. Compared with the prior art, the method generates and runs data quality rules automatically, efficiently and accurately, thereby improving data reporting quality and ensuring service stability.

Description

Method for generating and operating supervision report data quality rule

Technical Field

The invention relates to the technical field of computer data processing, in particular to a method for generating and operating a supervision report data quality rule.

Background

With the rapid development of the information age, financial enterprises handle more business types and larger business volumes, and the data quality requirements that supervision departments place on the supervision reporting of financial enterprises keep rising. At present, in most financial enterprises, technicians verify and configure the supervision report data quality rules. To ensure the quality of the reported data, some enterprises assign two people to write quality rule code by hand and cross-verify the rules; this requires each person to first understand the semantics of the supervision rules in depth and then write verification code according to personal understanding. Other enterprises match supervision rules one at a time with regular expressions, or use reinforcement learning for automatic analysis in similar business scenarios, to ensure the quality of the reported data.

However, the existing manual and regular expression matching methods have the following disadvantages:

(1) Manual combing is time-consuming: the volume of supervision rules is large, and combing through them by hand is slow and error-prone.

(2) Manual combing is costly: the supervision rules are numerous and complex, so the labor cost for technicians to comb through the different rules and write the SQL (Structured Query Language) code is high; moreover, technical ability varies from person to person, so the quality of manually combed results cannot be guaranteed.

(3) Poor adaptability: some enterprises currently classify rules in similar business scenarios by regular expression matching, but regular expressions come in many variants with complex syntax and are prone to multi-level nesting, so new supervision rules cannot be classified well automatically.

In short, the prior art struggles to generate and run data quality rules automatically, efficiently and accurately, which harms data reporting quality and service stability.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a method for generating and operating the supervision report data quality rule, which can efficiently and accurately automatically generate and operate the data quality rule, thereby improving the data report quality and ensuring the service stability.

The aim of the invention can be achieved by the following technical scheme: a method for generating and operating a supervision report data quality rule comprises the following steps:

S1, acquiring supervision rule sample data and preprocessing the sample data;

S2, constructing a neural network model, and training the neural network with the preprocessed sample data to obtain a rule classification model;

S3, preprocessing the rule data to be processed, inputting it into the rule classification model, and outputting a classification result;

S4, based on the classification result, looking up the corresponding SQL template data in a preset database and generating rule script information;

S5, repeating steps S3-S4 until rule script information has been generated for all the rule data to be processed;

S6, running all the generated rule script information in batches, and transmitting the results to the supervision reporting system.

Further, the specific process of step S1 is as follows: acquiring supervision rule sample data, and performing illegal character preprocessing on the supervision rule sample data; and then carrying out word segmentation on the sample data, and simultaneously carrying out stop word processing on the word segmentation result.

Further, in step S1, a regular expression is used to perform the illegal character preprocessing on the supervision rule sample data, and word segmentation is performed either with a word segmentation method based on dictionary and word-bank matching or with the jieba word segmentation tool.

Further, the neural network model constructed in the step S2 includes an input layer, a hidden layer and an output layer, wherein the input of the input layer is a feature vector obtained by converting the preprocessed sample data, and the number of nodes of the input layer is greater than or equal to the dimension of the feature vector; each node in the output layer represents a classification result.

Further, the specific process of training the neural network in step S2 is as follows:

initializing neural network parameters;

converting the preprocessed sample data into feature vectors, transmitting the feature vectors to an input layer, and transmitting corresponding known classification labels, namely a given target, to an ideal output unit;

acquiring node output of an output layer and a hidden layer;

calculating the deviation between the target value and the actual output, and the resulting error, so as to update the weights;

judging whether the current neural network model meets the preset precision requirement, if so, stopping training to obtain a rule classification model, and otherwise, returning to continue training.
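The training steps above can be sketched as a small feed-forward network trained by backpropagation. This is a minimal illustration only: the sigmoid hidden layer, the softmax output, the learning rate, the layer sizes and the use of training accuracy as the "preset precision requirement" are all assumptions, since the text does not fix them, and the names `train`, `predict` and `forward` are hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, model):
    W1, b1, W2, b2 = model
    # Hidden layer: sigmoid units. Output layer: softmax (both are assumptions).
    h = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
         for row, b in zip(W1, b1)]
    o = softmax([sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)])
    return h, o

def predict(x, model):
    _, o = forward(x, model)
    return max(range(len(o)), key=lambda k: o[k])

def train(samples, labels, n_classes, n_hidden=8, lr=0.5,
          target_accuracy=1.0, max_epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Initialize the network parameters with small random weights.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_classes)]
    b1, b2 = [0.0] * n_hidden, [0.0] * n_classes
    model = (W1, b1, W2, b2)
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        for x, y in zip(samples, labels):
            # Feed the feature vector forward; y is the known label (the target).
            h, o = forward(x, model)
            # Deviation between the actual output and the one-hot target.
            d_out = [o[k] - (1.0 if k == y else 0.0) for k in range(n_classes)]
            # Error propagated back to the hidden layer (computed before W2 changes).
            d_hid = [h[j] * (1.0 - h[j]) * sum(d_out[k] * W2[k][j] for k in range(n_classes))
                     for j in range(n_hidden)]
            # Weight updates by gradient descent.
            for k in range(n_classes):
                b2[k] -= lr * d_out[k]
                for j in range(n_hidden):
                    W2[k][j] -= lr * d_out[k] * h[j]
            for j in range(n_hidden):
                b1[j] -= lr * d_hid[j]
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid[j] * x[i]
        # Stop once the assumed precision requirement is met, otherwise keep training.
        accuracy = sum(predict(x, model) == y for x, y in zip(samples, labels)) / len(samples)
        if accuracy >= target_accuracy:
            break
    return model, epoch

# Toy feature vectors standing in for three rule classes.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [0, 1, 2]
model, epochs = train(X, y, n_classes=3)
```

In practice the feature vectors would come from the preprocessed rule text rather than from the toy vectors used here.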

Further, the database preset in step S4 stores SQL template mapping information corresponding to different supervision rules, specifically, key-value mapping.

Further, the step S4 specifically includes the following steps:

S41, taking the classification result as a key, and querying the associated mapping data in the database through a database connection pool to obtain the corresponding SQL template data;

S42, automatically filling the dynamic parameters into the SQL template data to generate the complete SQL verification rule script information.

Further, in step S6, all the generated rule script information is submitted to the big data cluster for calculation, so as to obtain the result status information of each rule operation, and when all the result status information is "pass", the current corresponding data information is transmitted to the supervision reporting system.

Further, the big data cluster comprises a distributed computing unit and a distributed storage unit, wherein the distributed computing unit is used for performing distributed computation on the received rule script information; the distributed storage unit is used for storing metadata information.

Further, the step S6 specifically includes the following steps:

S61, storing each piece of generated rule script information in the queue of a thread pool as a single task;

S62, submitting the tasks in the queue to a big data cluster in batches through interface calls for distributed calculation, and obtaining the result data of each rule run through an interface callback after the big data cluster finishes the calculation;

S63, comparing the result data with preset expected data, wherein a rule passes if the two are consistent and fails if they are inconsistent, and uploading the corresponding table data information to the supervision reporting system if all rules pass.
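Steps S61 to S63 can be sketched with Python's standard thread pool. This is an illustration under stated assumptions: `run_on_cluster` is a hypothetical stand-in for the real cluster interface call, the expected result of each check is assumed to be a count of violating rows, and the interface callback is replaced here by waiting on each future.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_cluster(sql):
    """Hypothetical stand-in for the cluster interface call.

    Assumption: a check returns the number of violating rows. This fake
    reports 3 violations for any script that touches `bad_table`.
    """
    return 3 if "bad_table" in sql else 0

def run_rules(rule_scripts, expected, batch_size=2, max_workers=4):
    """Submit each rule script as one task, batch by batch, and grade the results."""
    statuses = {}
    names = list(rule_scripts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(names), batch_size):
            batch = names[start:start + batch_size]
            # S61/S62: one task per rule script, submitted in batches.
            futures = {name: pool.submit(run_on_cluster, rule_scripts[name])
                       for name in batch}
            for name, future in futures.items():
                # S63: compare the result with the preset expected data.
                result = future.result()
                statuses[name] = "pass" if result == expected[name] else "fail"
    return statuses

def ready_to_report(statuses):
    """Table data is uploaded only when every rule has passed."""
    return bool(statuses) and all(s == "pass" for s in statuses.values())

scripts = {
    "r1": "SELECT COUNT(*) FROM trade_detail WHERE trade_date IS NULL",
    "r2": "SELECT COUNT(*) FROM bad_table WHERE amount IS NULL",
}
statuses = run_rules(scripts, expected={"r1": 0, "r2": 0})
```

With the fake cluster above, `r1` passes and `r2` fails, so `ready_to_report(statuses)` is false and nothing would be uploaded.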

Compared with the prior art, the invention has the following advantages:

1. The invention constructs a neural network model and trains it with preprocessed supervision rule sample data to obtain a rule classification model that identifies and classifies supervision rules automatically. Rule script information is then generated automatically by looking up the associated SQL template data, and the generated rule script information is run automatically by batch calculation. Because the neural network language classification model learns from the sample library in a supervised manner, nobody has to understand the semantics of massive numbers of rules, write complicated supervision rule code by hand, or learn complex regular expression syntax, which greatly reduces labor cost. The method can also identify and classify new supervision rules and generate the expected rule scripts, which greatly improves the accuracy of the quality rule code and ensures data reporting quality and service stability.

2. The invention observes that many of the SQL statements are similar and differ only slightly in the table names, field names or filter conditions involved. Similar SQL statements are therefore extracted into templates: the supervision rules are classified, each class is mapped to its SQL template as a key-value pair, and the mapping is stored in a database, giving a one-to-one mapping between supervision rules and SQL templates.

3. The invention runs all the generated rule script information in batches on a distributed big data cluster, so nobody has to copy, paste and execute code by hand. Index calculation over massive report data is fast, which improves the timeliness of reporting and ensures data reporting quality.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of an application process of an embodiment;

FIG. 3 is a diagram showing the word segmentation result in the embodiment;

FIG. 4 is a schematic diagram of a neural network model constructed in an embodiment;

FIG. 5 is a schematic diagram of different rule classification labels according to an embodiment;

FIG. 6 is a schematic diagram of a training process for a neural network model;

FIG. 7 is a schematic diagram of the process for generating rule script information;

FIG. 8 is a schematic diagram of the system architecture constructed in an embodiment.

Detailed Description

The invention will now be described in detail with reference to the drawings and specific examples.

Examples

As shown in fig. 1, a method for generating and running a quality rule of supervision report data includes the following steps:

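S1, acquiring supervision rule sample data and preprocessing the sample data;

S2, constructing a neural network model, and training the neural network with the preprocessed sample data to obtain a rule classification model;

S3, preprocessing the rule data to be processed, inputting it into the rule classification model, and outputting a classification result;

S4, based on the classification result, looking up the corresponding SQL template data in a preset database and generating rule script information;

S5, repeating steps S3-S4 until rule script information has been generated for all the rule data to be processed;

S6, running all the generated rule script information in batches, and transmitting the results to the supervision reporting system.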

By applying the technical scheme, as shown in fig. 2, the embodiment mainly includes:

Step 1, sample data preprocessing

A regular expression is used to remove illegal characters from a large number of supervision rule samples, so that special characters do not produce anomalous samples that interfere with model training. After this regular expression preprocessing, the sample data is segmented into words, and stop words are removed from the segmentation result. The training data of the model consists of training samples and classification labels.

The key to the subsequent neural network training is the correctness of the samples: a good neural network can only be trained from correct sample data, so the sample data must be preprocessed before training to obtain good learning samples. In this scheme, a regular expression first processes illegal characters in the text data, such as "#", in the preprocessing stage, so that special characters do not produce anomalous samples that interfere with model training. Next, the sample data is segmented with a word segmentation method based on dictionary and word-bank matching (the text data can also be segmented directly with the jieba word segmentation tool). The supervision rules can be classified into null value verification, enumeration value verification, length verification, and so on; the word segmentation result in this embodiment is shown in fig. 3.
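The preprocessing described above can be sketched as follows. The sketch assumes that the rule text is Chinese and uses forward maximum matching as one simple instance of dictionary-based segmentation; the dictionary, the stop word list and the function names are hypothetical examples, not taken from the patent.

```python
import re

# Anything that is neither a word character nor whitespace counts as illegal here.
ILLEGAL_CHARS = re.compile(r"[^\w\s]")

def clean(text):
    """Strip illegal characters such as '#' with a regular expression."""
    return ILLEGAL_CHARS.sub("", text)

def segment(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def preprocess(text, dictionary, stop_words):
    """Clean, segment, then drop stop words and whitespace tokens."""
    tokens = segment(clean(text), dictionary)
    return [t for t in tokens if t.strip() and t not in stop_words]

# Hypothetical dictionary and stop word list for illustration.
DICTIONARY = {"交易日期", "不允许", "为空"}
STOP_WORDS = {"的"}

tokens = preprocess("交易日期的值#不允许为空!", DICTIONARY, STOP_WORDS)
```

Here the rule text means "the value of the transaction date is not allowed to be null", and `tokens` keeps only the content-bearing words.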

To improve efficiency, this embodiment applies subsampling to common words that carry little useful information, such as "has been" and similar words that appear in most sentences, which improves the training speed and the accuracy of the word vectors. A common approach is to set a word frequency threshold t and discard each word w with probability prob, where f(w) is the frequency of w:

$$\mathrm{prob}(w) = 1 - \sqrt{\frac{t}{f(w)}}$$
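As a minimal sketch of subsampling, the code below assumes the discard probability commonly used for word vector training, prob(w) = 1 - sqrt(t / f(w)), clamped at zero for words rarer than the threshold. The default threshold value and the function names are illustrative assumptions.

```python
import math
import random
from collections import Counter

def discard_probability(freq, t=1e-3):
    """Probability of discarding a word with relative frequency `freq`.

    Assumed form: 1 - sqrt(t / freq), clamped to 0 for rare words.
    """
    if freq <= 0:
        raise ValueError("freq must be positive")
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, t=1e-3, seed=0):
    """Randomly drop frequent words; rare words are always kept."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if rng.random() >= discard_probability(counts[w] / total, t)]
```

For example, with t = 0.001 a word that makes up 10% of the corpus is dropped about 90% of the time, while a word at or below the threshold frequency is never dropped.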

Step 2, neural network model input

This scheme classifies text automatically with a deep learning method, so all words produced by segmentation must be converted into feature vectors that serve as the input of the neural network model. The number of input nodes is adjusted to the dimension of the feature vector, and n input layer neurons are set so that each sample has corresponding neurons.

The training data of the neural network language classification model consists of a sample word bank and classification labels. All words in a document are therefore converted into word vectors with FastText and used as the input of the neural network model. The number of input layer nodes must be chosen according to the number of inputs in the actual problem: it is adjusted to the dimension of the feature vector, and n input layer neurons are set so that each sample has corresponding neurons. To speed up learning, feature vectors are typically normalized to the range 0 to 1 before being passed into the input layer.
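The normalization step can be illustrated with min-max scaling. This is one possible choice, assumed here because the text only states that the values end up between 0 and 1; the function name is hypothetical, and scaling each vector independently is also an assumption.

```python
def min_max_normalize(vector):
    """Scale the components of one feature vector into the range [0, 1]."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        # A constant vector carries no spread; map it to all zeros.
        return [0.0] * len(vector)
    return [(v - lo) / (hi - lo) for v in vector]

normalized = min_max_normalize([-1.0, 0.0, 3.0])
```

The smallest component maps to 0, the largest to 1, and the rest fall proportionally in between.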

Step 3, neural network model output

The neural network has one input layer and one output layer, and may have several hidden layers. Each node in the last layer is an output node and represents one class, so n output nodes give a model with n classes: with 3 output nodes the model distinguishes 3 classes, and with 100 output nodes it distinguishes 100 classes. The neural network language model constructed in this embodiment is shown in fig. 4. The corresponding classification labels are shown in fig. 5, for example fixed value check, null value check, format check, enumerated value check and length check. In addition, the excitation (activation) function is applied at the last layer of the neural network, whose nodes serve as the output nodes of the predicted classification, and every output node uses the same excitation function.
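The mapping from output nodes to classes can be sketched as picking the node with the largest output value. The label order below is a hypothetical assignment for illustration; the actual order is the one defined in fig. 5.

```python
# One label per output node; the order here is an assumption.
LABELS = [
    "fixed value check",
    "null value check",
    "format check",
    "enumerated value check",
    "length check",
]

def classify(output_values, labels=LABELS):
    """Return the label of the output node with the highest value."""
    if len(output_values) != len(labels):
        raise ValueError("expected one output value per label")
    best = max(range(len(labels)), key=lambda k: output_values[k])
    return labels[best]

result = classify([0.05, 0.80, 0.05, 0.05, 0.05])
```

With these example outputs, the second node dominates, so the rule is classified as a null value check.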

The excitation function is specifically:

The training process of the whole neural network model is shown in fig. 6, and the rule classification model is obtained through self-supervised learning.

Step 4, one-to-one mapping of supervision rules and SQL templates

From past testing experience, this scheme finds that the SQL statements corresponding to most supervision rules are similar and differ only slightly in the table names, field names or filter conditions involved. Similar SQL statements are therefore extracted into templates: the supervision rules are classified, each class is mapped to its SQL template as a key-value pair, and the mapping is stored in a MySQL database as dynamic configuration, which makes it easy to modify by hand and for programs to identify and call.
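The key-value mapping can be sketched as a two-column table. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for the MySQL database; the table name, column names and template texts are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL configuration store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rule_sql_template ("
    " rule_type TEXT PRIMARY KEY,"      # key: the classification result
    " sql_template TEXT NOT NULL)"      # value: the SQL template
)
conn.executemany(
    "INSERT INTO rule_sql_template VALUES (?, ?)",
    [
        ("null value check",
         "SELECT COUNT(*) FROM ${table} WHERE ${column} IS NULL"),
        ("length check",
         "SELECT COUNT(*) FROM ${table} WHERE LENGTH(${column}) > ${max_length}"),
    ],
)

def get_template(connection, rule_type):
    """Look up the SQL template stored under a rule type."""
    row = connection.execute(
        "SELECT sql_template FROM rule_sql_template WHERE rule_type = ?",
        (rule_type,),
    ).fetchone()
    if row is None:
        raise KeyError(rule_type)
    return row[0]

template = get_template(conn, "null value check")
```

Because the templates live in a table rather than in code, a template can be changed with a plain UPDATE statement without redeploying the program.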

Step 5, intelligent processing of the existing supervision rules

The trained neural network language classification model is called to perform natural language processing on the existing rule data to be processed, and the model returns the classification result automatically. The type computed by the model is then used as the key to query the associated mapping data in the database and obtain the corresponding SQL template data.

As shown in fig. 7, the automation program first reads the text of the supervision rules to be processed and then calls the rule classification model to perform natural language processing on them and obtain the classification result. For example, if the rule to be processed is "transaction date is not allowed to be null", natural language processing outputs the null value check type. The type computed by the model is used as the key; the program connects to the MySQL database through a database connection pool, queries the associated mapping data and obtains the corresponding SQL template data; dynamic parameters such as table names and field names are then filled into the SQL template automatically to generate the complete SQL verification rule script code.
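Filling the dynamic parameters into a template can be sketched with `string.Template` from the standard library. The template text, the table name `trade_detail` and the column name `trade_date` are hypothetical; the identifier check is an added safeguard, not something the source describes.

```python
import re
from string import Template

# Only plain SQL identifiers are accepted as dynamic parameters.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def render_rule(template_text, **params):
    """Fill table and field names into an SQL template."""
    for name, value in params.items():
        if not IDENTIFIER.fullmatch(str(value)):
            raise ValueError(f"unsafe value for {name}: {value!r}")
    # substitute() raises KeyError if the template needs a missing parameter.
    return Template(template_text).substitute(params)

# Template assumed to be stored under the key "null value check".
NULL_CHECK = "SELECT COUNT(*) FROM ${table} WHERE ${column} IS NULL"

sql = render_rule(NULL_CHECK, table="trade_detail", column="trade_date")
```

For the rule "transaction date is not allowed to be null", the rendered script counts rows whose transaction date is missing, so an expected result of zero would mean the rule passes.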

Step 6, big data cluster distributed computation and visual display of results

The generated verification rules are submitted to the big data cluster for calculation, which yields the result status of each rule run. In practical application, the result status can be displayed on a front-end page, so that the user can see at a glance whether each rule passed or failed. If all rules pass, the table data corresponding to the rules is uploaded to the supervision reporting system, which completes the reporting of the supervision data.

Based on the above scheme, this embodiment builds a system architecture comprising a local system, a distributed computing platform and a reporting system, as shown in fig. 8. The local system runs the process that automatically generates the rule script code; the distributed computing platform performs distributed calculation on all the rule script code it receives and then, by timed scheduling, transmits the data whose rules have all passed to the reporting system.

During distributed calculation, each generated SQL verification rule is stored in the queue of a thread pool as an independent task. The tasks in the queue are submitted to the big data cluster in batches through interface calls for distributed calculation, and after the cluster finishes, the result data of each rule run is obtained through an interface callback. The result data is compared with preset expected data: the rule passes if they are consistent and fails if they are inconsistent. The outcome is displayed on a front-end web page as status information, so that the user can see at a glance whether each rule passed. When all rules pass, the table data corresponding to the rules is uploaded to the supervision reporting system. Because the results are displayed in batches on the front-end page, the user can quickly and accurately locate whether a given rule passed, which reduces the complexity of troubleshooting.

In summary, this technical scheme addresses the problem of poor data quality in supervision reporting. It identifies the quality requirements of the report data automatically through AI semantic analysis of the supervision quality rules and then classifies the quality rules. The automated testing tool can customize extensible SQL check templates for the different classes, generate quality rule verification scripts automatically by combining metadata from the enterprise data warehouse model and the big data platform, and schedule the computing and storage resources of the big data cluster to run the rules and display the results visually, which improves the timeliness of reporting and ensures data reporting quality.

The scheme combines deep learning with an automation program to analyze supervision quality rules and generate supervision rule code automatically. The generated rules can be applied directly in supervision reporting, and the learning threshold for data reporting staff is low: they do not need to understand the semantics of the supervision rules and can learn rule combing in a short time.

The scheme runs the supervision rule code automatically by distributed calculation, so nobody has to copy, paste and execute code by hand, and index calculation over massive report data is fast.

The scheme customizes classification templates and generates rules automatically. It is particularly suitable for scenarios with many, highly complex supervision rules, and effectively solves the problems of traditional manual combing, such as long time consumption and proneness to error.

Claims

1. The method for generating and operating the supervision report data quality rule is characterized by comprising the following steps:

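S1, acquiring supervision rule sample data and preprocessing the sample data;

S2, constructing a neural network model, and training the neural network with the preprocessed sample data to obtain a rule classification model;

S3, preprocessing the rule data to be processed, inputting it into the rule classification model, and outputting a classification result;

S4, based on the classification result, looking up the corresponding SQL template data in a preset database and generating rule script information;

S5, repeating steps S3-S4 until rule script information has been generated for all the rule data to be processed;

S6, running all the generated rule script information in batches, and transmitting the results to the supervision reporting system.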

2. The method for generating and running the supervision report data quality rule according to claim 1, wherein the specific process of step S1 is as follows: acquiring supervision rule sample data, and performing illegal character preprocessing on the supervision rule sample data; and then carrying out word segmentation on the sample data, and simultaneously carrying out stop word processing on the word segmentation result.

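3. The method for generating and running the supervision report data quality rule according to claim 2, wherein in step S1 a regular expression is used to perform the illegal character preprocessing on the supervision rule sample data; and word segmentation is performed with a word segmentation method based on dictionary and word-bank matching or with the jieba word segmentation tool.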

4. The method for generating and operating the quality rule of the supervision report data according to claim 1, wherein the neural network model constructed in the step S2 includes an input layer, a hidden layer and an output layer, the input of the input layer is a feature vector obtained by converting the preprocessed sample data, and the number of nodes of the input layer is greater than or equal to the dimension of the feature vector; each node in the output layer represents a classification result.

5. The method for generating and running the supervision report data quality rule according to claim 4, wherein the specific process of training the neural network in step S2 is as follows:

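initializing neural network parameters;

converting the preprocessed sample data into feature vectors, transmitting the feature vectors to the input layer, and transmitting the corresponding known classification labels, namely the given targets, to an ideal output unit;

acquiring node output of an output layer and a hidden layer;

calculating the deviation between the target value and the actual output, and the resulting error, so as to update the weights;

judging whether the current neural network model meets the preset precision requirement, if so, stopping training to obtain the rule classification model, and otherwise, returning to continue training.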

6. The method for generating and running the quality rule of the supervision report data according to claim 1, wherein the database preset in step S4 stores SQL template mapping information corresponding to different supervision rules, specifically key-value mapping.

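7. The method for generating and running the supervision report data quality rule according to claim 6, wherein the step S4 specifically includes the following steps:

S41, taking the classification result as a key, and querying the associated mapping data in the database through a database connection pool to obtain the corresponding SQL template data;

S42, automatically filling the dynamic parameters into the SQL template data to generate the complete SQL verification rule script information.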

8. The method for generating and running the quality rule of the supervision and reporting data according to claim 1, wherein the step S6 is specifically to submit all the generated rule script information to a big data cluster for calculation, so as to obtain the result state information of each rule operation, and when all the result state information is "pass", the current corresponding data information is transmitted to the supervision and reporting system.

9. The method for generating and running the supervision report data quality rule according to claim 8, wherein the big data cluster comprises a distributed computing unit and a distributed storage unit, and the distributed computing unit is used for performing distributed computation on the received rule script information; the distributed storage unit is used for storing metadata information.

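10. The method for generating and running the supervision report data quality rule according to claim 8, wherein the step S6 specifically includes the following steps:

S61, storing each piece of generated rule script information in the queue of a thread pool as a single task;

S62, submitting the tasks in the queue to a big data cluster in batches through interface calls for distributed calculation, and obtaining the result data of each rule run through an interface callback after the big data cluster finishes the calculation;

S63, comparing the result data with preset expected data, wherein a rule passes if the two are consistent and fails if they are inconsistent, and uploading the corresponding table data information to the supervision reporting system if all rules pass.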