CN115510102A

CN115510102A - Data analysis rule generation method and device based on data architecture

Info

Publication number: CN115510102A
Application number: CN202211142707.7A
Authority: CN
Inventors: 刘晨; 孙星
Original assignee: Encore Beijing Information Technology Co ltd
Current assignee: Encore Beijing Information Technology Co ltd
Priority date: 2022-09-20
Filing date: 2022-09-20
Publication date: 2022-12-23

Abstract

The embodiment of the specification provides a data parsing rule generation method and device based on a data architecture, wherein the method comprises the following steps: establishing a data model based on a data architecture tool; acquiring data model information of the data model, identifying a use scene and a management theme of the data model based on the data model information, automatically creating different management scenes, and automatically matching different business rule templates; and automatically identifying the constraint conditions of the fields in the data model, automatically generating different analysis rules according to different constraint conditions, and generating an analysis result report.

Description

Data analysis rule generation method and device based on data architecture

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a data parsing rule generating method and device based on a data architecture.

Background

Data profiling is a form of data analysis used to inspect data and assess its quality. Data profiling uses statistical techniques to discover the true structure, content, and quality of a data set (Olson, 2003). A profiling engine produces statistics that analysts can use to identify patterns in data content and structure. For example:

1) Null counts. Identifies whether null values exist and allows checking whether they are permitted.

2) Maximum/minimum values. Identifies outliers, such as negative values.

3) Maximum/minimum length. Identifies outliers or invalid values in fields with specific length requirements.

4) Frequency distribution of values in individual columns. Enables assessment of reasonableness (e.g., the distribution of country codes for transactions, inspection of frequently or infrequently occurring values, and the percentage of records populated with default values).

5) Data type and format. Identifies the level of non-conformance to format requirements, as well as unexpected formats (e.g., number of decimal places, embedded spaces, sample values).
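The statistics listed above can be illustrated with a minimal sketch. It is not part of the disclosed method; the function name `profile_column` and the sample data are hypothetical, and nulls are assumed to be represented as `None`.

```python
from collections import Counter

def profile_column(values):
    """Compute basic profiling statistics for one column (None means null)."""
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    return {
        "null_count": len(values) - len(non_null),          # item 1
        "min": min(non_null) if non_null else None,         # item 2
        "max": max(non_null) if non_null else None,
        "min_length": min(lengths) if lengths else None,    # item 3
        "max_length": max(lengths) if lengths else None,
        "frequencies": Counter(non_null),                   # item 4
        "types": Counter(type(v).__name__ for v in non_null),  # item 5
    }

# Hypothetical column of order amounts containing a null and a negative value.
amounts = [120, -5, None, 120, 300]
stats = profile_column(amounts)
print(stats["null_count"], stats["min"], stats["max"])
print(stats["frequencies"].most_common(1))
```

An analyst would still have to decide whether the reported null or the negative minimum is actually a violation; the sketch only produces the statistics.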

Profiling also includes cross-column analysis, which can identify overlapping or duplicate columns and expose embedded value dependencies. Cross-table analysis explores overlapping value sets and helps identify foreign key relationships. Most data profiling tools allow drilling down into the analyzed data for further investigation.
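As a rough illustration of how overlapping value sets can hint at a foreign key relationship, the following sketch computes the share of distinct child values that also appear in a candidate parent column. The function name `containment` and the sample columns are hypothetical and are not taken from the patent.

```python
def containment(child_values, parent_values):
    """Fraction of distinct non-null child values that also occur in the parent column."""
    child = {v for v in child_values if v is not None}
    parent = {v for v in parent_values if v is not None}
    if not child:
        return 0.0
    return len(child & parent) / len(child)

# Hypothetical columns: orders.customer_id and customer.id.
orders_customer_id = [1, 2, 2, 3, None]
customer_id = [1, 2, 3, 4]

ratio = containment(orders_customer_id, customer_id)
print(ratio)  # a ratio close to 1.0 suggests a foreign key candidate
```

A high ratio is only a hint; columns with small integer domains can overlap by coincidence, so a candidate relationship still needs confirmation.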

The analyst must evaluate the results of the profiling engine to determine whether the data conforms to rules and other requirements. A good analyst can use profiling results to confirm known relationships and to uncover hidden characteristics and patterns within and between data sets, including business rules and validity constraints. Profiling is typically used as part of data discovery in projects (particularly data integration projects) or to assess the current state of data that is targeted for improvement. Profiling results can be used to identify opportunities to improve the quality of both data and metadata (Olson, 2003).

While profiling is an effective way to understand data, it is only a first step in improving data quality: it enables an organization to identify potential problems. Solving those problems requires other forms of analysis, including business process analysis, data lineage analysis, and deeper data analysis, which help isolate the root causes of problems.

As enterprise digital transformation becomes widespread, enterprises build more and more systems, and these systems generate ever larger volumes of data. Ensuring the quality of this mass data so that its real value can be realized has therefore become a general concern.

One of the key techniques for improving data quality is to check data periodically against predetermined profiling rules. The check results or reports are then reviewed manually, problem data is screened out, and the problems are rectified.

At present, profiling rules are produced mainly by system builders or related technical personnel, who write them manually based on past experience and their knowledge of the system's data and business indicators. This process requires substantial labor and time, and rules are easily omitted or written incorrectly, so data quality monitoring is incomplete. In addition, the results of rule-based checks are not necessarily fed back to the relevant personnel in a timely and effective way, so the same quality problem may recur.

Disclosure of Invention

The present invention provides a data parsing rule generating method and device based on a data architecture, and aims to solve the above problems in the prior art.

The invention provides a data analysis rule generation method based on a data architecture, which comprises the following steps:

establishing a data model based on a data architecture tool;

acquiring data model information of the data model, identifying a use scene and a management theme of the data model based on the data model information, automatically creating different management scenes, and automatically matching different business rule templates;

and automatically identifying the constraint conditions of the fields in the data model, automatically generating different analysis rules according to different constraint conditions, and generating an analysis result report.

The invention provides a data analysis rule generating device based on a data architecture, which comprises:

the establishing module is used for establishing a data model based on the data architecture tool;

the matching module is used for acquiring data model information of the data model, identifying a use scene and a management theme of the data model based on the data model information, automatically creating different management scenes and automatically matching different business rule templates;

and the generation module is used for automatically identifying the constraint conditions of the fields in the data model, automatically generating different analysis rules according to different constraint conditions and generating an analysis result report.

An embodiment of the present invention further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the data profiling rule generation method based on the data architecture.

An embodiment of the present invention further provides a computer-readable storage medium on which an implementation program for information transfer is stored, wherein the program, when executed by a processor, implements the steps of the data profiling rule generation method based on the data architecture.

By adopting the embodiment of the invention, the effects of reducing labor and time cost and improving the checking coverage rate and accuracy rate can be achieved.

Drawings

In order to illustrate one or more embodiments of the present specification or the prior art solutions more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some of the embodiments described in the present specification, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a data profiling rule generation method based on a data architecture according to an embodiment of the present invention;

FIG. 2 is a flowchart of a detailed process of a data profiling rule generation method based on a data architecture according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a data profiling rule generating apparatus based on a data architecture according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in one or more embodiments of the present disclosure, the technical solutions in one or more embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in one or more embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from one or more of the embodiments described herein without making any inventive step shall fall within the scope of protection of this document.

Method embodiment

According to an embodiment of the present invention, a data profiling rule generation method based on a data architecture is provided. FIG. 1 is a flowchart of the method according to the embodiment of the present invention. As shown in FIG. 1, the method specifically includes:

Step S101, establishing a data model based on a data architecture tool. The data architecture tool is a data modeling tool that assists professional modelers in drawing logical and physical data models and that can generate DDL statements.

Step S102, acquiring data model information of the data model, identifying a use scene and a management theme of the data model based on the data model information, automatically creating different management scenes, and automatically matching different business rule templates;

Step S103, automatically identifying constraint conditions of fields in the data model, automatically generating different profiling rules according to different constraint conditions, and generating a profiling result report. Step S103 specifically includes: when the field is a primary key field, automatically generating a non-null and unique-value profiling rule, where, if the primary key consists of a single attribute, only that field is checked for duplicate values and null values, and, if the primary key consists of a plurality of attributes, the plurality of fields are checked simultaneously for duplicate values and for null values; when the field is a non-null field, automatically generating a profiling rule of the null-value check class, that is, checking whether the field is empty; and when the field is a relationship/foreign key, automatically generating an association profiling rule, in which an association check is generated from the dependency established by the relationship.
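The constraint-to-rule mapping of step S103 can be sketched as follows. This is an illustrative reading, not the patented implementation: the dictionary layout of the model, the rule names, and the choice to express each rule as a SQL query counting violating rows are all assumptions. For a composite primary key, the sketch flags a row when any key column is null, which follows ordinary SQL primary key semantics; the source wording could also be read as requiring all key columns to be null.

```python
import sqlite3

def generate_rules(table):
    """Return (rule_name, sql) pairs; each SQL statement counts violating rows."""
    name = table["name"]
    rules = []
    pk = table.get("primary_key", [])
    if pk:
        # Primary key: non-null check plus uniqueness over all key columns together.
        null_cond = " OR ".join(f"{c} IS NULL" for c in pk)
        cols = ", ".join(pk)
        rules.append((f"{name}.pk_not_null",
                      f"SELECT COUNT(*) FROM {name} WHERE {null_cond}"))
        rules.append((f"{name}.pk_unique",
                      f"SELECT COUNT(*) FROM (SELECT {cols} FROM {name} "
                      f"GROUP BY {cols} HAVING COUNT(*) > 1) AS d"))
    for col in table.get("not_null", []):
        # Non-null field: null-value check.
        rules.append((f"{name}.{col}_not_null",
                      f"SELECT COUNT(*) FROM {name} WHERE {col} IS NULL"))
    for col, (ref_table, ref_col) in table.get("foreign_keys", {}).items():
        # Relationship/foreign key: referenced value must exist in the parent table.
        rules.append((f"{name}.fk_{col}",
                      f"SELECT COUNT(*) FROM {name} c LEFT JOIN {ref_table} p "
                      f"ON c.{col} = p.{ref_col} "
                      f"WHERE c.{col} IS NOT NULL AND p.{ref_col} IS NULL"))
    return rules

# Hypothetical model information for an order table that references a customer table.
orders_model = {
    "name": "orders",
    "primary_key": ["id"],
    "not_null": ["customer_id"],
    "foreign_keys": {"customer_id": ("customer", "id")},
}
rules = generate_rules(orders_model)

# Run the generated rules against deliberately dirty sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER, name TEXT);
CREATE TABLE orders (id INTEGER, customer_id INTEGER);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1), (10, 2), (NULL, 9), (12, NULL);
""")
violations = {name: conn.execute(sql).fetchone()[0] for name, sql in rules}
for name, count in violations.items():
    print(name, count)
```

The sample tables are created without database constraints on purpose, so that the generated rules have something to detect.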

After step S103, the profiling results report is displayed back into the data architecture tool.

The above-described technical means of the embodiments of the present invention will be described in detail below.

As shown in fig. 2, the method specifically includes the following steps:

Step 1, establishing a data model by means of a data architecture tool. The data architecture tool can be understood as a data modeling tool, that is, a software tool that assists professional modelers in drawing logical and physical data models and that can generate DDL statements. Modeling tools currently in wide use in the industry include PowerDesigner and ERwin; relatively mature domestic tools include Datablau-DDM, weavertir, and the like.
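To make the notion of a data model that "can generate DDL statements" concrete, here is a minimal sketch of an in-memory physical model and a DDL generator. The classes `Field` and `Table` and the function `to_ddl` are hypothetical; they do not reflect the internal format of any of the tools named above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Field:
    name: str
    data_type: str
    nullable: bool = True
    primary_key: bool = False
    references: Optional[Tuple[str, str]] = None  # (parent table, parent column)

@dataclass
class Table:
    name: str
    fields: List[Field] = field(default_factory=list)

def to_ddl(table):
    """Render a CREATE TABLE statement from the physical model."""
    lines = []
    for f in table.fields:
        col = f"{f.name} {f.data_type}"
        if f.primary_key or not f.nullable:
            col += " NOT NULL"
        lines.append(col)
    pk = [f.name for f in table.fields if f.primary_key]
    if pk:
        lines.append(f"PRIMARY KEY ({', '.join(pk)})")
    for f in table.fields:
        if f.references:
            ref_table, ref_col = f.references
            lines.append(f"FOREIGN KEY ({f.name}) REFERENCES {ref_table} ({ref_col})")
    return f"CREATE TABLE {table.name} (\n  " + ",\n  ".join(lines) + "\n);"

customer = Table("customer", [
    Field("id", "INTEGER", primary_key=True),
    Field("name", "TEXT", nullable=False),
])
orders = Table("orders", [
    Field("id", "INTEGER", primary_key=True),
    Field("customer_id", "INTEGER", nullable=False, references=("customer", "id")),
])
print(to_ddl(customer))
print(to_ddl(orders))
```

The same field attributes (primary key, nullability, references) are the constraint conditions that the later steps read when generating profiling rules.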

Step 2, acquiring the data model information, identifying the usage scenario and the management theme of the data model, automatically creating different management scenarios, and automatically matching different business rule templates. For example, the data model of a system related to the regulatory submission theme is initialized, and business rules are configured based on that model.
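The patent does not state how the management theme is identified or what the business rule templates contain. The sketch below therefore only illustrates one possible shape of this step, using keyword matching on model metadata; the theme names, keywords, and template names are all invented for illustration.

```python
# Hypothetical keyword lists per management theme (assumption, not from the patent).
THEME_KEYWORDS = {
    "regulatory_submission": ("regulatory", "submission", "report"),
    "customer": ("customer", "client"),
}

# Hypothetical business rule templates per theme (assumption, not from the patent).
RULE_TEMPLATES = {
    "regulatory_submission": ["completeness", "timeliness"],
    "customer": ["uniqueness", "format"],
}

def identify_theme(model_info):
    """Pick the first theme whose keywords occur in the model name or description."""
    text = " ".join([model_info["name"], model_info.get("description", "")]).lower()
    for theme, keywords in THEME_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return theme
    return "general"

def match_templates(model_info):
    """Return the identified theme and the rule templates matched to it."""
    theme = identify_theme(model_info)
    return theme, RULE_TEMPLATES.get(theme, [])

model_info = {"name": "East submission system", "description": "Regulatory data model"}
theme, templates = match_templates(model_info)
print(theme, templates)
```

A production system would more plausibly rely on tags or subject areas maintained in the modeling tool than on free-text keywords.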

Step 3, automatically identifying the constraint conditions of the fields in the model, and automatically generating different profiling rules according to different constraint conditions, for example:

1. "Primary key field": a profiling rule for non-null and unique values is automatically generated.

When the primary key consists of a single attribute, only that field is checked for duplicate values and null values.

When the primary key consists of a plurality of attributes, the plurality of fields must be checked simultaneously for duplicate values and for null values.

2. "Non-null field": a profiling rule of the null-value check class is automatically generated, that is, the field is checked for being empty (no data entered).

3. "Relationship/foreign key": an association profiling rule is automatically generated. An association check is generated from the dependency established by the relationship. For example, given an order table and a customer table, where the order table contains customer information, the customer information in the order table must be checked to confirm that it exists in the customer table.

Step 4, finally, displaying the profiling result report back in the data architecture tool for reference by data architecture personnel.
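The report produced in this step can be pictured as a list of rule outcomes. The following sketch assumes that each profiling rule is a SQL query returning the number of violating rows; the rule names, report fields, and sample data are hypothetical, and the actual display inside a data architecture tool is not shown.

```python
import sqlite3

def run_rules(conn, rules):
    """Execute each (rule_name, sql) pair and record the number of violating rows."""
    report = []
    for rule_name, sql in rules:
        violations = conn.execute(sql).fetchone()[0]
        report.append({
            "rule": rule_name,
            "violations": violations,
            "status": "PASS" if violations == 0 else "FAIL",
        })
    return report

def format_report(report):
    """Render the report as plain text, one rule per line."""
    return "\n".join(
        f"{r['status']} {r['rule']} ({r['violations']} violating rows)" for r in report
    )

# Hypothetical sample data with one missing customer name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, None)])

rules = [
    ("customer.name_not_null", "SELECT COUNT(*) FROM customer WHERE name IS NULL"),
    ("customer.id_not_null", "SELECT COUNT(*) FROM customer WHERE id IS NULL"),
]
report = run_rules(conn, rules)
print(format_report(report))
```

Keeping the report as structured records rather than text makes it easier for a modeling tool to attach each result to the field it concerns.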

According to the embodiment of the invention, the data architecture information is acquired through the data architecture tool, the constraint conditions of the fields in the data architecture are identified, and the data analysis rule can be automatically and efficiently generated according to different scenes and the constraint conditions, so that the aims of reducing labor and time costs and improving the coverage rate and accuracy rate are fulfilled.

Apparatus embodiment one

According to an embodiment of the present invention, a data profiling rule generating device based on a data architecture is provided. FIG. 3 is a schematic diagram of the device according to the embodiment of the present invention. As shown in FIG. 3, the device specifically includes:

an establishing module 30, configured to establish a data model based on a data architecture tool, where the data architecture tool is a data modeling tool that assists professional modelers in drawing logical and physical data models and that can generate DDL statements;

The matching module 32 is configured to obtain data model information of the data model, identify a usage scenario and a management topic of the data model based on the data model information, automatically create different management scenarios, and automatically match different business rule templates;

and the generating module 34 is configured to automatically identify constraint conditions of fields in the data model, and automatically generate different parsing rules according to different constraint conditions to generate a parsing result report. The generating module 34 is specifically configured to:

when the field is a primary key field, automatically generating a non-null and unique value parsing rule, and when the primary key field is an attribute, only checking whether a repeated value exists in the field and whether a null value exists; when the primary key field is a plurality of attributes, simultaneously judging whether a plurality of fields have repeated values and whether null values simultaneously exist in the fields;

when the field is a non-null field, automatically generating a parsing rule of a null value check class, namely checking whether the field is null or not;

and when the field is a relation/foreign key, automatically generating a correlation analysis rule, and generating correlation verification through a dependency relationship established by the relation.

The above device may further include a back display module, configured to display the profiling result report back in the data architecture tool.

Compared with the prior art, the technical scheme provided by the invention automatically generates the data parsing rule according to the data architecture. After the technology is adopted, the labor cost and the time cost are obviously reduced, and the coverage rate and the accuracy rate of the check are obviously improved. In practical use, the invention has low learning cost and obviously improved efficiency, and meets the application requirement.

The embodiment of the present invention is an apparatus embodiment corresponding to the above method embodiment, and specific operations of each module may be understood with reference to the description of the method embodiment, which is not described herein again.

Apparatus embodiment two

An embodiment of the present invention provides an electronic device, as shown in fig. 4, including: a memory 40, a processor 42 and a computer program stored on the memory 40 and executable on the processor 42, which computer program when executed by the processor 42 performs the steps as described in the method embodiments.

Apparatus embodiment three

An embodiment of the present invention provides a computer-readable storage medium, on which an implementation program for information transmission is stored, and when executed by the processor 42, the program implements the steps as described in the method embodiment.

The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data analysis rule generation method based on a data architecture is characterized by comprising the following steps:

establishing a data model based on a data architecture tool;

2. The method of claim 1, further comprising:

displaying the profiling results report back into the data architecture tool.

3. The method of claim 1, wherein the data architecture tool is a data modeling tool that assists professional modelers in drawing logical and physical data models and that can generate DDL statements.

4. The method of claim 1, wherein automatically identifying constraints for fields in the data model and automatically generating different parsing rules based on different constraints specifically comprises:

and when the field is a relation/foreign key, automatically generating a correlation analysis rule, and generating correlation verification through a dependency relation established by the relation.

5. A data analysis rule generation device based on a data architecture is characterized by comprising:

6. The apparatus of claim 5, further comprising:

and the back display module is used for displaying the analysis result report back to the data architecture tool.

7. The apparatus of claim 5, wherein the data architecture tool is a data modeling tool that assists professional modelers in drawing logical and physical data models and that can generate DDL statements.

8. The apparatus of claim 5, wherein the generation module is specifically configured to:

when the field is a non-null field, automatically generating an analysis rule of a null value check class, namely checking whether the field is null or not;

9. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of the data architecture based data profiling rule generating method according to any of claims 1 to 4.

10. A computer-readable storage medium, characterized in that an implementation program for information transfer is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the steps of the data profiling rule generation method based on a data architecture according to any one of claims 1 to 4.