CN117332286A

CN117332286A - System, method and device for data mapping verification

Info

Publication number: CN117332286A
Application number: CN202311272327.XA
Authority: CN
Inventors: 郑清正
Original assignee: Jiangsu Suning Bank Co Ltd
Current assignee: Jiangsu Suning Bank Co Ltd
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2024-01-02

Abstract

The invention discloses a system, a method and a device for data mapping verification. The system comprises: a normalization module, used for acquiring data source information, preprocessing historical data and newly added data, and constructing a sample set; a data clustering module, used for acquiring data of the sample set, clustering the sample set and constructing a mapping dictionary; and a data multi-classification module, used for training a prediction model, inputting newly added field information into the trained prediction model, and outputting a mapping relation according to the prediction result. If the prediction result is null, a mapping relation is established for the newly added field information, a mapping data set is generated according to the mapping relation, and the mapping dictionary and the prediction model are updated. The system, method and device solve the problem of mapping field names and field contents that are similar to one another. Because the prediction model is trained on existing data, the mapping relation is established automatically, which reduces the time spent on manual operation and improves working efficiency.

Description

System, method and device for data mapping verification

Technical Field

The invention belongs to the technical field of data mapping verification, and particularly relates to a system, a method and a device for data mapping verification.

Background

Financial enterprises such as banks need to connect to different external data service providers. In some scenarios, data of the same category is accessed from different data sources, either simultaneously or one after another. The data field structures provided by different service providers are partly similar and partly different: for example, fields may be of the same kind but follow inconsistent code-value specifications, or only some of the fields may match. For unified management and maintenance, such data needs to be fused together, and sorting out the differences manually is time-consuming and laborious. Likewise, when data management teams sort out financial business standards, they must judge the similarity of, and merge and optimize, the standard definitions for different products of the same business. A similarity check and fusion analysis technique for data source fields is therefore desirable.

Prior-art reference methods are as follows. In the first method, patent CN114462421A, matching is performed using the similarity of data tables and fields. The method performs semantic recognition on the table names and field names of a data source and a destination to obtain data source semantics and destination semantics; compares the semantics of each field of each data source with the semantics of all fields of the destination to obtain a semantic similarity list for each field of the data source; determines the mapping relation between the data source and the destination from a mapping rule set according to the semantic similarity list; stores all mapping relations in a mapping relation library; judges whether all mapping relations in the library are reasonable and, if not, raises an alarm and waits for manual intervention; and incorporates the mapping relations confirmed after manual intervention into the mapping rule set. In the second method, patent CN115729935B (2022) provides a data interaction processing method and system based on an ORM framework. That method converts the relevant configuration of a data source into rules adapted to reading that data source, and constructs the data of different data sources into data types with unified rules to obtain unified data. Neither approach can be applied directly to fusing data that is similar yet different, and because both rely on rule mapping, considerable time must still be invested in defining the mapping for each field.

Therefore, a way is needed to realize certain automatic mapping processing aiming at management and data standardization processing of similar data sources, and reduce the time consumption of manual one-to-one mapping processing.

Disclosure of Invention

The invention aims to provide a system, a method and a device for verifying data mapping, which are used for solving the problems that management of similar data sources and data standardization processing need manual processing, so that the mapping processing is more time-consuming and lower in efficiency.

In order to achieve the above purpose, the present invention provides the following technical solutions: a system for data mapping verification, comprising:

the normalization module is used for acquiring data source information, wherein the data source information comprises historical data and newly-added data, preprocessing the historical data and the newly-added data respectively to obtain historical field information and newly-added field information, selecting samples of the historical field information, and constructing a sample set;

the data clustering module is used for acquiring data of the sample set, carrying out clustering treatment on the sample set to obtain a clustering result, storing the mapping clustering result, cleaning the mapping clustering result with similarity, and constructing a mapping dictionary;

the data multi-classification module is used for acquiring the mapping relation in the mapping dictionary, extracting and fusing the characteristics of the mapping relation, training a prediction model, inputting newly added field information into the trained prediction model to obtain a prediction result, outputting the mapping relation, if the prediction result is null, establishing the mapping relation of the newly added field information, generating a mapping data set according to the mapping relation, and updating the mapping dictionary and the prediction model.

Preferably, the history data and the newly added data each include a field name, a field content and field data,

the normalization module comprises:

the field name preprocessing module is used for cleaning the fields of the field names to obtain standard field names;

the field content preprocessing module is used for carrying out field cleaning on the field content to obtain standard field content and constructing a sample set of the standard field content;

the field data preprocessing module is used for carrying out field data duplication elimination and counting field data;

and the field merging module is used for merging the standard field name, the standard field content and the field data into a fusion character string and carrying out vectorization processing on the character string.

Preferably, the clustering result includes a point cluster and noise points,

the data clustering module comprises:

the cluster calculation module is used for calculating data of the sample set, generating a clustering result, and identifying point clusters and noise points in the clustering result;

the feature mapping module is used for mapping the clustering result, mapping the point clusters and the noise points and constructing a mapping dictionary according to the mapping relation;

the manual intervention module is used for providing a port for manual operation;

the data verification module is used for verifying the mapping relation in the feature mapping module:

responding to the noise point checking command, inputting the noise point to the feature mapping module through the manual intervention module, and if the existing mapping relation does not exist, establishing a new mapping relation;

and responding to the field information checking command, judging the similarity of the standard field names through a manual intervention module, and manually determining the mapping relation of the standard field names with the similarity but different meanings.

Preferably, the data multi-classification module includes:

the model training module is used for acquiring the mapping relation and the corresponding standard field names as characteristics, fusing the characteristics by converting the standard field names, inputting the fused characteristics into the prediction model, and training the prediction model;

and the data updating module is used for acquiring newly added field information with the empty prediction result, inputting the newly added field information into the data clustering module, updating the mapping relation of the newly added field information, and updating the mapping dictionary and the prediction model.

A method of data mapping verification, comprising:

acquiring historical data, preprocessing the historical data to obtain historical field information, and performing sample selection on the historical field information to construct a sample set;

based on the data of the sample set, carrying out clustering treatment on the sample set to obtain a clustering result, mapping the clustering result, cleaning the mapping clustering result with similarity, and constructing a mapping dictionary;

based on the constructed mapping dictionary, obtaining a mapping relation, extracting and fusing the characteristics of the mapping relation, and training a prediction model;

acquiring newly added data, and preprocessing the newly added data to obtain newly added field information;

based on the obtained newly added field information, inputting the newly added field information into a trained prediction model to obtain a prediction result, outputting a mapping relation, if the prediction result is null, establishing the mapping relation of the newly added field information, generating a mapping data set according to the mapping relation, and updating a mapping dictionary and the prediction model.

preprocessing the historical data and the newly added data respectively comprises the following steps:

preprocessing a field name, and cleaning the field name to obtain a standard field name;

preprocessing field content, performing field cleaning on the field content to obtain standard field content, and constructing a sample set of the standard field content;

preprocessing field data, de-duplicating the field data, and counting the field data.

Preferably, the sample selection is performed on the history field information, and before the sample set is constructed, the method further comprises: vectorization processing is carried out on the history field information and the newly added field information respectively, and the method comprises the following steps:

acquiring a standard field name, standard field content and field data;

merging the standard field name, the standard field content and the field data into a fusion character string, and if the merged fusion character string is too large, sampling and processing the standard field content and the field data, and merging to construct the fusion character string;

and carrying out vectorization processing on the character string.
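The merging step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_fusion_string`, the `|` and `,` separators, the `MAX_LEN` limit and the sampling strategy (truncate the content, randomly sample the data values) are all assumptions introduced for the example.

```python
import random

MAX_LEN = 200  # hypothetical limit on the fused string length


def build_fusion_string(name, content, data, max_len=MAX_LEN, seed=0):
    """Merge a field name, its content description and its data values."""
    values = [str(v) for v in data]
    fused = "|".join([name, content, ",".join(values)])
    if len(fused) <= max_len:
        return fused

    # The merged string is too large: shorten the content, sample the
    # data values at random, then merge again.
    rng = random.Random(seed)
    budget = max(max_len - len(name) - 2, 0)  # 2 separators between parts
    short_content = content[: budget // 2]
    remaining = budget - len(short_content)

    rng.shuffle(values)
    sampled, used = [], 0
    for value in values:
        cost = len(value) + (1 if sampled else 0)  # +1 for the comma
        if used + cost <= remaining:
            sampled.append(value)
            used += cost
    return "|".join([name, short_content, ",".join(sampled)])
```

A fixed seed keeps the sampled string reproducible, so the same field always yields the same vector in the later vectorization step.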

Preferably, the clustering result includes a point cluster and noise points,

building the mapping dictionary includes:

clustering calculation: calculating data of the sample set, generating a clustering result, and identifying point clusters and noise points in the clustering result;

feature mapping, mapping clustering results, mapping point clusters and noise points, and constructing a mapping dictionary according to a mapping relation;

manual intervention, providing manual operation during data verification;

data verification, verifying the mapping relation in the feature mapping:

in response to a noise point checking command, inputting the noise point to the feature mapping through manual intervention, and if no existing mapping relation applies, establishing a new mapping relation;

in response to a field information checking command, judging the similarity of the standard field names through manual intervention, and manually determining the mapping relation of standard field names that are similar but have different meanings.

Preferably, extracting and fusing the features of the mapping relation, and training the prediction model includes:

model training, namely acquiring a mapping relation and corresponding standard field names as features, fusing the features by converting the standard field names, inputting the fused features into a prediction model, and training the prediction model;

and updating data, namely acquiring newly added field information with a null prediction result, establishing a mapping relation of the newly added field information, and updating a mapping dictionary and a prediction model.

A device for data mapping verification, comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor implementing the above method when executing the computer program.

The invention has the technical effects and advantages that:

The data multi-classification module trains the prediction model on existing historical field information to verify mapping relations automatically. When newly added field information has no existing mapping relation, a mapping relation is established for it and the mapping relation and group name are added, which solves the mapping problem for newly added field information and improves the prediction model. This further automates the mapping relation, reduces the time spent on manual operation and improves working efficiency.

Drawings

FIG. 1 is a schematic diagram of a system of the present invention;

FIG. 2 is a schematic diagram of the method of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention provides a system, a method and a device for data mapping verification, as shown in FIGS. 1-2. The method is executed using the modules of the system, the runtime environment may be Python, and the method comprises the following steps:

S1: a standardized public data dictionary for multiple external tax data sources, such as Jiangsu tax and Anhui tax, is input to the normalization module as data source information, and the external data sources are mapped into the standardized public dictionary, wherein the data source information comprises historical data and newly added data;

the history data and the newly added data are respectively processed by a field preprocessing module, the field names, the field contents and the field data of a plurality of tables are respectively processed,

The preprocessing of the field names comprises: cleaning the field names, for example by removing repeated spaces, brackets and special characters, to obtain standard field names.

The preprocessing of the field content comprises: cleaning the field content, including removing repeated spaces, brackets and special characters, to obtain standard field content. For fields whose content description is excessively long, secondary processing information with a specified text length can be constructed.

The preprocessing of field data comprises: and counting the maximum value, the minimum value and the number of the fields after de-duplication.
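The three preprocessing steps can be sketched in Python as below. The helper names `clean_text` and `field_data_stats`, and the exact set of characters treated as brackets or special characters, are assumptions chosen for illustration; the source only states that repeated spaces, brackets and special characters are removed and that the maximum, minimum and de-duplicated count are computed.

```python
import re


def clean_text(text):
    """Clean a field name or field content string."""
    # Replace ASCII and full-width brackets with spaces.
    text = re.sub(r"[\[\](){}<>（）【】]", " ", text)
    # Replace remaining special characters (anything that is not a word
    # character or whitespace) with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def field_data_stats(values):
    """De-duplicate field data, then report maximum, minimum and count."""
    unique = sorted(set(values))
    if not unique:
        return {"min": None, "max": None, "count": 0}
    return {"min": unique[0], "max": unique[-1], "count": len(unique)}
```

For example, `clean_text("  Tax  (Reg)  No.# ")` yields `"Tax Reg No"`, and `field_data_stats([3, 1, 3, 2])` yields a minimum of 1, a maximum of 3 and a count of 3.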

After the processing, the standard field names, the standard field contents and the field data in the history data and the newly added data are respectively subjected to vectorization processing to respectively generate history field information and newly added field information, and the method comprises the following steps:

An Embedding tool, such as a word-vector or sentence-vector model, is used to convert the standard field name, the standard field content and the field data into three Embedding vectors, which are concatenated in sequence into one large vector V. A mapping relation $D_v = \{v_i : c_i, \dots\}$ between the encoding V and the field names is then established, where $v_i$ is the Embedding vector and $c_i$ is the original field name. The step of establishing the mapping relation is repeated until vectorization of both the history field information and the newly added field information is complete.
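A minimal sketch of this vectorization step follows. The source does not name a specific Embedding tool, so `toy_embed` is a hypothetical stand-in (hashed character bigrams) used only to keep the example self-contained; a real word-vector or sentence-vector model would replace it. The names `vectorize_field`, `build_dv` and `DIM` are likewise assumptions.

```python
import hashlib

DIM = 16  # hypothetical size of each of the three Embedding vectors


def toy_embed(text, dim=DIM):
    """Stand-in embedding: L2-normalized counts of hashed character bigrams."""
    vec = [0.0] * dim
    padded = f" {text} "
    for i in range(len(padded) - 1):
        digest = hashlib.sha256(padded[i:i + 2].encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else vec


def vectorize_field(std_name, std_content, data_summary):
    """Concatenate the three Embedding vectors, in order, into one vector V."""
    return tuple(
        toy_embed(std_name) + toy_embed(std_content) + toy_embed(data_summary)
    )


def build_dv(fields):
    """Build D_v = {v_i: c_i}: Embedding vector -> original field name.

    `fields` holds (original_name, std_name, std_content, data_summary) tuples.
    """
    return {
        vectorize_field(std_name, std_content, data_summary): original
        for original, std_name, std_content, data_summary in fields
    }
```

Vectors are stored as tuples so that they can serve as dictionary keys, which mirrors the $v_i : c_i$ orientation of the mapping.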

Finally, N different samples are selected by sampling from the vectorized standard field names, standard field contents and field data to construct a sample set. For field data, it must also be determined whether the field content is of an enumeration type or a continuous numerical type: for an enumeration-type field, the de-duplicated enumeration content is constructed as a sample list; for a continuous field, a sample set is constructed and the maximum value, the minimum value and the de-duplicated count of the samples are obtained. Constructing the sample set of history field information makes it possible to detect missing information dimensions, which facilitates subsequent clustering.
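The enumeration-versus-continuous distinction might be handled as in the sketch below. The source does not say how the type is decided, so the rule used here (numeric values with more than `enum_threshold` distinct entries count as continuous) is purely an assumption, as are the function name and parameters.

```python
import random


def build_sample(values, n=5, enum_threshold=10, seed=0):
    """Build the sample for one field from its raw data values."""
    unique = set(values)  # de-duplicate first
    numeric = bool(unique) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in unique
    )
    if numeric and len(unique) > enum_threshold:
        # Continuous field: draw N samples and keep max, min and count.
        ordered = sorted(unique)
        picked = random.Random(seed).sample(ordered, min(n, len(ordered)))
        return {
            "type": "continuous",
            "samples": sorted(picked),
            "min": ordered[0],
            "max": ordered[-1],
            "count": len(ordered),
        }
    # Enumeration field: the de-duplicated content itself is the sample list.
    return {"type": "enumeration", "samples": sorted(unique, key=str)}
```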

S2: the sample set is processed by the data clustering module. Unsupervised clustering is performed on the set of Embedding vectors V using the DBSCAN algorithm (a density-based clustering algorithm) to obtain several different groups $\{g_1, g_2, \dots\}$ and a set E of unclassified scatter points (noise points). The clustering requirement for DBSCAN is that the minimum number of samples must be greater than the number N of samples randomly selected for each field;
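To make the grouping and the noise set E concrete, here is a compact pure-Python DBSCAN sketch using Euclidean distance. It is an illustration only: the source does not state the distance metric or the `eps` radius, and a production system would more likely use an existing library implementation. One way to meet the stated requirement is to set `min_samples` to N + 1, which is an assumption about how the constraint would be applied.

```python
import math

NOISE = -1  # label for scatter points in the set E


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for groups, -1 for noise."""

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels
```

With two tight triples of points and one distant point, `dbscan(points, eps=0.5, min_samples=3)` assigns labels 0 and 1 to the triples and -1 to the distant point.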

Through the feature mapping module, a unified group-naming specification and a standard mapping relation are established: for each group $g_i$, a mapping dictionary entry $V_i : C_i$ is constructed, where $V_i$ corresponds to $g_i$ and $C_i$ is the standard-conforming field name (i.e., the group name). Combined with the dictionary $D_v$, the mapping relation from the original field name $c_i$ to the standard field name $C_i$ is established for each point cluster;
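The construction of the mapping dictionary can be sketched as below, assuming cluster labels with -1 for noise and a `group_names` table (label to standard name $C_i$) supplied by whoever defines the naming specification. The function name and argument layout are hypothetical.

```python
from collections import defaultdict


def build_mapping_dictionary(vectors, labels, dv, group_names):
    """Combine cluster labels with D_v to map original names to standard names.

    vectors:     the Embedding vectors that were clustered
    labels:      one cluster label per vector, -1 meaning noise
    dv:          D_v, mapping each vector v_i to its original field name c_i
    group_names: mapping from cluster label to standard field name C_i
    """
    groups = defaultdict(list)  # label -> vectors V_i of group g_i
    noise = []                  # vectors left unclassified
    for vector, label in zip(vectors, labels):
        if label == -1:
            noise.append(vector)
        else:
            groups[label].append(vector)

    # c_i -> C_i for every vector that landed in a point cluster.
    name_map = {
        dv[vector]: group_names[label]
        for label, members in groups.items()
        for vector in members
    }
    noise_names = [dv[vector] for vector in noise]
    return dict(groups), name_map, noise_names
```

The returned `noise_names` list is what would be handed to the manual checking step, since noise points have no standard name yet.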

checking the correctness of the noise points and the mapping relation through a data checking module comprises the following steps:

Manual checking of noise points: the similarity between each noise point and the existing mapping relations is checked to judge whether the noise point belongs to a subset into which it can be merged. If a related subset exists, the noise point is merged into the existing mapping relation; if no related subset can be found, a new group and group name are created to store the mapping relation of the scattered point. The data verification module compares multiple similar mapping relations so that the mapping relations of both point clusters and noise points are taken into account; the mapping relations of noise points are recorded after manual verification, so that every clustering result is considered and a complete mapping relation is constructed.
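In the source this check is performed by a person. The sketch below only shows how a tool might propose a candidate group for the reviewer: it compares the noise vector with each group centroid by cosine similarity and falls back to a new group when nothing is close enough. The similarity measure, the 0.9 threshold and all names are assumptions, and `new_group_name` stands for a name chosen by the reviewer.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def resolve_noise(noise_name, noise_vec, groups, name_map, new_group_name,
                  threshold=0.9):
    """Merge a noise point into the closest group or open a new group."""
    best_group, best_sim = None, -1.0
    for group_name, vectors in groups.items():
        sim = cosine(noise_vec, centroid(vectors))
        if sim > best_sim:
            best_group, best_sim = group_name, sim

    if best_sim < threshold:
        # No related subset: create a new group with the reviewer's name.
        best_group = new_group_name
        groups[best_group] = []
    groups[best_group].append(noise_vec)
    name_map[noise_name] = best_group
    return best_group
```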

Manual checking of mapping relations: when the correctness of the mapping relations in the clustering result is checked, two similar standard field names or standard field contents are reviewed manually. If their semantics are the same and only the field names differ slightly, they are treated as compatible and given a unified name. If they are similar but semantically different, for example "registration date" and "change date", manual intervention is used to separate them, such as by adjusting the standard field name, the standard field content or the field data so that the two can be distinguished, and the mapping relation is determined manually to avoid confusion.

Manual intervention improves the accuracy of automatic mapping and ensures the integrity of the subsequently generated prediction model.

S3: obtaining all mapping relations in a mapping dictionary;

training a prediction model according to the existing mapping relation, and comprising the following steps:

the order of the mapping records is randomly shuffled, so that sample ordering does not affect the robustness and accuracy of the prediction model;

an existing Transformer model is fine-tuned, or a multi-classification model is built on top of a Transformer model. In this embodiment, Bert-Chinese-Base is adopted as the base prediction model, and it is trained with the mapping relations of the history field information as training content, that is, the features of the standard fields are extracted, to generate the prediction model Model-X. The mapping relations of noise points are entered after manual confirmation, which ensures the completeness of the mapping results of the trained Model-X.
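The data preparation that precedes fine-tuning can be sketched as follows. Only the shuffling and the conversion of group names into class ids are shown; the fine-tuning of the Transformer model itself is omitted. The function name, the fixed seed and the `(text, label_id)` record layout are assumptions for illustration.

```python
import random


def prepare_training_records(name_map, seed=42):
    """Turn the mapping dictionary into shuffled (text, label_id) records.

    name_map maps each field name to its standard field name (group name),
    which serves as the class label for multi-classification.
    """
    class_names = sorted(set(name_map.values()))
    label_to_id = {name: index for index, name in enumerate(class_names)}
    records = [(text, label_to_id[group]) for text, group in name_map.items()]
    # Shuffle so that the record order does not bias training.
    random.Random(seed).shuffle(records)
    return records, label_to_id
```

The resulting records would then be tokenized and passed to a sequence-classification training loop whose number of output classes equals `len(label_to_id)`.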

S4: newly added field information is acquired and input into the prediction model. If the generated prediction result is null, the newly added data is entered manually and a mapping relation is established to improve the prediction model Model-X. The consistency of the newly added field information with the existing standard field is confirmed manually:

if the manual confirmation result is consistent, the newly added field information is added to the mapping relation of the group name based on the mapping result;

if the manual confirmation result is inconsistent, or if the prediction score falls below a preset screening parameter (such as 50%), in which case the output is null, a new sample and mapping relation are established after the clustering result is obtained and the mapping result is manually entered into the mapping dictionary; once a certain number of new entries has accumulated, the prediction model is retrained for calibration and updated as Model-X.
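The decision flow of S4 can be sketched as a small class. It is an illustrative reading of the steps above, not the patented implementation: `MappingUpdater`, the `confirm` and `new_group` callbacks (standing in for the manual steps) and the `retrain_after` count are hypothetical, while the 0.5 threshold follows the 50% example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class MappingUpdater:
    threshold: float = 0.5   # preset screening parameter
    retrain_after: int = 3   # hypothetical number of new entries before retraining
    pending: list = field(default_factory=list)

    def predict_or_null(self, scores):
        """Return the best label, or None when the score is below threshold."""
        if not scores:
            return None
        label, score = max(scores.items(), key=lambda item: item[1])
        return label if score >= self.threshold else None

    def handle(self, field_name, scores, mapping_dict, confirm, new_group):
        """Process one new field; return (group name, retrain flag)."""
        predicted = self.predict_or_null(scores)
        if predicted is not None and confirm(field_name, predicted):
            # Consistent: merge into the existing group.
            mapping_dict[field_name] = predicted
            return predicted, False
        # Null or inconsistent: a reviewer supplies a new mapping.
        group = new_group(field_name)
        mapping_dict[field_name] = group
        self.pending.append((field_name, group))
        retrain = len(self.pending) >= self.retrain_after
        if retrain:
            self.pending.clear()  # the caller retrains Model-X now
        return group, retrain
```

Here `scores` is assumed to be a dictionary of class probabilities produced by the prediction model for the new field.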

The parts of the newly added field information that are consistent with the standard fields are merged automatically; for inconsistent newly added field information, a mapping relation is entered manually, a new group name is created, and the mapping relation is updated into the prediction model Model-X. As Model-X becomes more complete, so does the standard mapping data set, and verification of the mapping relation becomes more accurate.

The prediction model is trained on the mapping relations of the history field information, and the trained Model-X finally outputs the mapping verification result. The data clustering module groups mapping relations automatically, which saves a large amount of early-stage exploratory data analysis time; combined with manual confirmation, the data multi-classification module then generates Model-X. Manually confirming newly added field information for Model-X improves its mapping accuracy, saves the time cost of manual analysis and mining, and improves working efficiency.

Corresponding to the system and the method, the invention also provides a device for data mapping verification, comprising a processor and a memory, wherein the memory stores a computer program executable by the processor, and the processor implements the data mapping verification method when executing the computer program.

Finally, it should be noted that: the foregoing description is only illustrative of the preferred embodiments of the present invention, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements or changes may be made without departing from the spirit and principles of the present invention.

Claims

1. A system for data mapping verification, characterized by: comprising the following steps:

2. The system for data mapping verification of claim 1, wherein,

the history data and the newly added data each include a field name, a field content and field data,

the normalization module comprises:

3. A system for data mapping verification as defined in claim 1, wherein:

the clustering result includes a point cluster and noise points,

the data clustering module comprises:

4. A system for data mapping verification according to claim 3, wherein:

the data multi-classification module comprises:

5. A method for data mapping verification, characterized by comprising the following steps:

6. The method for data mapping verification of claim 5, wherein:

7. The method for data mapping verification of claim 5, wherein:

sample selection is carried out on the history field information, and the method further comprises the following steps before the sample set is constructed: vectorization processing is carried out on the history field information and the newly added field information respectively, and the method comprises the following steps:

acquiring a standard field name, standard field content and field data;

merging the standard field name, the standard field content and the field data into a fusion character string;

and carrying out vectorization processing on the character string.

8. The method of data mapping verification of claim 6, wherein:

the clustering result includes a point cluster and noise points,

building the mapping dictionary includes:

manual intervention, providing manual operation during data verification;

data verification, verifying the mapping relation in the feature mapping:

9. The method for data mapping verification of claim 5, wherein:

extracting and fusing the features of the mapping relation, and training a prediction model comprises:

10. A device for data mapping verification, characterized by comprising: a processor and a memory storing a computer program executable by the processor, the processor implementing the method of any one of claims 5-9 when executing the computer program.