CN117390170B

CN117390170B - Method and device for matching data standards, electronic equipment and readable storage medium

Info

Publication number: CN117390170B
Application number: CN202311699178.5A
Authority: CN
Inventors: 刘晨; 郑保卫; 刘迪
Original assignee: Encore Beijing Information Technology Co ltd
Current assignee: Encore Beijing Information Technology Co ltd
Priority date: 2023-12-12
Filing date: 2023-12-12
Publication date: 2024-03-08
Anticipated expiration: 2043-12-12
Also published as: CN117390170A

Abstract

The embodiments of the application provide a data standard alignment method and device, electronic equipment and a readable storage medium, and relate to the field of data processing. The method comprises: obtaining a to-be-aligned database and a preset standard field library, wherein the standard field library comprises a plurality of standard fields; extracting a data dictionary from the to-be-aligned database; classifying each standard field by using a preset target classification model to obtain a plurality of standard phrase libraries; classifying fields in the data dictionary by using the target classification model to obtain a plurality of to-be-aligned fields; combining each to-be-aligned field with the standard phrase library belonging to the same category to obtain a plurality of field prompt words; and inputting each field prompt word into a preset text conversion model for alignment, and outputting an alignment result corresponding to each to-be-aligned field. The embodiments of the application can improve the accuracy of data alignment.

Description

Method and device for matching data standards, electronic equipment and readable storage medium

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a data standard alignment method and apparatus, an electronic device, and a readable storage medium.

Background

Field standardization is a core problem in database management and data analysis. As databases and data applications multiply, the naming and classification of data become increasingly complex, and unified, normalized, and standardized field naming helps ensure the integrity, consistency, and accuracy of data. In many industries, particularly finance, healthcare, and retail, consistent and standardized data is critical to accurate analysis and decision making, and data alignment (mapping database fields to standard fields) is the most important link in that process. However, alignment work has long suffered from a huge workload, a lack of effective automation, tedious maintenance, and poor operability, so it has been done manually, at high labor cost. Artificial intelligence or deep learning schemes can reduce this manual workload. However, most such methods rely on semantic similarity matching, which recommends the standard field most similar to the original field name. When a database field matches no standard field, these methods still recommend the most similar option, so the recommended alignment is inaccurate.

Disclosure of Invention

The application provides a data standard alignment method, a data standard alignment device, electronic equipment and a readable storage medium, which can improve the accuracy of data alignment.

The technical scheme of the embodiment of the application is as follows:

in a first aspect, an embodiment of the present application provides a method for aligning data standards, where the method includes:

obtaining a database to be aligned and a preset standard field library, wherein the standard field library comprises a plurality of standard fields;

extracting a data dictionary from the database to be aligned;

classifying each standard field by using a preset target classification model to obtain a plurality of standard phrase libraries;

classifying fields in a data dictionary by using the target classification model to obtain a plurality of to-be-aligned fields;

combining the to-be-aligned fields belonging to the same category with the standard phrase library to obtain a plurality of field prompt words;

inputting each field prompt word into a preset text conversion model for alignment, and outputting an alignment result corresponding to each to-be-aligned field.

In the above technical scheme, a to-be-aligned database and a preset standard field library comprising a plurality of standard fields are first acquired, which provides the data for the subsequent alignment. A data dictionary is extracted from the to-be-aligned database to support field classification. Each standard field is classified by using a preset target classification model to obtain a plurality of standard phrase libraries, and the fields in the data dictionary are classified by using the same target classification model to obtain a plurality of to-be-aligned fields; classifying both the standard fields and the data dictionary fields reduces the computation of the subsequent matching and improves matching accuracy. Each to-be-aligned field is combined with the standard phrase library belonging to the same category to obtain a plurality of field prompt words. Each field prompt word is input into a preset text conversion model for alignment, and the alignment result corresponding to each to-be-aligned field is output; combining the field prompt words with the text conversion model allows alignment options to be output accurately and improves the accuracy of data alignment.

In some embodiments of the present application, the extracting the data dictionary from the to-be-aligned database includes:

according to a preset data table query statement, querying the data tables in the to-be-aligned database to obtain a plurality of query fields;

and taking the query term in the data table query statement as a key name, and storing the query field corresponding to the query term as a key value to obtain the data dictionary.

In the technical scheme, the mapping relation between the query term and the query field can be obtained by constructing the data dictionary, and support is provided for field classification in the follow-up process.

In some embodiments of the present application, the combining each to-be-aligned field with the standard phrase library belonging to the same category to obtain a plurality of field prompt words includes:

determining field association information corresponding to each to-be-aligned field according to the category of each to-be-aligned field and the data dictionary;

and adding the to-be-aligned fields, the standard phrase library and the field association information belonging to the same category into a preset prompt library to obtain the field prompt words corresponding to the to-be-aligned fields.

In the technical scheme, each to-be-aligned field is combined with the standard phrase library of the same category to form its field prompt word, which facilitates the subsequent prediction output for that field.

In some embodiments of the present application, before classifying each of the standard fields by using a preset target classification model to obtain a plurality of standard phrase libraries, the method further includes:

acquiring a training standard library with labels, wherein the training standard library comprises a plurality of training data, target class data corresponding to each training data and preset classification standards;

classifying each training data by using a preset initial classification model to obtain a plurality of prediction category data;

calculating according to each prediction category data and each target category data to obtain classification accuracy;

judging whether the classification accuracy accords with the preset classification standard or not to obtain a judgment result;

and generating the target classification model under the condition that the judging result accords with the preset classification standard.

In the technical scheme, the training data is classified by using the initial classification model, the classification accuracy is calculated, and the target classification model is generated under the condition that the classification accuracy accords with the classification standard, so that the target classification model can have better classification accuracy.

In some embodiments of the present application, after the determining whether the classification accuracy meets the preset classification standard, the method further includes:

in a case where the judgment result does not meet the preset classification standard, adjusting the parameters of the initial classification model by using the classification accuracy, and then returning to the step of acquiring the training standard library with labels, until the judgment result meets the preset classification standard and the target classification model is generated.

In the technical scheme, under the condition that the classification accuracy does not accord with the classification standard, training is continued on the initial classification model until the classification accuracy accords with the classification standard to generate the target classification model, so that the target classification model can have better classification accuracy.
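The evaluate, judge, and adjust cycle described in the two embodiments above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: `KeywordModel`, `accuracy`, `train_until_standard`, the `max_rounds` guard, and the sample data are all hypothetical, and a toy keyword classifier stands in for the initial classification model.

```python
from typing import Dict, Sequence, Tuple


def accuracy(predicted: Sequence[str], target: Sequence[str]) -> float:
    """Share of predicted categories that equal the labeled target categories."""
    if not target:
        return 0.0
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


class KeywordModel:
    """Toy stand-in for the initial classification model (hypothetical)."""

    def __init__(self) -> None:
        self.keywords: Dict[str, str] = {}

    def predict(self, text: str) -> str:
        for keyword, category in self.keywords.items():
            if keyword in text:
                return category
        return "unknown"

    def adjust(self, texts: Sequence[str], targets: Sequence[str]) -> None:
        # Stand-in for parameter adjustment: learn one rule from the first error.
        for text, target in zip(texts, targets):
            if self.predict(text) != target:
                self.keywords[text.split()[0]] = target
                return


def train_until_standard(
    model: KeywordModel,
    texts: Sequence[str],
    targets: Sequence[str],
    standard: float,
    max_rounds: int = 100,  # assumption: a safety bound the source does not mention
) -> Tuple[KeywordModel, float, int]:
    for round_no in range(1, max_rounds + 1):
        predicted = [model.predict(text) for text in texts]
        acc = accuracy(predicted, targets)
        if acc >= standard:           # judgment result meets the standard
            return model, acc, round_no
        model.adjust(texts, targets)  # otherwise adjust and evaluate again
    raise RuntimeError("classification standard not reached")


texts = ["deposit account", "deposit amount", "loan amount", "transaction platform"]
targets = ["deposit", "deposit", "loan", "platform"]
target_model, final_accuracy, rounds = train_until_standard(
    KeywordModel(), texts, targets, standard=0.9
)
print(final_accuracy, rounds)
```

The loop returns the model only once the measured accuracy reaches the preset standard, which mirrors the condition under which the target classification model is generated.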

In some embodiments of the present application, before the inputting each field prompt word into a preset text conversion model for alignment and outputting an alignment result, the method further includes:

acquiring a plurality of classified training standard fields and a training standard field library;

combining the training standard field and the training standard field library belonging to the same category to obtain training field prompt words corresponding to the training standard fields;

and carrying out parameter adjustment on a preset initial text conversion model based on each training field prompt word to obtain the text conversion model.

According to the technical scheme, the initial text conversion model is trained on the training field prompt words to obtain the text conversion model, so that the predictions output by the text conversion model are more accurate and the accuracy of the output alignment result is improved.

In some embodiments of the present application, the alignment result includes a prediction field and a null field;

the inputting each field prompt word into a preset text conversion model for alignment, and outputting an alignment result corresponding to each to-be-aligned field includes:

inputting each field prompt word into the preset text conversion model for alignment, and outputting the null field in a case where the to-be-aligned field does not match the standard phrase library;

and outputting the prediction field in a case where the to-be-aligned field matches the standard phrase library.

In the technical scheme, outputting a null field is supported, so a field that matches no standard phrase is not forced onto the nearest option, and the output is more accurate.
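How a caller might consume a result that is either a prediction field or a null field can be sketched as follows. This is only an illustration: the source does not specify how the null field is encoded, so the `NULL_MARKERS` spellings, the function name `resolve_alignment`, and the extra check against the phrase library are all assumptions.

```python
from typing import Optional, Sequence

# Hypothetical spellings of the null field; the source does not define an encoding.
NULL_MARKERS = {"", "null", "none"}


def resolve_alignment(model_output: str, phrase_library: Sequence[str]) -> Optional[str]:
    """Return the prediction field, or None to represent the null field."""
    candidate = model_output.strip()
    if candidate.lower() in NULL_MARKERS:
        return None
    # Guard (an assumption, not from the source): reject phrases outside the library.
    return candidate if candidate in phrase_library else None


library = ["deposit account", "deposit amount"]
print(resolve_alignment("deposit account", library))
print(resolve_alignment("null", library))
```

Representing the null field as `None` keeps "no standard field applies" distinct from any real standard phrase, in contrast to a top-1 similarity match, which always returns some phrase.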

In a second aspect, an embodiment of the present application provides a device for aligning data standards, where the device includes:

The data acquisition module is used for acquiring a database to be aligned and a preset standard field library, wherein the standard field library comprises a plurality of standard fields;

the data extraction module is used for extracting a data dictionary from the database to be aligned;

the first classification module is used for classifying each standard field by utilizing a preset target classification model to obtain a plurality of standard phrase libraries;

the second classification module is used for classifying the fields in the data dictionary by utilizing the target classification model to obtain a plurality of to-be-aligned fields;

the data combination module is used for combining the to-be-aligned fields belonging to the same category with the standard phrase library to obtain a plurality of field prompt words;

and the alignment processing module is used for inputting each field prompt word into a preset text conversion model for alignment and outputting an alignment result corresponding to each to-be-aligned field.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a user interface, and a network interface, where the memory is configured to store instructions, and the user interface and the network interface are configured to communicate with other devices, and the processor is configured to execute the instructions stored in the memory, so that the electronic device performs the method provided in any one of the first aspect above.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing instructions that, when executed, perform the method of any one of the first aspects provided above.

In summary, one or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:

1. The standard fields and the fields in the data dictionary are classified by using the target classification model, each to-be-aligned field is then combined with the standard phrase library of its category, and the result is input into the text conversion model. This effectively addresses the problem in the related art that, when a database field matches no standard field, the option with the highest similarity is still recommended and the recommended alignment is therefore inaccurate. In the embodiments of the application, classification reduces the amount of computation and improves matching accuracy, and combining the field prompt words with the text conversion model yields more accurate alignment options.

2. The output of the null field is supported, more selection space is provided for the data field, and the output is more accurate.

3. Because the preset standard field library is divided into categories, when standard fields in the library change, only the standard fields of the affected categories need to be updated rather than the field representations of all categories, which reduces the amount of computation.

Drawings

FIG. 1 is a flow chart of a data standard alignment method provided in one embodiment of the present application;

FIG. 2 is a schematic flow chart showing the sub-steps of step S120 in FIG. 1;

FIG. 3 is a schematic flow chart showing the sub-steps of step S150 in FIG. 1;

FIG. 4 is a flow chart of a data standard alignment method provided in accordance with another embodiment of the present application;

FIG. 5 is a flow chart of a data standard alignment method provided in accordance with yet another embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data standard alignment device according to one embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present application, but not all embodiments.

In the description of embodiments of the present application, words such as "exemplary" or "for example" are used to indicate examples, illustrations or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete fashion.

In the description of the embodiments of the present application, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of screen terminals means two or more screen terminals. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.

First, terms related to the present application will be described.

Standard field: the normalized naming of the Chinese names of database fields. It is typically obtained by extracting a data dictionary from a database and examining the Chinese names. A standard field is a collection of standard phrases intended to unify and normalize the naming of data.

Data dictionary: it is a directory or list that stores all of the object information in the database, such as tables, columns, data types, and indexes, and a data dictionary is commonly used for database management and maintenance.

Alignment: the process of comparing an enterprise's full data dictionary with the standard fields, in order to check whether the Chinese name of each field conforms to the description of a standard field and, if so, to establish a mapping relationship.

BERT model: BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep learning model used for a variety of natural language processing tasks. Through pre-training on large amounts of text, BERT can capture complex relationships between words and can then be fine-tuned for a particular task. BERT performs well on text tasks but has an inherent input length limit, typically 512 tokens. When alignment is performed directly across the whole standard field library and all field names, the amount of data may exceed this limit. Classification-based screening restricts attention to a smaller, more targeted subset of the data, so that the BERT model can process the text within its length limit.

T5 (Text-to-Text Transfer Transformer): a pre-trained deep learning model designed for text-to-text tasks such as translation, summarization, and question answering. T5 treats every natural language processing task as converting text into text, which simplifies and unifies the model through a single input and output format.

Prompt: in this context, a prompt is a designed question or instruction that directs a model to perform a particular task, such as recommending a standard field phrase.

MASK: in deep learning models, particularly BERT and T5, MASK generally refers to a position in the input that is occluded or hidden and that the model is asked to predict.

The embodiments of the application provide a data standard alignment method and device, electronic equipment and a readable storage medium. The method first acquires a to-be-aligned database and a preset standard field library comprising a plurality of standard fields, which provides the data for the subsequent alignment. A data dictionary is extracted from the to-be-aligned database to support field classification. Each standard field is classified by using a preset target classification model to obtain a plurality of standard phrase libraries, and the fields in the data dictionary are classified by using the same target classification model to obtain a plurality of to-be-aligned fields; classifying both the standard fields and the data dictionary fields reduces the computation of the subsequent matching and improves matching accuracy. Each to-be-aligned field is combined with the standard phrase library belonging to the same category to obtain a plurality of field prompt words. Each field prompt word is input into a preset text conversion model for alignment, and the alignment result corresponding to each to-be-aligned field is output; combining the field prompt words with the text conversion model allows alignment options to be output accurately and improves the accuracy of data alignment. In the prior art, when a database field matches no standard field, the option with the highest similarity is still recommended, so the recommended alignment is inaccurate. By contrast, the present application reduces the amount of computation and improves matching accuracy through classification, and combines the field prompt words with the text conversion model to output more accurate alignment options.

It should be noted that the data standard alignment method can be applied in the finance, healthcare, and retail industries, among others. In database management and data analysis it helps unify, normalize, and standardize data field naming, thereby helping ensure the integrity, consistency, and accuracy of data.

The technical scheme provided by the embodiment of the application is further described below with reference to the accompanying drawings.

Referring to fig. 1, fig. 1 is a flow chart of a data standard alignment method according to an embodiment of the present application. The alignment method is applied to the alignment device and is executed by the electronic device, or by a processor executing the instructions stored in the readable storage medium, and includes steps S110, S120, S130, S140, S150 and S160.

Step S110, a database to be aligned and a preset standard field library are obtained, wherein the standard field library comprises a plurality of standard fields.

In one embodiment, the to-be-aligned database is formed by collecting, organizing, and storing data from an industry, such as banking or retail. The to-be-aligned database is first obtained through a preset database reading function; for example, a database package can be imported in a Python program to connect to and read the database. The to-be-aligned database comprises a plurality of data tables; each data table has a table name and comprises a plurality of table records, so the table names and table records can later be read from the database to build the data dictionary. The preset standard field library is set by professionals according to industry vocabulary or commonly used general vocabulary and comprises a plurality of standard fields, each of which records a unified name. Illustratively, in banking, standard field Chinese names include "deposit account", "loan amount", "credit rating", and "transaction platform". Obtaining the to-be-aligned database and the preset standard field library supports the subsequent data alignment.

It should be noted that the data in the to-be-aligned database and the preset standard field library belong to the same industry.

Step S120, extracting a data dictionary from the to-be-aligned database.

In an embodiment, the to-be-aligned database may use structured or unstructured storage. Depending on its storage type, the database is queried in the corresponding way to obtain the data dictionary: a structured query mode for structured storage, and an unstructured query mode for unstructured storage. By analyzing the to-be-aligned database, the data dictionary can include all fields of the database and form a mapping relationship among them, which supports the subsequent classification. The following description takes a to-be-aligned database with structured storage as an example.

As shown in fig. 2, extracting the data dictionary from the to-be-aligned database includes, but is not limited to, the following steps:

Step S121, according to a preset data table query statement, querying the data tables in the to-be-aligned database to obtain a plurality of query fields.

In some possible embodiments of the present application, the preset data table query statement may be a structured query language (SQL) statement. The statement queries the system tables or views in which the database stores table metadata, and the target of each query serves as a query term: for example, the table name or the table field name of a data table. The query results, namely the specific table names and table field names, serve as the query fields, so that a plurality of query fields is obtained. Querying different query terms through data table query statements yields the corresponding query fields, from which the data dictionary can be constructed.

Illustratively, querying the table name through a data table query statement returns the specific table names: the table name is the query term, and each specific table name is a query field. Querying the column field information of a table returns the details of all its columns: the column field information is the query term, and the returned details are the query fields. Querying the field type length returns the type information of the table's fields: the field type length is the query term, and the type information is the query field. Other information can be queried in the same way and is not described further here; the data dictionary is subsequently constructed from this information.

Step S122, the query items in the data table query statement are used as key names, and the query fields corresponding to the query items are used as key values to be stored, so that the data dictionary is obtained.

In some possible embodiments of the present application, because key-value pairs express the correspondence between keys and values, the query terms in the data table query statements are used as key names, and the query fields obtained in step S121 for those query terms are stored as key values, to obtain the data dictionary. In this way each query term corresponds to its query fields, and all query fields in the to-be-aligned database can be obtained through the queries above and then managed and maintained. Illustratively, the data dictionary may be expressed as {table English name: target table English, table Chinese name: target table Chinese, field English name: target field English, field Chinese name: user deposit account, field type length: target field type length}.
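Steps S121 and S122 can be sketched with Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions: SQLite is only a stand-in for the to-be-aligned database, the function name `extract_data_dictionary`, the key names, and the sample table are hypothetical, and the Chinese table and field names from the example above are omitted because SQLite keeps no column comments.

```python
import sqlite3
from typing import Dict, List


def extract_data_dictionary(conn: sqlite3.Connection) -> List[Dict[str, str]]:
    """Query table and column metadata and store it as key-value entries."""
    entries: List[Dict[str, str]] = []
    # Query term "table name": SQLite lists user tables in sqlite_master.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # Query terms "field name" and "field type length".
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        for _cid, column_name, column_type, *_rest in columns:
            entries.append(
                {
                    "table_name": table_name,         # key name: query term
                    "field_name": column_name,        # key value: query field
                    "field_type_length": column_type,
                }
            )
    return entries


# Hypothetical sample database standing in for the to-be-aligned database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_deposit (acct_no VARCHAR(32), balance REAL)")
data_dictionary = extract_data_dictionary(conn)
for entry in data_dictionary:
    print(entry)
```

Other database engines expose the same metadata through their own system views, so the query statements would differ while the key-value structure stays the same.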

Step S130, classifying each standard field by using a preset target classification model to obtain a plurality of standard phrase libraries.

In an embodiment, the preset target classification model may be a bidirectional encoder, that is, a BERT model or a variant of the BERT model. The preset target classification model is a trained model. The standard fields are classified with the BERT model according to the data standard requirements and the preset categories to obtain a plurality of standard phrase libraries. One standard phrase library corresponds to one category and comprises a plurality of standard phrases, all of which belong to that category. This both keeps the input within the processing limits of the BERT model and yields standard phrase libraries of different categories. Because the standard phrases are divided by category, when standard fields are updated only the standard fields of the affected categories need to be updated rather than the field representations of all categories, which reduces the amount of computation and improves management efficiency; it also allows the subsequent alignment to proceed category by category, which further saves computation. The preset categories are set in advance by professionals for each industry.

Illustratively, banking standard fields are diverse; eight major categories (organization, product, contract, transaction, asset, finance, channel, and public), with more than 100 subcategories in total, are selected for classification and integration. The standard fields may be "deposit account", "deposit amount", "transaction platform", and so on. Through classification, "deposit account" and "deposit amount" may be assigned to the deposit subcategory under the product category, and "transaction platform" may be assigned to the platform subcategory under the public category. Correspondingly, the deposit subcategory is a standard phrase library whose standard phrases are "deposit account" and "deposit amount"; the platform subcategory is a standard phrase library whose standard phrase is "transaction platform".
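The grouping of standard fields into per-category standard phrase libraries can be sketched as follows. This is only an illustration: `toy_classifier` is a hypothetical keyword rule standing in for the trained BERT classifier, and the category labels are made up for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def toy_classifier(field_name: str) -> str:
    """Keyword rules standing in for the trained classifier (hypothetical)."""
    if "deposit" in field_name:
        return "product/deposit"
    if "platform" in field_name:
        return "public/platform"
    return "uncategorized"


def build_phrase_libraries(
    standard_fields: Iterable[str], classify: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group standard fields so that one category maps to one phrase library."""
    libraries: Dict[str, List[str]] = defaultdict(list)
    for field_name in standard_fields:
        libraries[classify(field_name)].append(field_name)
    return dict(libraries)


standard_fields = ["deposit account", "deposit amount", "transaction platform"]
phrase_libraries = build_phrase_libraries(standard_fields, toy_classifier)
print(phrase_libraries)
```

Because each library is keyed by its category, adding or changing a standard field touches only the list for that category.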

Step S140, classifying the fields in the data dictionary by using the target classification model to obtain a plurality of to-be-aligned fields.

In an embodiment, the target classification model may be a bidirectional encoder, that is, a BERT model or a variant of the BERT model. To classify the fields in the data dictionary, the key values are extracted from the data dictionary according to its storage form, and the fields corresponding to the key values are classified with the BERT model according to the preset categories to obtain a plurality of to-be-aligned fields. The classification process is similar to step S130 and is not repeated here. Classifying the to-be-aligned fields makes it straightforward to combine them with the standard phrase libraries and reduces the amount of computation.

Step S150, combining each to-be-aligned field belonging to the same category with the standard phrase library to obtain a plurality of field prompt words.

In one embodiment, because the to-be-aligned fields and the standard phrase libraries come from the same industry, they share the same preset categories. Once both have been classified, the to-be-aligned fields and the standard phrase library that belong to the same category can be grouped together. Because there are multiple to-be-aligned fields, each one is combined separately with the standard phrase library of its category to obtain a plurality of field prompt words. The field prompt words are subsequently fed to the text conversion model, which improves the accuracy of the output.

As shown in fig. 3, combining each to-be-aligned field belonging to the same category with the standard phrase library to obtain a plurality of field prompt words includes, but is not limited to, the following steps:

Step S151, determining the field association information corresponding to each to-be-aligned field according to the category of each to-be-aligned field and the data dictionary.

In some possible embodiments of the present application, the to-be-aligned field is looked up in the data dictionary according to its category. Once the field is located, the related items such as the table name and the field type can be determined, and the key values of those items in the data dictionary are used as the field association information. This information is later added to the field prompt word, which improves prediction accuracy. Illustratively, the data dictionary is {table English name: target table English, table Chinese name: target table Chinese, field English name: target field English, field Chinese name: user deposit account, field type length: target field type length}, and the target to-be-aligned field is user deposit account. The items related to user deposit account, such as the table English name, the table Chinese name and the field type length, are selected from the data dictionary, and the key values corresponding to them are looked up to obtain the field association information: target table English, target table Chinese, target field type length, and so on.
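As a minimal sketch of this lookup, assuming the data dictionary is held as a list of Python dictionaries with one entry per field: the key names below are hypothetical English renderings of the keys in the example, and treating every key other than the field's own name as association information is an assumption of this sketch.

```python
# Hypothetical data dictionary; the values mirror the placeholders used in the example.
DATA_DICTIONARY = [
    {
        "table_english_name": "target table English",
        "table_chinese_name": "target table Chinese",
        "field_english_name": "target field English",
        "field_chinese_name": "user deposit account",
        "field_type_length": "target field type length",
    },
]


def field_association_info(field_name, data_dictionary):
    """Return the key values that accompany `field_name`, or {} if it is absent."""
    for entry in data_dictionary:
        if entry.get("field_chinese_name") == field_name:
            # Assumption: everything except the field's own name is association info.
            return {k: v for k, v in entry.items() if k != "field_chinese_name"}
    return {}


info = field_association_info("user deposit account", DATA_DICTIONARY)
```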

Step S152, adding the to-be-aligned fields, the standard phrase library and the field association information belonging to the same category to a preset prompt library to obtain field prompt words corresponding to each to-be-aligned field.

In some possible embodiments of the present application, because the field association information and the to-be-aligned field belong to the same category, only the to-be-aligned fields, the standard phrase library and the field association information of that category are added to the preset prompt library. No combination with other categories is needed, which reduces the amount of computation. The preset prompt library may be a list or a tuple. The to-be-aligned fields, the standard phrase library and the field association information may be added to the prompt library according to a list rule, or alternatively according to a tuple rule, to obtain the field prompt word, that is, the prompt, corresponding to each to-be-aligned field. The prompt is then used to guide the text conversion model in the benchmarking task.
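One possible way to assemble such a prompt library is sketched below. The prompt template, including the `[MASK]` slot and the separators, is a hypothetical layout chosen for illustration; the application does not specify the exact wording of the prompt.

```python
def build_field_prompt(field, standard_phrases, association_info):
    """Assemble one prompt for one to-be-aligned field (hypothetical template)."""
    parts = [
        f"field: {field}",
        "context: " + "; ".join(f"{k}={v}" for k, v in association_info.items()),
        "standard phrases: " + ", ".join(standard_phrases),
        "standard name: [MASK]",
    ]
    return " | ".join(parts)


def build_prompt_library(fields_by_category, phrases_by_category, info_by_field):
    """Combine each field only with the standard phrase library of its own category."""
    prompt_library = []  # list rule; tuple(prompt_library) would give the tuple form
    for category, fields in fields_by_category.items():
        phrases = phrases_by_category.get(category, [])
        for field in fields:
            info = info_by_field.get(field, {})
            prompt_library.append(build_field_prompt(field, phrases, info))
    return prompt_library


prompts = build_prompt_library(
    {"product/storage": ["user deposit account"],
     "public/platform": ["trading platform name"]},
    {"product/storage": ["deposit account", "deposit line"],
     "public/platform": ["transaction platform"]},
    {"user deposit account": {"table": "target table English"}},
)
```

Because the loop never crosses categories, the number of prompts grows with the number of fields rather than with the number of field and library pairs.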

Step S160, inputting each field prompt word into a preset text conversion model for benchmarking, and outputting the benchmarking result corresponding to each to-be-aligned field.

In an embodiment, the preset text conversion model may be a trained T5 model. The field prompt word is input into the T5 model, which predicts the MASK position according to the instruction in the prompt and outputs the recommended phrase as the benchmarking result. Guided by the prompt, the T5 model can output recommended phrases accurately, which improves the accuracy of data benchmarking.

Illustratively, the data dictionary is {table English name: target table English, table Chinese name: target table Chinese, field English name: target field English, field Chinese name: user deposit account, field type length: target field type length}. The user deposit account is marked as MASK and the vocabulary at that position is predicted. The prompt includes the to-be-aligned field, the field association information and the standard phrase library, and is used to guide T5 in predicting the MASK position. User deposit account is closest to deposit account, and deposit account exists in the standard phrase library, so the user deposit account at that position is replaced with deposit account and deposit account is output as the prediction result. This unifies the data and makes it easier to manage and maintain.

In an embodiment, the benchmarking result includes a prediction field and a null field. Each field prompt word is input into the preset text conversion model for benchmarking, and the benchmarking result corresponding to each to-be-aligned field is output.

In another embodiment, when the to-be-aligned field matches the standard phrase library, that is, when the field is close to a standard expression in the standard phrase library, the prediction field is output, so that the prediction is accurate and the data can be expressed uniformly. Whether a to-be-aligned field is close or similar to a standard phrase in the standard phrase library is judged as follows: the similarity between the two is calculated with a character string similarity algorithm and compared with a set threshold, and the two are considered close or similar when the similarity exceeds the threshold. When no standard phrase in the standard phrase library has a similarity to the to-be-aligned field that exceeds the threshold, a null field is output. When two or more standard phrases exceed the threshold, the standard phrase with the highest similarity is taken. This ensures that accurate benchmarking option data is output.
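The threshold judgment can be sketched as follows, under the reading that a null field is output when no standard phrase is similar enough. The application does not name a particular string similarity algorithm; `difflib.SequenceMatcher` from the Python standard library and the threshold of 0.6 are assumptions of this sketch, and `None` stands for the null field.

```python
from difflib import SequenceMatcher


def benchmark_field(field, standard_phrases, threshold=0.6):
    """Return the most similar standard phrase, or None (null field) if none is close."""
    best_phrase, best_score = None, 0.0
    for phrase in standard_phrases:
        score = SequenceMatcher(None, field, phrase).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    # Only a similarity strictly above the threshold counts as a match.
    return best_phrase if best_score > threshold else None


library = ["deposit account", "deposit line", "transaction platform"]
matched = benchmark_field("user deposit account", library)  # "deposit account"
unmatched = benchmark_field("zzzz", library)                 # None
```

Returning the null field instead of the nearest candidate is what distinguishes this judgment from plain highest-similarity recommendation, which always returns something.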

As shown in fig. 4, before classifying each standard field by using a preset target classification model to obtain a plurality of standard phrase libraries, the method for matching data standards further includes, but is not limited to, the following steps:

Step S210, obtaining a labeled training standard library, wherein the training standard library comprises a plurality of training data, target category data corresponding to each training data, and a preset classification standard.

In an embodiment, the training standard library may be built by collecting and organizing data from each industry, having professionals name the data according to industry rules, labeling the data manually, and storing the result. The relevant industry data may be obtained through crawler technology and then sorted, named and labeled by category before storage. The labeled training standard library is then read with a preset database reading function. It includes a plurality of training data, the target category data corresponding to each training data, and a preset classification standard, which are used to train the classification model.

Step S220, classifying each training data by using a preset initial classification model to obtain a plurality of prediction category data.

In an embodiment, the preset initial classification model may be a bidirectional encoder, that is, a BERT model or a variant of the BERT model. Each training data is classified by the BERT model according to the data benchmarking requirements and the target category data to obtain a plurality of prediction category data, which are later used to calculate the accuracy and adjust the BERT model.

Step S230, calculating and obtaining classification accuracy according to each prediction type data and target type data.

In an embodiment, the accuracy is calculated from the prediction category data obtained in step S220 and the target category data. The degree of similarity between each prediction category data and its target category data may be calculated with a preset character string similarity, and the overall classification accuracy is then obtained by weighted summation, averaging or the like. The classification accuracy is a numerical value, so it can later be used to decide whether to generate the target classification model. The classification accuracy may also be calculated with other text similarity measures such as Euclidean distance, which are not described here.
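A minimal sketch of the averaging variant is given below. It assumes the category data are plain strings and uses `difflib.SequenceMatcher` as the character string similarity; both choices are assumptions made for illustration, and a weighted sum could be substituted for the plain mean.

```python
from difflib import SequenceMatcher


def classification_accuracy(predicted, target):
    """Mean string similarity between predicted and target category labels."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        return 0.0
    scores = [SequenceMatcher(None, p, t).ratio() for p, t in zip(predicted, target)]
    return sum(scores) / len(scores)


perfect = classification_accuracy(["product", "public"], ["product", "public"])
half = classification_accuracy(["product", "abc"], ["product", "xyz"])
```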

Step S240, judging whether the classification accuracy accords with a preset classification standard, and obtaining a judgment result.

In an embodiment, it is determined whether the classification accuracy obtained in step S230 meets a preset classification standard. The preset classification standard is set by experts according to experience and suits the classification needs of each industry. The preset classification standard is a numerical value, so the judgment result may be obtained by comparing the classification accuracy with the classification standard directly. Alternatively, the difference between the classification accuracy and the classification standard may be calculated and compared with zero. The judgment result is used to decide whether to generate the target classification model.

Step S250, generating a target classification model under the condition that the judgment result meets the preset classification standard.

In an embodiment, when the judgment result is that the preset classification standard is met, that is, the classification accuracy is smaller than the classification standard, a target classification model is generated for classifying the standard field.

As shown in fig. 4, after determining whether the classification accuracy meets the preset classification standard, and obtaining the determination result, the method for matching the data standard further includes, but is not limited to, the following steps:

Step S260, when the judgment result is that the preset classification standard is not met, adjusting the parameters of the initial classification model by using the classification accuracy, and then executing again the step of obtaining the labeled training standard library, until the judgment result is that the preset classification standard is met and the target classification model is generated.

In an embodiment, when the judgment result is that the preset classification standard is not met, that is, when the classification accuracy is greater than or equal to the classification standard, the classification accuracy is used as the loss function and the parameters of the initial classification model are adjusted through back propagation. After the adjustment, the following steps are executed again: obtaining the labeled training standard library; classifying each training data by using the initial classification model to obtain a plurality of prediction category data; calculating the classification accuracy from each prediction category data and the target category data; and judging whether the classification accuracy meets the preset classification standard to obtain a judgment result. Training continues according to the judgment result until the preset classification standard is met, that is, until the classification accuracy is smaller than the classification standard, at which point the target classification model is generated for subsequently classifying the standard fields.

In another embodiment, the target classification model obtained through training and parameter adjustment is also used to classify the fields in the data dictionary, so the model does not need to be trained again, which reduces the amount of computation.
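The control flow of steps S210 to S260 can be sketched as a loop. The sketch follows the condition as stated in the text, where the classification accuracy is treated like a loss value and training stops once it falls below the classification standard. The `evaluate` and `adjust` callables and the toy model that halves its score on each adjustment are hypothetical stand-ins; no real back propagation is performed here.

```python
def train_until_standard(evaluate, adjust, classification_standard, max_rounds=100):
    """Repeat evaluate/adjust until the score is below the classification standard.

    Returns the number of parameter adjustments that were needed.
    """
    for adjustments in range(max_rounds):
        score = evaluate()                    # steps S210 to S230
        if score < classification_standard:   # step S240, condition as stated in the text
            return adjustments                # step S250: keep the current model
        adjust(score)                         # step S260: parameter adjustment
    raise RuntimeError("classification standard not reached")


# Toy stand-in for a model: each adjustment halves the score.
state = {"score": 1.0}


def evaluate():
    return state["score"]


def adjust(score):
    state["score"] = score / 2


rounds = train_until_standard(evaluate, adjust, classification_standard=0.2)
```

The `max_rounds` guard is an addition of this sketch so that the loop cannot run forever if the standard is never reached.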

As shown in fig. 5, before inputting each field prompt word into the preset text conversion model for benchmarking and outputting the benchmarking result, the benchmarking method of the data standard further includes, but is not limited to, the following steps:

Step S310, acquiring a plurality of classified training benchmarking fields and a training standard field library.

In some embodiments of the present application, data may be collected and organized for each industry in the same way as for the training standard library of step S210, except that the collected data is not named by professionals. It is arranged in the form of data tables and stored as a database. The database is obtained in a manner similar to step S210, and the data dictionary is obtained from the database as described in step S120, which is not repeated here. The fields in the data dictionary are classified by the target classification model obtained after parameter adjustment to yield a plurality of classified training benchmarking fields. The classified training standard field library is the manually labeled target category data, and can be obtained by reading the training standard library and extracting the target category data. The classified training benchmarking fields and the training standard field library are then used to train the initial text conversion model.

Step S320, combining the training benchmarking fields belonging to the same category with the training standard field library to obtain training field prompt words corresponding to the training benchmarking fields.

In some embodiments of the present application, each training benchmarking field classified in step S310 may be combined with the training standard field library of the same category. Specifically, each training benchmarking field and the training standard field library are added to a preset prompt library to obtain the training field prompt word corresponding to each training benchmarking field, which is then used to train the text conversion model.

In some possible embodiments of the present application, the training benchmarking field is looked up in the data dictionary according to its category. Once the field is located, the information related to it can be determined, and the key values of that information in the data dictionary are used as the field association information. Because the field association information and the training benchmarking field belong to the same category, the training benchmarking field, the field association information and the training standard field library are added to the preset prompt library to obtain the training field prompt words, which are then used to train the text conversion model.

It should be noted that adding the training benchmarking field and the training standard field library to the preset prompt library, with or without the field association information, is similar to step S152 and is not described here again.

Step S330, carrying out parameter adjustment on a preset initial text conversion model based on each training field prompt word to obtain the text conversion model.

In some embodiments of the present application, a plurality of training field prompt words are input into the initial text conversion model in preset batches, and the model outputs the predicted vocabulary for the MASK position of each training field prompt word. The word in the training field prompt word that is closest to the predicted vocabulary is first determined. A preset loss function is then applied to the predicted vocabulary and that closest word to obtain a loss function value, and the parameters of the initial text conversion model are adjusted according to the loss function value to obtain the text conversion model. This allows the text conversion model to be combined with the prompt later so that accurate benchmarking option data is output. The preset batch size may be 2, 4, 8 and so on, and the preset loss function may be a cross entropy loss function.
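For reference, the cross entropy loss over one batch can be sketched in plain Python. As a simplifying assumption, each standard phrase is treated here as a single class with one predicted probability, whereas a sequence-to-sequence model such as T5 computes the loss token by token. The vocabulary and the probability values are hypothetical.

```python
import math


def cross_entropy(probabilities, target_index, eps=1e-12):
    """Negative log-probability assigned to the target word."""
    # eps guards against log(0) when the model gives the target zero probability.
    return -math.log(max(probabilities[target_index], eps))


def batch_loss(batch):
    """Mean cross entropy over a batch of (probabilities, target_index) pairs."""
    return sum(cross_entropy(p, t) for p, t in batch) / len(batch)


# Hypothetical vocabulary and model outputs for a batch of 2 prompts.
vocabulary = ["deposit account", "deposit line", "transaction platform"]
batch = [
    ([0.7, 0.2, 0.1], vocabulary.index("deposit account")),
    ([0.1, 0.1, 0.8], vocabulary.index("transaction platform")),
]
loss = batch_loss(batch)
```

The loss shrinks toward zero as the probability assigned to the correct standard phrase approaches 1, which is the signal used for parameter adjustment.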

In other embodiments of the present application, after the parameters of the initial text conversion model have been adjusted with the loss function value, a validation set may also be used to fine-tune the text conversion model. The fine-tuning stage usually uses labeled task data as the validation set, and the hyperparameters of the trained model are adjusted according to the results to obtain the text conversion model. This allows the text conversion model to be combined with the prompt later so that accurate benchmarking option data is output.

As shown in fig. 6, an embodiment of the present application provides a data standard benchmarking device 100. The device 100 first obtains, through a data acquisition module 110, a database to be aligned and a preset standard field library that includes a plurality of standard fields, which provides data support for the subsequent benchmarking. The data extraction module 120 then extracts the data dictionary from the database to be aligned to support the subsequent field classification. The first classification module 130 classifies each standard field by using a preset target classification model to obtain a plurality of standard phrase libraries, and the second classification module 140 classifies the fields in the data dictionary by using the target classification model to obtain a plurality of to-be-aligned fields; classifying both the standard fields and the data dictionary fields reduces the computation of subsequent matching and improves matching accuracy. The data combination module 150 then combines each to-be-aligned field belonging to the same category with the standard phrase library to obtain a plurality of field prompt words. Finally, the benchmarking processing module 160 inputs each field prompt word into a preset text conversion model for benchmarking and outputs the benchmarking result corresponding to each to-be-aligned field. Combining the field prompt words with the text conversion model yields accurate benchmarking options and improves the accuracy of data benchmarking.

It should be noted that the data acquisition module 110 is connected to the data extraction module 120, the data extraction module 120 is connected to the first classification module 130, the first classification module 130 is connected to the second classification module 140, the second classification module 140 is connected to the data combination module 150, and the data combination module 150 is connected to the benchmarking processing module 160. The data standard benchmarking method is applied to the data standard benchmarking device 100, which reduces the amount of computation through classification, improves matching accuracy, and combines the field prompt words with the text conversion model to output more accurate benchmarking option data.

It should also be noted that, when the device provided in the above embodiment implements its functions, the division into the above functional modules is only an example. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the device embodiments and the method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.

The present application also discloses an electronic device. Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application. The electronic device 500 may include: at least one processor 501, at least one network interface 504, a user interface 503, a memory 505, and at least one communication bus 502.

The communication bus 502 is used to enable connection and communication between these components.

The user interface 503 may include a display screen (Display) and a camera (Camera); optionally, the user interface 503 may further include a standard wired interface and a standard wireless interface.

The network interface 504 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface).

The processor 501 may include one or more processing cores. The processor 501 connects the various parts of the electronic device through various interfaces and lines, and performs the functions of the electronic device and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory 505 and invoking the data stored in the memory 505. Optionally, the processor 501 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 501 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU renders and draws the content to be displayed on the display screen; the modem handles wireless communications. It will be appreciated that the modem may also not be integrated into the processor 501 and may instead be implemented as a separate chip.

The memory 505 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). Optionally, the memory 505 comprises a non-transitory computer-readable storage medium. The memory 505 may be used to store instructions, programs, code sets, or instruction sets. The memory 505 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the above method embodiments, and the like; the data storage area may store the data involved in the above method embodiments. Optionally, the memory 505 may also be at least one storage device located remotely from the processor 501. Referring to fig. 7, the memory 505, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and an application program for the data standard benchmarking method.

In the electronic device 500 shown in fig. 7, the user interface 503 is mainly used to provide an input interface for a user and to acquire the data input by the user. The processor 501 may be configured to invoke the application program for the data standard benchmarking method stored in the memory 505, which, when executed by the one or more processors 501, causes the electronic device 500 to perform the method of one or more of the embodiments described above. It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions, but those skilled in the art should understand that the present application is not limited by the order of actions described, as some steps may be performed in another order or simultaneously in accordance with the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.

In the several embodiments provided herein, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some service interface, device or unit, and may be electrical or in other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk or an optical disk.

The above are merely exemplary embodiments of the present disclosure and are not intended to limit the scope of the present disclosure. That is, equivalent changes and modifications are contemplated by the teachings of this disclosure, which fall within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure.

This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the scope and spirit of the disclosure being indicated by the claims.

Claims

1. A method for aligning data standards, the method comprising:

extracting a data dictionary from the database to be aligned;

inputting each field prompt word into a preset text conversion model for scaling, and outputting scaling results corresponding to each field to be scaled;

The extracting the data dictionary from the database to be compared with the standard comprises the following steps:

taking a query term in the data table query statement as a key name, and storing the query field corresponding to the query term as a key value to obtain the data dictionary;

combining each to-be-aligned field belonging to the same category with the standard phrase library to obtain a plurality of field prompt words, wherein the method comprises the following steps:

2. The method of claim 1, wherein before classifying each of the standard fields using a predetermined target classification model to obtain a plurality of standard phrase libraries, the method further comprises:

3. The method according to claim 2, wherein after said determining whether said classification accuracy meets said preset classification criterion, the method further comprises:

4. The method of claim 1, wherein before inputting each of the field prompt words into a preset text conversion model for scaling, and outputting a scaling result, the method further comprises:

5. The method of claim 1, wherein the benchmarking result includes a prediction field and a null field;

6. A data standard alignment device, the device comprising:

the data acquisition module (110) is used for acquiring a database to be aligned and a preset standard field library, wherein the standard field library comprises a plurality of standard fields;

The data extraction module (120) is used for extracting a data dictionary from the database to be calibrated;

the first classification module (130) is used for classifying each standard field by utilizing a preset target classification model to obtain a plurality of standard phrase libraries;

the second classification module (140) is used for classifying fields in the data dictionary by utilizing the target classification model to obtain a plurality of to-be-aligned fields;

the data combination module (150) is used for combining the to-be-aligned fields belonging to the same category with the standard phrase library to obtain a plurality of field prompt words;

the bid-alignment processing module (160) is used for inputting each field prompt word into a preset text conversion model to perform bid alignment and outputting a bid-alignment result corresponding to each field to be aligned;

the data extraction module (120) is specifically configured to query the data table in the to-be-compared database according to a preset data table query statement to obtain a plurality of query fields; taking a query term in the data table query statement as a key name, and storing the query field corresponding to the query term as a key value to obtain the data dictionary;

the data combination module (150) is specifically configured to determine field association information corresponding to each field to be aligned according to the category of each field to be aligned and the data dictionary; and adding the fields to be aligned, the standard phrase library and the field association information belonging to the same category into a preset prompt library to obtain the field prompt words corresponding to the fields to be aligned.

7. An electronic device comprising a processor (501), a memory (505), a user interface (503), a communication bus (502) and a network interface (504), the processor (501), the memory (505), the user interface (503) and the network interface (504) being respectively connected to the communication bus (502), the memory (505) being adapted to store instructions, the user interface (503) and the network interface (504) being adapted to communicate to other devices, the processor (501) being adapted to execute the instructions stored in the memory (505) to cause the electronic device (500) to perform the method according to any of claims 1-5.

8. A computer readable storage medium storing instructions which, when executed, perform the method of any one of claims 1-5.