CN117421421A - Multi-source data dictionary fusion method and device, medium and equipment - Google Patents

Multi-source data dictionary fusion method and device, medium and equipment

Info

Publication number
CN117421421A
Authority
CN
China
Prior art keywords
data dictionary
data
similarity
fused
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311345875.0A
Other languages
Chinese (zh)
Inventor
杨万哲
王庆
王历
Original Assignee
东北大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学
Priority to CN202311345875.0A
Publication of CN117421421A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a multi-source data dictionary fusion method. The method comprises the following steps: acquiring data dictionaries to be fused and data logic description information; extracting first keywords from the data dictionary tables of the data dictionaries to be fused based on the data logic description information; extracting second keywords from the data dictionary tables based on the word frequency of the fields in the data dictionary tables, and fusing the first keywords and the second keywords to obtain target keywords; calculating two-dimensional weighted editing distances of the target keywords, and performing primary classification on the data dictionary tables to obtain coarse categories of the data dictionary tables; within the same coarse category, calculating multidimensional weighted editing distances of the fields, and reclassifying the data dictionary tables to obtain fine categories of the data dictionary tables; and within the same fine category, calculating the table similarity of the data dictionary tables, and fusing the data dictionary tables according to the table similarity. The method and the device address the high personnel cost and long turnaround of schemes that rely on designers to modify the design documents of a data dictionary.

Description

Multi-source data dictionary fusion method and device, medium and equipment
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a medium, and a device for multi-source data dictionary fusion.
Background
In addition to user data, a database holds a large amount of non-user information, such as the length and type of data items, user identifiers, primary and foreign keys, and the connections between data and files. This information describes the database system as a whole. To provide a common standard and basis for database design, implementation, operation, maintenance, and expansion, and to ensure database sharing, security, integrity, consistency, validity, recoverability, and scalability, such information is kept in a data dictionary. In recent years, data dictionaries have been used as auxiliary management tools in the field of grassroots social management at home and abroad.
However, when the application scope of a data dictionary changes, or when two or more data dictionaries are needed to complete a business process together, data inconsistency and redundancy arise because the data dictionaries do not share unified data standards and naming standards. For example, in cross-department linkage, a data dictionary designed for the database of a government department or company may be ambiguous or redundant with respect to the design of the environment in which it is to be applied. Therefore, when a joint system goes online, a data dictionary that has already been designed and deployed must be adapted to the new system as it is integrated into the designated linkage system, so as to ensure that the system operates accurately. In actual business scenarios, the usual solution is to rely on the original data dictionary designers to modify the design documents of the data dictionary and to carry out system adaptation and optimization again, so as to address the reduction in system capability when the data dictionary is adapted to the new system. However, this scheme suffers from high personnel cost and a long turnaround, and struggles to meet actual business demands.
Disclosure of Invention
In view of the above, the application provides a multi-source data dictionary fusion method, device, medium, and equipment, which address the high personnel cost and long turnaround of schemes that rely on the original data dictionary designers to modify the design documents of a data dictionary.
According to one aspect of the present application, there is provided a multi-source data dictionary fusion method, including:
acquiring a plurality of data dictionaries to be fused and data logic description information corresponding to each data dictionary to be fused;
based on the data logic description information, extracting a first keyword from a data dictionary table of each data dictionary to be fused, wherein the data logic description information at least comprises one of the following: table names, primary keys, entity relations and labels of the data dictionary tables;
extracting a second keyword from the data dictionary table based on the word frequency of each field in the data dictionary table, and fusing the first keyword and the second keyword to obtain a target keyword;
calculating two-dimensional weighted editing distances among target keywords in different data dictionaries to be fused, and performing primary classification on the data dictionary tables corresponding to the target keywords according to the two-dimensional weighted editing distances to obtain a coarse category corresponding to each data dictionary table;
in the same coarse category, calculating a multidimensional weighted editing distance between the fields, and reclassifying the data dictionary tables corresponding to the fields according to the multidimensional weighted editing distance to obtain a fine category corresponding to each data dictionary table;
and in the same fine category, calculating the table similarity between different data dictionary tables, and fusing the data dictionary tables according to the table similarity to obtain the target data dictionary.
Optionally, calculating a two-dimensional weighted editing distance between target keywords in different data dictionaries to be fused includes:
reading preset weight mapping information to obtain preset weights corresponding to the target keywords;
based on the data logic description information, adjusting the preset weights to obtain first weights corresponding to the target keywords;
and calculating two-dimensional weighted editing distances among target keywords in different data dictionaries to be fused according to the first weights.
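The weight derivation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `first_weights`, the multiplicative `boost` rule, and the default values are hypothetical, since the text does not specify how the preset weights are adjusted.

```python
from typing import Dict, Iterable, Set


def first_weights(
    keywords: Iterable[str],
    preset: Dict[str, float],
    structural: Set[str],
    default: float = 1.0,
    boost: float = 1.5,
) -> Dict[str, float]:
    """Return a first weight for each target keyword.

    preset     -- the preset weight mapping information (keyword -> weight)
    structural -- keywords that appear in the data logic description
                  information (table names, primary keys, labels, ...)
    boost      -- assumed adjustment: multiply the preset weight of
                  structural keywords by this factor (hypothetical rule)
    """
    weights: Dict[str, float] = {}
    for kw in keywords:
        base = preset.get(kw, default)  # fall back when no preset exists
        weights[kw] = base * boost if kw in structural else base
    return weights
```

For instance, a keyword that is also a primary key would receive a larger first weight than a free-text remark field with the same preset weight.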
Optionally, calculating a multidimensional weighted editing distance between the fields in the same coarse category includes:
in the same coarse category, judging whether each field in the data dictionary table is the target keyword, and acquiring design information of each field, wherein the design information at least comprises a field type;
and determining a second weight of each field according to the judgment result corresponding to each field and the design information, and calculating the multidimensional weighted editing distance between the fields according to the second weight.
Optionally, before calculating the multidimensional weighted edit distance between the fields according to the second weight, the method further comprises:
and determining an adjustment factor of each target keyword, and adjusting the second weight according to the adjustment factors, wherein the adjustment factors comprise a language system dimension adjustment factor and/or a text dimension adjustment factor.
Optionally, fusing the data dictionary tables according to the table similarity to obtain a target data dictionary, including:
if the similarity of the table is greater than a first similarity threshold, fusing the data dictionary table, and obtaining the target data dictionary according to the fused target data dictionary table;
and if the similarity of the table is smaller than the first similarity threshold and larger than the second similarity threshold, extracting an intersection and a difference set of the data dictionary table, and obtaining the target data dictionary according to the intersection and the difference set.
Optionally, fusing the data dictionary tables according to the table similarity to obtain a target data dictionary, and further including:
if the table similarity is smaller than a second similarity threshold, ending the fusion and generating prompt information, wherein the prompt information is used for indicating that the data dictionary similarity is low; or,
and if the table similarity is smaller than a second similarity threshold, returning to the step of acquiring the design information of each field to acquire new design information, wherein the new design information comprises a field type, a field length and a default value.
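The threshold logic in the optional steps above can be sketched as a small decision function. This is a minimal sketch, not the applicant's implementation: the name `fuse_tables`, the threshold values 0.8 and 0.5, the result layout, and the handling of values exactly equal to a threshold are all assumptions, and only the "end the fusion and prompt" branch of the low-similarity case is shown.

```python
from typing import Dict, Iterable


def fuse_tables(
    fields_a: Iterable[str],
    fields_b: Iterable[str],
    similarity: float,
    high: float = 0.8,  # first similarity threshold (assumed value)
    low: float = 0.5,   # second similarity threshold (assumed value)
) -> Dict[str, object]:
    """Decide how two data dictionary tables are fused."""
    a, b = set(fields_a), set(fields_b)
    if similarity > high:
        # High similarity: fuse the two tables into one target table.
        return {"action": "merge", "fields": sorted(a | b)}
    if similarity > low:
        # Medium similarity: keep the intersection and the difference
        # sets separately so each can be handled on its own.
        return {
            "action": "partial",
            "shared": sorted(a & b),
            "only_a": sorted(a - b),
            "only_b": sorted(b - a),
        }
    # Low similarity: end the fusion and emit a prompt message.
    return {"action": "reject", "message": "data dictionary similarity is low"}
```

The alternative low-similarity branch in the text, which returns to the design-information step with a richer set of attributes (field type, field length, default value), would replace the final `return` with a retry of the distance calculation.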
Optionally, after acquiring the plurality of data dictionaries to be fused, the method further comprises:
and preprocessing the data dictionary to be fused, wherein the preprocessing at least comprises entity disambiguation.
According to another aspect of the present application, there is provided a multi-source data dictionary fusion apparatus, the apparatus comprising:
the acquisition module is used for acquiring a plurality of data dictionaries to be fused and data logic description information corresponding to each data dictionary to be fused;
the feature extraction module is used for respectively extracting first keywords from the data dictionary tables of each data dictionary to be fused based on the data logic description information, wherein the data logic description information at least comprises one of the following: table names, primary keys, entity relations and labels of the data dictionary tables; extracting a second keyword from the data dictionary table based on the word frequency of each field in the data dictionary table, and fusing the first keyword and the second keyword to obtain a target keyword;
the classification module is used for calculating two-dimensional weighted editing distances between target keywords in different data dictionaries to be fused, and carrying out primary classification on the data dictionary tables corresponding to the target keywords according to the two-dimensional weighted editing distances to obtain coarse categories corresponding to each data dictionary table; in the same coarse category, calculating a multidimensional weighted editing distance between the fields, and reclassifying the data dictionary tables corresponding to the fields according to the multidimensional weighted editing distance to obtain a fine category corresponding to each data dictionary table;
and the fusion module is used for calculating the table similarity between different data dictionary tables in the same fine category, and fusing the data dictionary tables according to the table similarity to obtain a target data dictionary.
Optionally, the classification module is configured to:
reading preset weight mapping information to obtain preset weights corresponding to the target keywords;
based on the data logic description information, adjusting the preset weights to obtain first weights corresponding to the target keywords;
and calculating two-dimensional weighted editing distances between target keywords in different data dictionaries to be fused according to the first weights.
Optionally, the classification module is configured to:
in the same coarse category, judging whether each field in the data dictionary table is the target keyword, and acquiring design information of each field, wherein the design information at least comprises a field type;
and determining a second weight of each field according to the judgment result corresponding to each field and the design information, and calculating the multidimensional weighted editing distance between the fields according to the second weight.
Optionally, the classification module is configured to:
and determining an adjustment factor of each target keyword, and adjusting the second weight according to the adjustment factors, wherein the adjustment factors comprise a language system dimension adjustment factor and/or a text dimension adjustment factor.
Optionally, the fusion module is configured to:
if the similarity of the table is greater than a first similarity threshold, fusing the data dictionary table, and obtaining the target data dictionary according to the fused target data dictionary table;
and if the similarity of the table is smaller than the first similarity threshold and larger than the second similarity threshold, extracting an intersection and a difference set of the data dictionary table, and obtaining the target data dictionary according to the intersection and the difference set.
Optionally, the fusion module is further configured to:
if the table similarity is smaller than a second similarity threshold, ending the fusion and generating prompt information, wherein the prompt information is used for indicating that the data dictionary similarity is low; or,
and if the table similarity is smaller than a second similarity threshold, returning to the step of acquiring the design information of each field to acquire new design information, wherein the new design information comprises a field type, a field length and a default value.
Optionally, the apparatus further comprises a preprocessing module for:
and preprocessing the data dictionary to be fused, wherein the preprocessing at least comprises entity disambiguation.
According to yet another aspect of the present application, there is provided a medium having stored thereon a program or instructions which, when executed by a processor, implement the above-described multi-source data dictionary fusion method.
According to a further aspect of the present application, there is provided an apparatus comprising a storage medium storing a computer program and a processor implementing the above-mentioned multi-source data dictionary fusion method when the processor executes the computer program.
By means of the above technical scheme, the special data structure of the data dictionary is taken into account. Feature extraction is performed by two means, the data logic description information and TF-IDF, which yield the first keywords and the second keywords respectively. The extraction based on the data logic description information reflects features at the data structure level, while the extraction based on TF-IDF reflects the fact that the extracted objects are text; combining the two better suits the particular scenario of keyword extraction from a data dictionary. The data dictionary tables are then classified using the extracted keywords. A two-stage weighted editing distance algorithm is used for classification, the stages implementing primary classification and reclassification respectively, so that data dictionary fusion is carried out within the same category. Calculating the weighted editing distance twice, first two-dimensional and then multidimensional, improves operating efficiency and also avoids the biased results that would arise if a single weighted editing distance calculation introduced dimensions unsuitable for calculating similarity. The data dictionaries are thus fused automatically without manually modifying their design documents, which reduces labor cost and time cost.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 shows a flow diagram of a multi-source data dictionary fusion method according to an embodiment of the present application;
FIG. 2 shows a flow diagram of another multi-source data dictionary fusion method according to an embodiment of the present application;
FIG. 3 shows a schematic diagram of a coarse classification result of another multi-source data dictionary fusion method according to an embodiment of the present application;
FIG. 4 shows a schematic diagram of a fine classification result of another multi-source data dictionary fusion method according to an embodiment of the present application;
FIG. 5 shows a flow diagram of another multi-source data dictionary fusion method according to an embodiment of the present application;
FIG. 6 shows a block diagram of a multi-source data dictionary fusion device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
In this embodiment, a multi-source data dictionary fusion method is provided, as shown in fig. 1, and the method includes:
step 101, acquiring a plurality of data dictionaries to be fused and data logic description information corresponding to each data dictionary to be fused;
step 102, based on the data logic description information, extracting a first keyword from the data dictionary table of each data dictionary to be fused, wherein the data logic description information at least comprises one of the following: table names, primary keys, entity relationships and labels of the data dictionary table;
step 103, extracting a second keyword from the data dictionary table based on the word frequency of each field in the data dictionary table, and fusing the first keyword and the second keyword to obtain a target keyword;
step 104, calculating two-dimensional weighted editing distances among target keywords in different data dictionaries to be fused, and primarily classifying the data dictionary tables corresponding to the target keywords according to the two-dimensional weighted editing distances to obtain coarse categories corresponding to each data dictionary table;
Step 105, calculating multidimensional weighted editing distances among the fields in the same coarse category, and reclassifying the data dictionary tables corresponding to the fields according to the multidimensional weighted editing distances to obtain fine categories corresponding to each data dictionary table;
step 106, calculating the table similarity among different data dictionary tables in the same fine category, and fusing the data dictionary tables according to the table similarity to obtain the target data dictionary.
The multi-source data dictionary fusion method provided by this embodiment is used to fuse different data dictionaries. For example, in the field of grassroots social management at home and abroad, the data dictionaries adopted by different cities may have non-uniform table designs; the method can fuse tables with different designs into one integrated design, so that the original data dictionaries are adapted to the new integrated system.
In this embodiment, at least two data dictionaries to be fused are first obtained, and data logic description information corresponding to each data dictionary to be fused is obtained, where the data logic description information may include a table name, a primary key, an entity relationship, a label, and the like of the data dictionary table.
This embodiment performs feature extraction using the data logic description information and a keyword extraction technique. Specifically, the data logic description information can be used to extract the first keywords. For example, drug-related persons and persons released after serving their sentences are persons of key concern, so the data dictionary tables named for drug-related persons and for released persons can be selected, and the fields corresponding to the hotel accommodation label can be extracted from these tables as first keywords. In addition, the second keywords are extracted using the keyword extraction technique TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF is a common weighting technique for finding keywords and can be used to evaluate the importance of a field in a text. Screening field keywords with TF-IDF identifies the words with higher weight and importance among the fields, which assists data analysis, searching, and text mining.
After the keywords are extracted, the similarity of the data dictionary tables is determined based on the keywords, and the data dictionary tables with higher similarity are then fused. A data dictionary differs greatly from conventional text: conventional text is coherent as a whole and context dependent, its fields can repeat, and it contains many stop words, whereas in a data dictionary the fields generally do not repeat and adjacent entries are unrelated to one another. This embodiment therefore uses a weighted editing distance algorithm to calculate similarity and realize fusion. Specifically, the weighted editing distance can be calculated twice. The first calculation realizes primary classification: the two-dimensional weighted editing distance between target keywords is calculated to obtain the coarse category of the data dictionary table corresponding to each target keyword. The second calculation realizes refined classification: the multidimensional weighted editing distance is calculated within each coarse category obtained by the primary classification to obtain the fine category corresponding to each data dictionary table, so that data dictionary tables are fused within the same fine category. For example, the data dictionary tables of the same fine category can be fused while those of different fine categories are not; alternatively, the table similarity can be further calculated within the same fine category, and the tables with higher similarity fused. This reduces the amount of calculation and improves operating efficiency.
It will be appreciated that the minimum editing distance algorithm is a classical algorithm for measuring text similarity, based on the minimum number of edit operations required to convert one symbol string into another. In the unweighted minimum editing distance, inserting, deleting, and substituting characters all carry the same cost. Because words differ in importance, the weighted minimum editing distance is adopted so that different words are given different weights.
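The TF-IDF step and the fusion of the two keyword sets can be illustrated with a short standard-library sketch. It assumes each data dictionary table is treated as one "document" whose terms are its field tokens; the function names `tfidf_keywords` and `merge_keywords`, the unsmoothed `log(N / df)` weighting, and the `top_k` cut-off are illustrative choices, not details given in the text.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(tables: Dict[str, List[str]], top_k: int = 2) -> Dict[str, List[str]]:
    """Pick the top_k field tokens of each table by TF-IDF score."""
    n_tables = len(tables)
    doc_freq: Counter = Counter()
    for tokens in tables.values():
        doc_freq.update(set(tokens))  # count each token once per table

    result: Dict[str, List[str]] = {}
    for name, tokens in tables.items():
        if not tokens:
            result[name] = []
            continue
        term_freq = Counter(tokens)
        scores = {
            tok: (count / len(tokens)) * math.log(n_tables / doc_freq[tok])
            for tok, count in term_freq.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        result[name] = ranked[:top_k]
    return result


def merge_keywords(first: List[str], second: List[str]) -> List[str]:
    """Fuse first and second keywords into target keywords (ordered union)."""
    return list(dict.fromkeys(first + second))
```

A token that appears in every table, such as a generic `name` field, scores zero under this weighting and is therefore not selected as a second keyword.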
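The weighted minimum editing distance described above can be sketched with the usual dynamic-programming table. This is a minimal sketch under assumptions: the per-token `weight` function and the rule that a substitution costs the larger of the two token weights are illustrative choices, not taken from the application.

```python
from typing import Callable, Hashable, Sequence


def weighted_edit_distance(
    a: Sequence[Hashable],
    b: Sequence[Hashable],
    weight: Callable[[Hashable], float] = lambda tok: 1.0,
) -> float:
    """Weighted editing distance between two token sequences.

    Deleting or inserting a token costs weight(token); substituting
    costs the larger of the two weights (assumed rule); a match is free.
    """
    m, n = len(a), len(b)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # delete every token of a
        dist[i][0] = dist[i - 1][0] + weight(a[i - 1])
    for j in range(1, n + 1):  # insert every token of b
        dist[0][j] = dist[0][j - 1] + weight(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub_cost = 0.0
            else:
                sub_cost = max(weight(a[i - 1]), weight(b[j - 1]))
            dist[i][j] = min(
                dist[i - 1][j] + weight(a[i - 1]),   # deletion
                dist[i][j - 1] + weight(b[j - 1]),   # insertion
                dist[i - 1][j - 1] + sub_cost,       # substitution or match
            )
    return dist[m][n]
```

With the default unit weights the function reduces to the classical unweighted distance; passing a weight function that scores important words higher makes edits to those words more expensive.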
In the formula for the weighted editing distance, $\mathrm{lev}_{a,b}(i, j)$ denotes the weighted editing distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, del denotes deletion, ins denotes insertion, and sub denotes substitution.
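For reference, a standard recurrence for a weighted editing distance that is consistent with the symbols lev, del, ins, and sub named in the text can be written as below. The weight functions $w_{\mathrm{del}}$, $w_{\mathrm{ins}}$, and $w_{\mathrm{sub}}$ are assumed notation, and this textbook form may differ in detail from the formula in the application.

```latex
\mathrm{lev}_{a,b}(i,j) =
\begin{cases}
\sum_{k=1}^{i} w_{\mathrm{del}}(a_k) & \text{if } j = 0,\\[4pt]
\sum_{k=1}^{j} w_{\mathrm{ins}}(b_k) & \text{if } i = 0,\\[4pt]
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1,j) + w_{\mathrm{del}}(a_i)\\
\mathrm{lev}_{a,b}(i,j-1) + w_{\mathrm{ins}}(b_j)\\
\mathrm{lev}_{a,b}(i-1,j-1) + w_{\mathrm{sub}}(a_i,b_j)
\end{cases} & \text{otherwise,}
\end{cases}
\qquad w_{\mathrm{sub}}(a_i,b_j) = 0 \text{ when } a_i = b_j .
```

Setting all three weights to 1 recovers the unweighted minimum editing distance.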
In a specific application, not all dimensions are suitable for calculating similarity, so the two weighted editing distance calculations can use different dimensions. For example, the first calculation uses a two-dimensional weighted editing distance and analyzes only the dimensions that are suitable for calculating similarity across all tables. The second calculation uses a multidimensional weighted editing distance: based on the result of the first classification, the commonality of the data dictionary tables within each coarse category is analyzed, the dimensions suitable for calculating the similarity of those data dictionary tables are determined, and the analysis is then performed on those dimensions.
This embodiment provides a data dictionary fusion method for specialized fields, which can be used in fields such as grassroots social management at home and abroad, and which takes the special data structure of the data dictionary into account. Feature extraction is performed by two means, the data logic description information and TF-IDF, which yield the first keywords and the second keywords respectively. The extraction based on the data logic description information reflects features at the data structure level, while the extraction based on TF-IDF reflects the fact that the extracted objects are text; combining the two better suits the particular scenario of keyword extraction from a data dictionary. The data dictionary tables are then classified using the extracted keywords. A two-stage weighted editing distance algorithm is used for classification, the stages implementing primary classification and reclassification respectively, so that data dictionary fusion is carried out within the same category. Calculating the weighted editing distance twice, first two-dimensional and then multidimensional, improves operating efficiency and also avoids the biased results that would arise if a single weighted editing distance calculation introduced dimensions unsuitable for calculating similarity. The data dictionaries are thus fused automatically without manually modifying their design documents, which reduces labor cost and time cost.
Further, as a refinement and extension of the foregoing embodiment, for fully explaining the implementation procedure of the embodiment, another multi-source data dictionary fusion method is provided, as shown in fig. 2, and the method includes the following steps:
step 201, obtaining a plurality of data dictionaries to be fused and data logic description information corresponding to each data dictionary to be fused, and preprocessing the data dictionaries to be fused, wherein the preprocessing at least comprises entity disambiguation.
In this step, after the data dictionaries to be fused and their corresponding data logic description information are acquired, the data dictionaries to be fused are preprocessed to improve the accuracy and effect of fusion. The preprocessing includes entity disambiguation, which standardizes ambiguous words in the data dictionaries to be fused. It will be appreciated that the same real-world entity may have different names or aliases. For example, in community grids, "telephone number" and "contact address" commonly refer to the same real-world entity, the telephone number, yet point to two different fields, so the two designations need to be standardized. Specifically, this can be implemented by an algorithm based on word similarity or context, in which differently named entities are mapped to the same unique identifier to disambiguate them.
In addition, the preprocessing may include data cleaning operations, which, for non-canonical fields or words containing errors or redundancy, repair spelling errors, remove special characters, unify letter case, and resolve format inconsistencies, so as to obtain high-quality and consistent field data. The preprocessing may also include data standardization operations, which unify field units, map value ranges, and so on for data from different sources or in different formats. Data standardization makes the fused data easier to manage and compare in a unified way. The preprocessing may further include constructing an associated word list, which specifies association relations among words according to related documents, domain knowledge, or expert opinions. Specific words or phrases can then be processed through the associated word list to ensure that the relevant fields in the data dictionary are correctly identified and matched.
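The simplest form of the disambiguation described above, mapping aliases to one unique identifier, can be sketched with a lookup table. The alias table below mirrors the example in the text and is otherwise hypothetical, as are the names `ALIASES` and `canonical_field`; a similarity-based or context-based algorithm, which the text also mentions, would replace the plain dictionary lookup.

```python
from typing import Dict

# Hypothetical alias table: each alias points to one canonical identifier.
ALIASES: Dict[str, str] = {
    "telephone number": "phone_number",
    "contact address": "phone_number",
}


def canonical_field(name: str, aliases: Dict[str, str] = ALIASES) -> str:
    """Map a field name to its canonical identifier.

    The name is lower-cased and its whitespace collapsed before lookup;
    names without a known alias are returned in that normalized form.
    """
    key = " ".join(name.lower().split())
    return aliases.get(key, key)
```

After this step, two tables that use different designations for the same entity expose the same identifier, so later distance calculations compare like with like.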
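A few of the cleaning operations listed above (removing special characters, unifying case, resolving format inconsistencies) can be sketched as follows. The function name `clean_field_name` and the choice of underscore-separated lower-case output are assumptions made for illustration; spelling repair and unit unification are not covered.

```python
import re
import unicodedata


def clean_field_name(raw: str) -> str:
    """Normalize one field name for later comparison."""
    # Fold compatibility forms (for example full-width letters) to plain ones.
    text = unicodedata.normalize("NFKC", raw)
    text = text.strip().lower()
    # Replace punctuation and other special characters with spaces;
    # \w is Unicode-aware, so CJK characters are preserved.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace or underscores into a single underscore.
    text = re.sub(r"[\s_]+", "_", text).strip("_")
    return text
```

Applying the same function to every field name before keyword extraction keeps purely cosmetic differences, such as `ID-Card No.` versus `id_card_no`, from inflating the editing distance.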
Step 202, extracting a first keyword from a data dictionary table of each data dictionary to be fused based on the data logic description information, wherein the data logic description information at least comprises one of the following: table names, primary keys, entity relationships and labels of the data dictionary table; and extracting a second keyword from the data dictionary table based on the word frequency of each field in the data dictionary table, and fusing the first keyword and the second keyword to obtain a target keyword.
Step 203, reading preset weight mapping information to obtain a preset weight corresponding to the target keyword; based on the data logic description information, adjusting a preset weight to obtain a first weight corresponding to each target keyword; and calculating two-dimensional weighted editing distances among the target keywords in different data dictionaries to be fused according to the first weights.
Step 204, performing primary classification on the data dictionary tables corresponding to the target keywords according to the two-dimensional weighted editing distance to obtain the coarse category corresponding to each data dictionary table.
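The keyword extraction of step 202 can be illustrated with a short sketch. It rests on several assumptions that are not stated in the application: the logic description is a plain dictionary with the hypothetical keys `table_name`, `primary_key` and `labels`, word frequency is counted over whitespace-separated words of the field names, and the two keyword sets are fused by a set union.

```python
from collections import Counter


def first_keywords(description: dict) -> set:
    """Keywords read directly from the data logic description information."""
    words = {description["table_name"], description["primary_key"]}
    words.update(description.get("labels", []))
    return words


def second_keywords(field_names: list, top_n: int = 3) -> set:
    """Most frequent words across the field names of one data dictionary table."""
    counts = Counter(word for name in field_names for word in name.split())
    return {word for word, _ in counts.most_common(top_n)}


def target_keywords(description: dict, field_names: list, top_n: int = 3) -> set:
    """Fuse both keyword sets (a plain set union, which is an assumption)."""
    return first_keywords(description) | second_keywords(field_names, top_n)


# Hypothetical sample table, for illustration only.
description = {
    "table_name": "population basic information",
    "primary_key": "id card number",
    "labels": ["resident"],
}
fields = [
    "resident name",
    "resident id card number",
    "resident address",
    "id card issue date",
]
keywords = target_keywords(description, fields)
```

A TF-IDF score, which the description mentions later, could replace the raw word count without changing the structure of the sketch.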
In steps 203-204, a corresponding weight, i.e., a first weight, is given to each target keyword, and then a weighting process is performed by using the first weight, so as to obtain a two-dimensional weighted editing distance between every two target keywords. Further, the data dictionary tables are classified according to the two-dimensional weighted editing distance, and it is understood that the smaller the two-dimensional weighted distance between two target keywords is, the higher the likelihood that the types of the two target keywords are identical. Based on the above, a distance threshold value can be preset, target keywords with two-dimensional weighted distances smaller than the distance threshold value are extracted, and the corresponding data dictionary tables are classified into the same rough category.
FIG. 3 is a schematic diagram of a coarse classification result according to an embodiment of the present application, where the rows and columns corresponding to the circled cells are target keywords classified into the same coarse category. For example, the two-dimensional weighted editing distance between "address building address code" and "identity card number" is 12; this distance is too large for the two to be classified into the same coarse category. The two-dimensional weighted editing distance between "population basic identity card number" and "identity card number" is 6; this distance is small, so the two are classified into the same coarse category. The samples are finally divided into three coarse categories: residents, houses, and legal persons.
In a specific application process, the weight mapping information may be a weight table based on statistics, or may be a preset operation rule. For example, for fields in a community grid data dictionary, the weight of a field that originates from a national standard is greater than the weight of a field that originates from a local standard: the weight of keywords in national standard fields such as community code, community name and community boundary is higher than that of keywords in local standard fields such as community classification, community responsible person and community facilities. After the preset weight is determined according to the weight mapping information, the preset weight may be adjusted according to the data logic description information. For example, for a two-dimensional representation of the form name + primary key, the name is given 1.5 times the weight to obtain the final first weight.
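The threshold-based grouping can be sketched as below. This is an illustration rather than the method of the application: the function name `group_by_threshold`, the threshold value 8 and the single-linkage grouping through a union-find structure are assumptions. Only the two distances (6 and 12) are taken from the example of FIG. 3; pairs that are not listed are simply never linked.

```python
def group_by_threshold(items, distances, threshold):
    """Group items whose pairwise distance is below the threshold (single linkage)."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), distance in distances.items():
        if distance < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return list(groups.values())


A = "identity card number"
B = "population basic identity card number"
C = "address building address code"

# Distances taken from the FIG. 3 example; the threshold 8 is an assumed value.
coarse = group_by_threshold([A, B, C], {(A, B): 6, (A, C): 12}, threshold=8)
```

With these inputs the two identity card keywords end up in one group and the address code keyword in a group of its own.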
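As a rough sketch of how the first weight could enter the distance, the example below treats a target keyword as a (name, primary key) pair and sums the per-dimension Levenshtein distances, each scaled by its weight. The exact formula is not given in the application, so the weighted sum, the preset values in `PRESET_WEIGHTS` and all function names are assumptions; only the 1.5 times boost of the name comes from the text.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


# Hypothetical preset weights, standing in for the weight mapping information.
PRESET_WEIGHTS = {"name": 1.0, "primary_key": 1.0}
NAME_BOOST = 1.5  # the 1.5x factor for the name dimension mentioned in the text


def first_weights(preset: dict) -> dict:
    """Adjust the preset weights to obtain the first weights."""
    weights = dict(preset)
    weights["name"] *= NAME_BOOST
    return weights


def two_dim_distance(kw_a: tuple, kw_b: tuple, weights: dict) -> float:
    """Weighted sum of the edit distances of the name and primary-key dimensions."""
    return (weights["name"] * edit_distance(kw_a[0], kw_b[0])
            + weights["primary_key"] * edit_distance(kw_a[1], kw_b[1]))


weights = first_weights(PRESET_WEIGHTS)
distance = two_dim_distance(("resident", "id"), ("residents", "id_no"), weights)
```

Here the name dimension contributes 1.5 × 1 and the primary-key dimension 1.0 × 3, which gives 4.5.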
Step 205, in the same rough category, judging whether each field in the data dictionary table is a target keyword or not, and obtaining design information of each field, wherein the design information at least comprises a field type; and determining a second weight of the fields according to the judgment result and the design information corresponding to each field, and calculating a multidimensional weighted editing distance between the fields according to the second weight.
And step 206, reclassifying the data dictionary tables corresponding to the fields according to the multidimensional weighted editing distance to obtain the fine class corresponding to each data dictionary table.
In steps 205-206, a corresponding weight, i.e., a second weight, is given to each field, and a weighting process is then performed using the second weight to obtain a multidimensional weighted editing distance between every two fields. Further, the data dictionary tables are classified according to the multidimensional weighted editing distance. It is understood that the smaller the multidimensional weighted editing distance between two fields is, the higher the likelihood that their types are identical. On this basis, a distance threshold can be preset, fields whose multidimensional weighted editing distance is smaller than the distance threshold are extracted, and the corresponding data dictionary tables are classified into the same fine category. The distance threshold corresponding to the coarse category may be the same as or different from the distance threshold corresponding to the fine category. For example, if the distance threshold is set to 5 and the multidimensional weighted editing distance is 4.5, the two fields are determined to be similar and are classified into the same fine category.
Fig. 4 is a schematic diagram of a fine classification result according to an embodiment of the present application, showing the target keywords under the coarse category of residents, where the rows and columns corresponding to the circled cells are target keywords classified into the same fine category. For example, the multidimensional weighted editing distance between "population basic information table identity card number" and "identity document information table identity card number" is 4.5; this distance is small, so the two are classified into the same fine category. The multidimensional weighted editing distance between "population basic information table identity card number" and "resident expansion information table identity card number" is 14.5; this distance is large, so the two are not classified into the same fine category.
In a specific application, the second weight may be determined based on whether the field is a target keyword and on the design information of the field. For example, the weights may be assigned in the ratio keyword : common field : field type = 6 : 3 : 1.
In step 205, the second weight may also be adjusted before the multidimensional weighted editing distance between the fields is calculated based on the second weight. For example, an adjustment factor for each target keyword is determined and the second weight is adjusted according to the adjustment factor, wherein the adjustment factor includes a language system dimension adjustment factor and/or a text dimension adjustment factor.
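One possible reading of the 6 : 3 : 1 ratio is sketched below. It is a hypothetical interpretation, not the formula of the application: a field is modelled as a (name, type) pair, the name dimension receives the keyword weight only when both field names are target keywords and the common-field weight otherwise, and the result is normalised by the sum of the three weights.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Ratio from the text: keyword : common field : field type = 6 : 3 : 1.
KEYWORD_W, COMMON_W, TYPE_W = 6, 3, 1


def field_distance(field_a: tuple, field_b: tuple, keywords: set) -> float:
    """Weighted edit distance over the name and type dimensions of two fields."""
    (name_a, type_a), (name_b, type_b) = field_a, field_b
    # Assumption: the keyword weight applies only if both names are target keywords.
    both_keywords = name_a in keywords and name_b in keywords
    name_w = KEYWORD_W if both_keywords else COMMON_W
    total = (name_w * edit_distance(name_a, name_b)
             + TYPE_W * edit_distance(type_a, type_b))
    return total / (KEYWORD_W + COMMON_W + TYPE_W)


keywords = {"id_card_no", "id_card_num"}
d_same_type = field_distance(("id_card_no", "varchar"), ("id_card_num", "varchar"), keywords)
d_other_type = field_distance(("id_card_no", "varchar"), ("id_card_num", "char"), keywords)
```

Further design information, such as field length or default value, could be added as extra weighted terms in the same sum.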
In this step, since there is a great difference in expression form between the long text and the data dictionary table, a weighted calculation method of different dimensions can be added, and the second weight can be adjusted from a plurality of different dimensions. For example, a language system dimension adjustment factor of the language system dimension is determined, and the second weight is adjusted based on the language system dimension adjustment factor, so that the weights of Chinese characters and English are increased, and the weight of numbers is reduced. A text dimension adjustment factor for the text dimension can also be determined, and a second weight is adjusted based on the text dimension adjustment factor, so that the weight of the vocabulary is increased, and the weight of the short text is reduced.
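The language system dimension can be illustrated with a small sketch in which each character class carries its own factor and the factor of a keyword is the mean over its characters. The factor values (1.2 and 0.8), the averaging rule and the function names are assumptions chosen only to show the direction of the adjustment: Chinese characters and English letters weigh more, digits weigh less. The text dimension is not covered here.

```python
# Hypothetical factor values; the application does not specify them.
CJK_FACTOR = 1.2
LATIN_FACTOR = 1.2
DIGIT_FACTOR = 0.8


def char_factor(ch: str) -> float:
    """Adjustment factor of a single character according to its language system."""
    if "\u4e00" <= ch <= "\u9fff":        # CJK Unified Ideographs
        return CJK_FACTOR
    if ch.isascii() and ch.isalpha():     # English letters
        return LATIN_FACTOR
    if ch.isdigit():
        return DIGIT_FACTOR
    return 1.0                            # everything else is left unchanged


def language_factor(text: str) -> float:
    """Mean character factor of a keyword; empty text is left unadjusted."""
    if not text:
        return 1.0
    return sum(char_factor(ch) for ch in text) / len(text)


def adjust_second_weight(weight: float, keyword: str) -> float:
    """Scale the second weight by the language system adjustment factor."""
    return weight * language_factor(keyword)
```

For instance, `adjust_second_weight(6, "身份证")` yields a larger value than `adjust_second_weight(6, "2023")`, so purely numeric keywords influence the distance less.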
Step 207, calculating the table similarity between different data dictionary tables in the same fine category.
Step 208, if the table similarity is greater than the first similarity threshold, merging the data dictionary tables, and obtaining a target data dictionary according to the merged target data dictionary tables; if the similarity of the table is smaller than the first similarity threshold and larger than the second similarity threshold, extracting an intersection and a difference set of the data dictionary table, and obtaining a target data dictionary according to the intersection and the difference set; if the similarity of the table is smaller than a second similarity threshold, ending the merging and generating prompt information, wherein the prompt information is used for indicating that the similarity of the data dictionary is lower; or if the table similarity is smaller than the second similarity threshold, returning to the step of acquiring the design information of each field to acquire new design information, wherein the new design information comprises a field type, a field length and a default value.
In steps 207-208, the table similarity between every two data dictionary tables in the same fine category is determined based on the multidimensional weighted editing distance between fields, and table fusion is performed based on the fine classification result. Specifically, a first similarity threshold and a second similarity threshold can be preset according to historical experience and the actual application scenario. The two thresholds divide the similarity range into three intervals, and the interval into which the table similarity falls determines the fusion method to be adopted.
For example, the first similarity threshold may be set to 70% and the second similarity threshold to 30%. If the similarity between data dictionary table a and data dictionary table b is greater than or equal to 70%, the two tables are considered highly similar, and table a and table b are directly fused into data dictionary table c. The change in the number of data dictionary tables is then expressed as 1+1=1, that is, fusing the two data dictionary tables yields one new data dictionary table. If the similarity between table a and table b is between 30% and 70%, the two tables are considered to have a certain similarity. In this case the intersection d of table a and table b is taken, together with the difference set a' between a and d and the difference set b' between b and d, and d, a' and b' are used as new data dictionary tables. The change in the number of data dictionary tables is then expressed as 1+1=3, that is, fusing the two data dictionary tables yields three new data dictionary tables.
If the similarity between data dictionary table a and data dictionary table b is less than 30%, the similarity between the two tables is considered low. In this case the fusion operation on the two tables can be ended directly and corresponding prompt information generated. Alternatively, new design information, such as field types, field lengths and default values, can be acquired, the similarity calculation of the target keywords performed again, and the fusion operation on the data dictionary tables performed again once the similarity meets the conditions.
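The three similarity intervals can be sketched as a single function. The sketch assumes that a data dictionary table is reduced to a set of field names and that the table similarity has already been computed elsewhere; the function name `fuse_tables` and the wording of the prompt are hypothetical, while the 70% and 30% thresholds follow the example above.

```python
def fuse_tables(table_a: set, table_b: set, similarity: float,
                first_threshold: float = 0.7, second_threshold: float = 0.3):
    """Return (new tables, prompt) for the interval the similarity falls into."""
    if similarity >= first_threshold:
        # 1 + 1 = 1: both tables are merged into one table c.
        return [table_a | table_b], None
    if similarity >= second_threshold:
        # 1 + 1 = 3: intersection d plus the difference sets a' and b'.
        d = table_a & table_b
        return [d, table_a - d, table_b - d], None
    # Similarity too low: end the fusion and report it.
    return [], "similarity of the data dictionaries is low"


a = {"id_card_no", "name", "phone"}
b = {"id_card_no", "name", "address"}

merged, _ = fuse_tables(a, b, 0.8)
split, _ = fuse_tables(a, b, 0.5)
skipped, prompt = fuse_tables(a, b, 0.1)
```

The alternative branch for low similarity, in which new design information is acquired and the similarity is recalculated, would replace the last return with a retry loop and is omitted here.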
Fig. 5 shows a data dictionary fusion flowchart of an embodiment of the present application. The overall flow includes the steps of inputting the data dictionaries to be fused, disambiguating the data items of the data dictionaries, extracting semantic features of the data dictionaries, calculating similarity, and fusing the data dictionaries. In the input step, a plurality of data dictionaries and the corresponding data logic description files can be input, and the similarity algorithm to be used in the similarity calculation step can be determined. In the disambiguation step, a pre-trained model can be used to perform the specific disambiguation operations, which include constructing a synonym dictionary, training word vectors, neighbor searching and entity disambiguation, and an entity model with unified semantics is finally obtained. In the semantic feature extraction step, keywords are screened from the entity model with unified semantics: the keywords can be screened by the TF-IDF method, or based on data logic description information such as table names, primary keys, entity relationships and labels, and the two extraction methods can also be combined. In the similarity calculation step, the extracted keywords are weighted, a two-dimensional weighted editing distance is first calculated to perform the primary classification of the data dictionary tables, which improves classification efficiency, and the similarity is then judged within the same category; if the similarity is lower than the threshold, no fusion operation is performed, and if it is higher than the threshold, the flow proceeds to the data dictionary fusion step. In the data dictionary fusion step, depending on the similarity, SQL statements can be used to process the original tables and construct new data dictionary tables at the same time, thereby realizing data dictionary fusion.
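A minimal sketch of SQL-based fusion is given below, using Python's built-in `sqlite3` module and an in-memory database. The table names `dict_a`, `dict_b`, `dict_d`, `dict_a_rest` and `dict_b_rest`, the single `field_name` column and the sample rows are all hypothetical; the application does not state which database or statements are used. The sketch covers the middle similarity interval, where an intersection and two difference sets are built as new tables while the original tables stay in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical data dictionary tables, reduced to their field names.
    CREATE TABLE dict_a (field_name TEXT);
    CREATE TABLE dict_b (field_name TEXT);
    INSERT INTO dict_a VALUES ('id_card_no'), ('name'), ('phone');
    INSERT INTO dict_b VALUES ('id_card_no'), ('name'), ('address');

    -- Intersection d of the two tables.
    CREATE TABLE dict_d AS
        SELECT field_name FROM dict_a
        INTERSECT
        SELECT field_name FROM dict_b;

    -- Difference sets a' and b' relative to the intersection.
    CREATE TABLE dict_a_rest AS
        SELECT field_name FROM dict_a
        EXCEPT
        SELECT field_name FROM dict_d;
    CREATE TABLE dict_b_rest AS
        SELECT field_name FROM dict_b
        EXCEPT
        SELECT field_name FROM dict_d;
""")


def field_names(table: str) -> list:
    """Read the field names of one of the tables created above, sorted."""
    # The table name is interpolated directly; acceptable only for fixed, trusted names.
    rows = conn.execute(f"SELECT field_name FROM {table}")
    return sorted(row[0] for row in rows)
```

For the high-similarity interval, a `UNION` of the two `SELECT` statements would build the single merged table instead.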
It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present application.
Further, as a specific implementation of the multi-source data dictionary fusion method, an embodiment of the present application provides a multi-source data dictionary fusion device, as shown in fig. 6, where the device includes: the device comprises an acquisition module, a characteristic extraction module, a classification module and a fusion module.
The acquisition module is used for acquiring a plurality of data dictionaries to be fused and data logic description information corresponding to each data dictionary to be fused;
the feature extraction module is used for respectively extracting first keywords from the data dictionary tables of each data dictionary to be fused based on the data logic description information, wherein the data logic description information at least comprises one of the following: table names, primary keys, entity relationships and labels of the data dictionary table; extracting a second keyword from the data dictionary table based on the word frequency of each field in the data dictionary table, and fusing the first keyword and the second keyword to obtain a target keyword;
The classification module is used for calculating two-dimensional weighted distances between target keywords in different data dictionaries to be fused, and carrying out primary classification on the data dictionary tables corresponding to the target keywords according to the two-dimensional weighted distances to obtain coarse categories corresponding to each data dictionary table; and in the same rough category, calculating a multidimensional weighted editing distance between the fields, and reclassifying the data dictionary tables corresponding to the fields according to the multidimensional weighted editing distance to obtain a fine category corresponding to each data dictionary table;
and the fusion module is used for fusing the data dictionary tables according to the table similarity in the same fine category to obtain the target data dictionary.
In a specific application scenario, optionally, the classification module is configured to:
reading preset weight mapping information to obtain preset weights corresponding to target keywords;
based on the data logic description information, adjusting a preset weight to obtain a first weight corresponding to each target keyword;
and calculating two-dimensional weighted editing distances among the target keywords in different data dictionaries to be fused according to the first weights.
In a specific application scenario, optionally, the classification module is configured to:
in the same rough category, judging whether each field in the data dictionary table is a target keyword or not respectively, and acquiring design information of each field, wherein the design information at least comprises a field type;
And determining a second weight of the fields according to the judgment result and the design information corresponding to each field, and calculating a multidimensional weighted editing distance between the fields according to the second weight.
In a specific application scenario, optionally, the classification module is configured to:
and determining an adjustment factor of each target keyword, and adjusting the second weight according to the adjustment factors, wherein the adjustment factors comprise a language system dimension adjustment factor and/or a text dimension adjustment factor.
In a specific application scenario, optionally, the fusion module is configured to:
if the similarity of the tables is greater than a first similarity threshold, merging the data dictionary tables, and obtaining a target data dictionary according to the merged target data dictionary tables;
and if the similarity of the tables is smaller than the first similarity threshold and larger than the second similarity threshold, extracting an intersection and a difference set of the data dictionary tables, and obtaining the target data dictionary according to the intersection and the difference set.
In a specific application scenario, optionally, when fusing the data dictionary tables according to the table similarity to obtain a target data dictionary, the fusion module is further configured to:
if the table similarity is smaller than a second similarity threshold, ending the merging and generating prompt information, wherein the prompt information is used for indicating that the similarity of the data dictionaries is low; or,
If the table similarity is less than the second similarity threshold, returning to the step of obtaining the design information of each field to obtain new design information, wherein the new design information comprises a field type, a field length and a default value.
In a specific application scenario, optionally, the apparatus further includes a preprocessing module, configured to:
preprocessing the dictionary of data to be fused, wherein the preprocessing at least comprises entity disambiguation.
According to yet another aspect of the present application, there is provided a medium having stored thereon a program or instructions which, when executed by a processor, implement the above-described multi-source data dictionary fusion method.
It should be noted that, for other corresponding descriptions of each functional module related to the multi-source data dictionary fusion device provided in the embodiment of the present application, reference may be made to corresponding descriptions in the above method, which are not repeated herein.
Based on the above method, correspondingly, the embodiment of the application also provides a storage medium, on which a computer program is stored, and when the program is executed by a processor, the multi-source data dictionary fusion method is implemented.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing an electronic device (may be a personal computer, a server, or a network device, etc.) to perform the methods described in various implementation scenarios of the present application.
Based on the method shown in fig. 1 to 5 and the virtual device embodiment shown in fig. 6, in order to achieve the above object, the embodiment of the present application further provides an electronic device, which may specifically be a personal computer, a server, a network device, etc. The electronic device includes a storage medium and a processor: the storage medium stores a computer program, and the processor executes the computer program to implement the multi-source data dictionary fusion method described above and shown in fig. 1 to 5.
Optionally, the electronic device may also include a user interface, a network interface, a camera, radio Frequency (RF) circuitry, sensors, audio circuitry, WI-FI modules, and the like. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., bluetooth interface, WI-FI interface), etc.
It will be appreciated by those skilled in the art that the structure of the electronic device provided in this embodiment does not constitute a limitation on the electronic device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the electronic device and supports the execution of the information processing program as well as other software and/or programs. The network communication module is used for realizing communication among the components within the storage medium and communication with other hardware and software in the physical device.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the elements or processes in the drawings are not necessarily required to practice the present application. Those skilled in the art will appreciate that elements of an apparatus in an implementation may be distributed throughout the apparatus in an implementation as described in the implementation, or that corresponding variations may be located in one or more apparatuses other than the present implementation. The units of the implementation scenario may be combined into one unit, or may be further split into a plurality of sub-units.
The foregoing application serial numbers are merely for description, and do not represent advantages or disadvantages of the implementation scenario. The foregoing disclosure is merely a few specific implementations of the present application, but the present application is not limited thereto and any variations that can be considered by a person skilled in the art shall fall within the protection scope of the present application.

Claims (10)

1. A method of multi-source data dictionary fusion, the method comprising:
acquiring a plurality of data dictionaries to be fused and data logic description information corresponding to each data dictionary to be fused;
based on the data logic description information, extracting a first keyword from a data dictionary table of each data dictionary to be fused, wherein the data logic description information at least comprises one of the following: table names, primary keys, entity relations and labels of the data dictionary tables;
extracting a second keyword from the data dictionary table based on the word frequency of each field in the data dictionary table, and fusing the first keyword and the second keyword to obtain a target keyword;
calculating two-dimensional weighted editing distances among target keywords in different data dictionaries to be fused, and performing primary classification on the data dictionary tables corresponding to the target keywords according to the two-dimensional weighted editing distances to obtain coarse categories corresponding to each data dictionary table;
In the same rough category, calculating a multidimensional weighted editing distance between the fields, and reclassifying the data dictionary tables corresponding to the fields according to the multidimensional weighted editing distance to obtain a fine category corresponding to each data dictionary table;
and in the same subcategory, calculating the table similarity between different data dictionary tables, and fusing the data dictionary tables according to the table similarity to obtain the target data dictionary.
2. The method of claim 1, wherein calculating two-dimensional weighted edit distances between target keywords in different data dictionaries to be fused comprises:
reading preset weight mapping information to obtain preset weights corresponding to the target keywords;
based on the data logic description information, adjusting the preset weights to obtain first weights corresponding to the target keywords;
and calculating two-dimensional weighted editing distances among target keywords in different data dictionaries to be fused according to the first weights.
3. The method of claim 1, wherein, in the same coarse category, calculating a multidimensional weighted edit distance between the fields comprises:
in the same rough category, judging whether each field in the data dictionary table is the target keyword or not respectively, and acquiring design information of each field, wherein the design information at least comprises a field type;
And determining a second weight of each field according to the judgment result corresponding to each field and the design information, and calculating the multidimensional weighted editing distance between the fields according to the second weight.
4. A method according to claim 3, wherein prior to calculating the multidimensional weighted edit distance between the fields from the second weights, the method further comprises:
and determining an adjustment factor of each target keyword, and adjusting the second weight according to the adjustment factors, wherein the adjustment factors comprise a language system dimension adjustment factor and/or a text dimension adjustment factor.
5. The method of claim 4, wherein fusing the data dictionary tables according to the table similarity to obtain a target data dictionary comprises:
if the similarity of the table is greater than a first similarity threshold, fusing the data dictionary table, and obtaining the target data dictionary according to the fused target data dictionary table;
and if the similarity of the table is smaller than the first similarity threshold and larger than the second similarity threshold, extracting an intersection and a difference set of the data dictionary table, and obtaining the target data dictionary according to the intersection and the difference set.
6. The method of claim 5, wherein fusing the data dictionary tables according to the table similarity results in a target data dictionary, further comprising:
if the table similarity is smaller than a second similarity threshold, ending the merging and generating prompt information, wherein the prompt information is used for indicating that the similarity of the data dictionaries is low; or,
and if the table similarity is smaller than a second similarity threshold, returning to the step of acquiring the design information of each field to acquire new design information, wherein the new design information comprises a field type, a field length and a default value.
7. The method of claim 1, wherein after obtaining the plurality of data dictionaries to be fused, the method further comprises:
and preprocessing the data dictionary to be fused, wherein the preprocessing at least comprises entity disambiguation.
8. A multi-source data dictionary fusion apparatus, the apparatus comprising:
the acquisition module is used for acquiring a plurality of data dictionaries to be fused and data logic description information corresponding to each data dictionary to be fused;
the feature extraction module is used for respectively extracting first keywords from the data dictionary tables of each data dictionary to be fused based on the data logic description information, wherein the data logic description information at least comprises one of the following: table names, primary keys, entity relations and labels of the data dictionary tables; extracting a second keyword from the data dictionary table based on the word frequency of each field in the data dictionary table, and fusing the first keyword and the second keyword to obtain a target keyword;
The classification module is used for calculating two-dimensional weighted distances between target keywords in different data dictionaries to be fused, and carrying out primary classification on the data dictionary tables corresponding to the target keywords according to the two-dimensional weighted distances to obtain coarse categories corresponding to each data dictionary table; in the same rough category, calculating a multidimensional weighted editing distance between the fields, and reclassifying the data dictionary tables corresponding to the fields according to the multidimensional weighted editing distance to obtain a fine category corresponding to each data dictionary table;
and the fusion module is used for fusing the data dictionary tables according to the table similarity in the same fine category to obtain a target data dictionary.
9. A storage medium having stored thereon a program or instructions which, when executed by a processor, implement the method of any of claims 1 to 7.
10. An electronic device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the program.
CN202311345875.0A 2023-10-16 2023-10-16 Multi-source data dictionary fusion method and device, medium and equipment Pending CN117421421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311345875.0A CN117421421A (en) 2023-10-16 2023-10-16 Multi-source data dictionary fusion method and device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311345875.0A CN117421421A (en) 2023-10-16 2023-10-16 Multi-source data dictionary fusion method and device, medium and equipment

Publications (1)

Publication Number Publication Date
CN117421421A true CN117421421A (en) 2024-01-19

Family

ID=89531803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311345875.0A Pending CN117421421A (en) 2023-10-16 2023-10-16 Multi-source data dictionary fusion method and device, medium and equipment

Country Status (1)

Country Link
CN (1) CN117421421A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination