CN113032360A

CN113032360A - Method for inferring database field meaning

Info

Publication number: CN113032360A
Application number: CN202110239741.5A
Authority: CN
Inventors: 唐弋钧; 聂敏; 杨磊; 李春
Original assignee: Sichuan Hanku Zhishu Technology Co., Ltd.
Current assignee: Sichuan Hanku Zhishu Technology Co., Ltd.
Priority date: 2021-03-04
Filing date: 2021-03-04
Publication date: 2021-06-25

Abstract

The invention provides a method for inferring the meaning of a database field. Based on the characteristics of the database field and an accumulated knowledge base, the method predicts the meaning of fields whose meaning is unknown. It combines a knowledge base with machine learning techniques and uses several methods to infer the missing Chinese annotation of a field, so that the true meaning of the unknown field can be obtained more reliably, laying the groundwork for subsequent data governance.

Description

Method for inferring database field meaning

Technical Field

The invention particularly relates to a method for inferring the meaning of a database field.

Background

In today's highly informatized society, enterprises and public institutions operate a wide variety of information systems. For various reasons, the databases in these systems may suffer from problems such as missing comments on database fields, incomplete database documents, and unused database fields. These problems make system upgrades, data governance, data analysis, and system use difficult. Many business systems can only be scrapped and rebuilt because of them, wasting a large amount of manpower, material, and social resources.

The method infers the Chinese meaning of unknown database fields by several methods and uses the user's labels to confirm the accuracy of each inference, thereby providing a new solution to the above problems, with considerable social significance.

Disclosure of Invention

The present invention aims to provide a method for inferring the meaning of a database field that can solve the above problems.

To this end, the technical solution adopted by the invention is a method for inferring the meaning of a database field, comprising the following steps:

S1: compiling a knowledge base of common fields, and obtaining the three most common annotations of each common field name together with the score of each annotation;

S2: judging whether the field name is English or English-like, and if so, taking its Chinese translation as the meaning of the field;

S3: judging the similarity between the field without a Chinese annotation and the other fields with Chinese annotations in the same database, fields with high similarity being taken to have the same meaning;

S4: obtaining the different Chinese annotations and their scores according to the above methods;

S5: the user selecting the true meaning of the field from the recommended results, or writing a custom annotation if none of the results is suitable, the user's labels affecting the accuracy of subsequent recommendations.

The method for inferring the meaning of the database field has the following advantages:

The method combines a knowledge base with machine learning techniques and uses several methods to infer the missing Chinese annotation of a field, so that the true meaning of the unknown field can be obtained more reliably, laying the groundwork for subsequent data governance.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

Fig. 1 schematically shows a flow diagram according to an embodiment of the application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings and specific embodiments.

In the following description, references to "one embodiment," "an embodiment," "one example," "an example," etc., indicate that the embodiment or example so described may include a particular feature, structure, characteristic, property, element, or limitation, but every embodiment or example does not necessarily include the particular feature, structure, characteristic, property, element, or limitation. Moreover, repeated use of the phrase "in accordance with an embodiment of the present application" although it may possibly refer to the same embodiment, does not necessarily refer to the same embodiment.

Certain features that are well known to those skilled in the art have been omitted from the following description for the sake of simplicity.

According to an embodiment of the present application, there is provided a method for inferring the meaning of a database field, as shown in fig. 1, including the following steps:

S1: Compile the knowledge base of common fields, and obtain the three most common annotations of each common field name together with the score of each annotation.

S1.1: Acquire a batch of common database metadata, convert the database field names uniformly to uppercase, and retain the fields that have Chinese annotations, namely fields Z1, Z2, …, Zi. Establish a mapping table between fields and Chinese annotations, as shown in Table 1:

Field name (case-insensitive)	Chinese annotation
Z1	Name (I)
Z1	Name (R)
Z1	Name of student
Z2	Sex
Z3	Sex
Z4	College of academic
Z4	School
…	…

Table 1

S1.2: A field may have multiple Chinese annotations. For example, if field Zx has n Chinese annotations {x1, x2, x3, …, xn}, the annotations are sorted by their number of occurrences, the top three are retained, and they are assigned scores a1, a2, and a3 respectively (a1 > a2 > a3).

S1.3: Traverse all fields according to the rule of S1.2, so that each field with Chinese annotations retains n (n ≤ 3) Chinese meanings, each with a corresponding score.

S1.4: The Chinese meanings of the i fields and the score corresponding to each meaning are obtained in this way, and these data serve as the knowledge base.

S1.5: Given a database that contains a field without a Chinese annotation, the field name is matched against the knowledge base, and the corresponding n Chinese meanings and their scores are retrieved. This inference method is denoted method A.
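Steps S1.1 to S1.5 can be illustrated with a minimal Python sketch. The source does not give numeric values for a1, a2, and a3, so the scores in RANK_SCORES, the function names, and the sample rows are all assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical scores for the first, second, and third most frequent
# annotation (a1 > a2 > a3); the source does not specify numeric values.
RANK_SCORES = (1.0, 0.8, 0.6)


def build_knowledge_base(metadata_rows):
    """Map each upper-cased field name to its top three annotations with scores."""
    counts = defaultdict(Counter)
    for field_name, annotation in metadata_rows:
        if annotation:  # S1.1: keep only fields that carry a Chinese annotation
            counts[field_name.upper()][annotation] += 1

    knowledge_base = {}
    for field_name, counter in counts.items():
        # S1.2: rank annotations by occurrence count and keep at most three
        top_three = counter.most_common(3)
        knowledge_base[field_name] = [
            (annotation, RANK_SCORES[rank])
            for rank, (annotation, _count) in enumerate(top_three)
        ]
    return knowledge_base


def method_a(knowledge_base, field_name):
    """S1.5: look up an unannotated field by name; empty list if unknown."""
    return knowledge_base.get(field_name.upper(), [])


# Illustrative (field name, Chinese annotation) pairs, not taken from the source.
rows = [
    ("name", "姓名"), ("NAME", "姓名"), ("Name", "名称"),
    ("name", "学生姓名"), ("name", "名称"), ("name", "姓名"),
    ("sex", "性别"), ("remark", None),
]
kb = build_knowledge_base(rows)
print(method_a(kb, "name"))
```

A field that never appears with an annotation, such as `remark` in the sample rows, simply yields no candidates from method A and is left to the other methods.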

S2: Judge whether a field name is English or English-like; if so, its Chinese translation may be the meaning of the field.

S2.1: Acquire a batch of common database metadata and convert the database field names uniformly to uppercase, namely fields Z1, Z2, …, Zi. Determine programmatically whether each field name is an English word; if it is, obtain its Chinese translation through an online translation API.

S2.2: A field may have multiple Chinese translations. For example, if field Z1 has n Chinese translations {x1, x2, x3, …, xn}, the translations are sorted by their number of occurrences, the top three are retained, and they are assigned scores b1, b2, and b3 respectively (b1 > b2 > b3).

S2.3: Given a database that contains fields without Chinese annotations, the above procedure yields Chinese annotations and annotation scores for these fields. This method is denoted method B.
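The following Python sketch illustrates method B under stated assumptions. The source names neither the English-word check nor the translation API, so both are replaced here by hypothetical stand-ins: a small in-memory vocabulary and a dictionary lookup (`sample_translate`) that plays the role of the online service. The scores in RANK_SCORES are likewise illustrative, and "English-like" names such as abbreviations are not handled.

```python
from collections import Counter

# Hypothetical scores b1 > b2 > b3; the source gives no numeric values.
RANK_SCORES = (0.9, 0.7, 0.5)

# Stand-in for the responses of an online translation API. Each list holds
# the candidate translations returned for a word, repeats included, so that
# they can be ranked by occurrence count as in S2.2.
SAMPLE_TRANSLATIONS = {
    "NAME": ["名称", "姓名", "名称", "名字", "姓名", "名称"],
    "SCHOOL": ["学校", "学校", "学院"],
}

# Stand-in for an English dictionary used to decide whether a name is a word.
VOCABULARY = set(SAMPLE_TRANSLATIONS)


def sample_translate(word):
    """Hypothetical translation call; a real system would query a remote API."""
    return SAMPLE_TRANSLATIONS.get(word, [])


def is_english_word(field_name, vocabulary=VOCABULARY):
    """S2.1: treat the upper-cased field name as English if it is in the vocabulary."""
    return field_name.upper() in vocabulary


def method_b(field_name, translate=sample_translate, vocabulary=VOCABULARY):
    """Return up to three (translation, score) pairs, most frequent first."""
    if not is_english_word(field_name, vocabulary):
        return []
    counts = Counter(translate(field_name.upper()))
    return [
        (text, RANK_SCORES[rank])
        for rank, (text, _count) in enumerate(counts.most_common(3))
    ]


print(method_b("name"))
print(method_b("xm"))  # not an English word, so method B proposes nothing
```

Passing the translation function as a parameter keeps the ranking logic independent of whichever translation service is actually used.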

S3: Judge the similarity between a field without a Chinese annotation and the other fields with Chinese annotations in the same database; fields with high similarity may have the same meaning. For example, the fields id1 and id both represent the primary key.

S3.1: Take a field Z1 without a Chinese annotation and a field Z2 with a Chinese annotation, convert both to uppercase, and use the edit distance to obtain the number of edits needed to turn Z1 into Z2; the more edits, the lower the similarity. The edit distance between the two is denoted D1 and is calculated as follows:

The Levenshtein distance between two strings a and b is denoted lev_{a,b}(|a|, |b|), where |a| and |b| are the lengths of a and b, respectively. It can be expressed mathematically as:

$$
\operatorname{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\[4pt]
\min
\begin{cases}
\operatorname{lev}_{a,b}(i-1,j)+1\\
\operatorname{lev}_{a,b}(i,j-1)+1\\
\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$

The formula is explained below.

lev_{a,b}(i, j) is defined as the distance between the first i characters of a and the first j characters of b. For ease of understanding, i and j can be regarded as lengths. Character indices start at 1 (in practice because a row and a column for index 0 are added when the distance table is filled in), so the final edit distance is the value at i = |a|, j = |b|: lev_{a,b}(|a|, |b|).

When min(i, j) = 0, at least one of i and j is 0, meaning that one of the two prefixes is the empty string. Converting one prefix into the other then takes exactly max(i, j) single-character edits, so the edit distance is max(i, j), the larger of i and j.

Otherwise, the distance is the minimum over three cases: deleting a character, inserting a character, or substituting a_i with b_j, where the indicator 1_{(a_i ≠ b_j)} is 0 when the two characters are equal and 1 otherwise.
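The definition of lev_{a,b}(i, j) translates almost directly into a memoized recursion. The sketch below assumes the standard Levenshtein recurrence, in which an insertion, a deletion, and a substitution each cost 1; the function name `levenshtein` is chosen for illustration.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, written to mirror lev_{a,b}(i, j)."""

    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        # Base case: one prefix is empty, so max(i, j) single-character
        # edits are needed to produce the other prefix.
        if min(i, j) == 0:
            return max(i, j)
        # Indices in the definition start at 1, Python strings at 0.
        substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            lev(i - 1, j) + 1,                      # delete a character
            lev(i, j - 1) + 1,                      # insert a character
            lev(i - 1, j - 1) + substitution_cost,  # substitute or match
        )

    return lev(len(a), len(b))


print(levenshtein("ID1", "ID"))
```

The recursion depth grows with |a| + |b|, which is unproblematic for field names of ordinary length; an iterative table would be the safer choice for long strings.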

S3.2: Calculate the edit distance between Z1 and every field with a Chinese annotation in the same database to obtain a set of edit distances {D1, D2, D3, …, Dn}. Sort the n edit distances from smallest to largest, retain the fields with the three smallest edit distances together with their Chinese annotations, and assign these annotations scores c1, c2, and c3 respectively (c1 > c2 > c3).

S3.3: Given a database that contains fields without Chinese annotations, the above procedure yields Chinese annotations and annotation scores for these fields. This method is denoted method C.
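Method C can be sketched in Python as follows. The scores in RANK_SCORES, the function names, and the sample fields are assumptions made for illustration; the edit distance is computed with a two-row dynamic-programming table, a common iterative form of the Levenshtein distance.

```python
# Hypothetical scores c1 > c2 > c3; the source gives no numeric values.
RANK_SCORES = (0.8, 0.6, 0.4)


def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the distance table."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def method_c(unknown_field, annotated_fields):
    """Return the annotations of the three closest field names, with scores."""
    target = unknown_field.upper()
    ranked = sorted(
        annotated_fields.items(),
        key=lambda item: edit_distance(target, item[0].upper()),
    )[:3]
    return [
        (annotation, RANK_SCORES[rank])
        for rank, (_name, annotation) in enumerate(ranked)
    ]


# Illustrative annotated fields from one database (not from the source).
annotated = {
    "id": "主键",
    "user_name": "用户名",
    "idx_no": "索引号",
    "create_time": "创建时间",
}
print(method_c("id1", annotated))
```

In this sample, `id1` is one edit away from `id`, so the primary-key annotation is ranked first, matching the example given in S3.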

S4: For a field Zx whose Chinese meaning is unknown, obtain the different Chinese annotations and their scores according to the methods above.

S4.1: Method A yields a1, a2, a3; method B yields b1, b2, b3; method C yields c1, c2, c3.

S4.2: According to the effectiveness of methods A, B, and C, assign them weights x, y, z ∈ (0, 1), respectively. Each method's scores are multiplied by its weight; for method A, for example, the weighted scores are a1·x, a2·x, and a3·x. The same applies to methods B and C.

S4.3: Sort all the Chinese annotations obtained for field Zx by methods A, B, and C according to their weighted scores. The order of the sorted annotations indicates how likely each is to be the meaning of the field, and the annotations are recommended to the user in that order together with their scores.
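The weighting and ranking of step S4 can be sketched as below. The weight values, candidate scores, and function name are assumptions for illustration. The source does not say whether an annotation proposed by several methods should have its scores merged, so this sketch keeps each proposal as a separate entry.

```python
def combine(method_results, weights):
    """Weight each method's scores and rank all candidates, best first.

    method_results maps a method label to a list of (annotation, score) pairs;
    weights maps the same labels to a weight in the open interval (0, 1).
    """
    weighted = []
    for method, candidates in method_results.items():
        for annotation, score in candidates:
            # S4.2: multiply the method's score by the method's weight
            weighted.append((annotation, score * weights[method], method))
    # S4.3: a higher weighted score means a more likely meaning
    return sorted(weighted, key=lambda row: row[1], reverse=True)


# Illustrative candidates and weights (x, y, z), not taken from the source.
results = {
    "A": [("姓名", 1.0), ("名称", 0.8)],
    "B": [("名称", 0.9)],
    "C": [("用户名", 0.8)],
}
weights = {"A": 0.9, "B": 0.5, "C": 0.7}

for annotation, score, method in combine(results, weights):
    print(f"{annotation}\t{score:.2f}\tmethod {method}")
```

Keeping the method label on each row lets the recommendation shown to the user state where a suggestion came from.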

S5: The user selects the true meaning of the field from the recommended results. If none of the results is suitable, the user can write a custom annotation. The user's labels affect the accuracy of subsequent recommendations.

S5.1: If, across multiple databases, a field of the same name is labeled with the same Chinese annotation, or given the same custom annotation, more than n times, the mapping between that annotation and the field is added to the knowledge base of method A.

S5.2: If a field is given a custom annotation, the custom annotation is added to the knowledge base of method A, and its score is set to the lowest score among all Chinese annotations currently mapped to that field.
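The feedback rules of S5.1 and S5.2 can be sketched as follows. Several details are assumptions, since the source leaves them open: the score given to an annotation promoted under S5.1 (the lowest-score rule of S5.2 is reused here), the fallback DEFAULT_SCORE for a field with no scored annotation yet, and all function names.

```python
from collections import Counter

# Hypothetical fallback score for a field that has no scored annotation yet.
DEFAULT_SCORE = 0.6


def lowest_score(knowledge_base, field):
    """Lowest score among the annotations currently mapped to the field."""
    scores = [score for _annotation, score in knowledge_base.get(field, [])]
    return min(scores) if scores else DEFAULT_SCORE


def add_annotation(knowledge_base, field, annotation):
    """Add the annotation at the lowest current score unless already present."""
    entries = knowledge_base.setdefault(field, [])
    if all(existing != annotation for existing, _score in entries):
        entries.append((annotation, lowest_score(knowledge_base, field)))


def record_selection(knowledge_base, selection_counts, field, annotation, threshold):
    """S5.1: promote an annotation once it was chosen more than `threshold` times."""
    key = (field.upper(), annotation)
    selection_counts[key] += 1
    if selection_counts[key] > threshold:
        add_annotation(knowledge_base, field.upper(), annotation)


def record_custom(knowledge_base, field, annotation):
    """S5.2: a user-written annotation enters the knowledge base directly."""
    add_annotation(knowledge_base, field.upper(), annotation)


# Illustrative knowledge base and user feedback (not from the source).
kb = {"NAME": [("姓名", 1.0), ("名称", 0.8)]}
counts = Counter()

record_custom(kb, "name", "学生姓名")
for _ in range(3):
    record_selection(kb, counts, "xb", "性别", threshold=2)
print(kb)
```

Because new entries start at the lowest score, user feedback extends the candidates of method A without immediately displacing annotations that are already well supported.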

According to one embodiment of the application, the method for inferring the meaning of a database field evaluates the correct meaning of an unknown field comprehensively by several methods. The user's confirmation of the recommended Chinese meaning improves the accuracy of subsequent recommendations, and the edit distance is used to judge the similarity of database field names and thereby infer the Chinese meaning.

The embodiments described above represent only some embodiments of the present invention. Although they are described in specific detail, they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present invention is defined by the claims.

Claims

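1. A method for inferring the meaning of a database field, comprising the following steps:

S1: compiling a knowledge base of common fields, and obtaining the three most common annotations of each common field name together with the score of each annotation;

S2: judging whether the field name is English or English-like, and if so, taking its Chinese translation as the meaning of the field;

S3: judging the similarity between the field without a Chinese annotation and the other fields with Chinese annotations in the same database, fields with high similarity being taken to have the same meaning;

S4: obtaining the different Chinese annotations and their scores according to the above methods;

S5: the user selecting the true meaning of the field from the recommended results, or writing a custom annotation if none of the results is suitable, the user's labels affecting the accuracy of subsequent recommendations.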

2. The method for inferring the meaning of a database field according to claim 1, wherein step S1, compiling a knowledge base of common fields and obtaining the three most common annotations of each common field name together with the score of each annotation, is defined as method A and specifically comprises:

S1.1: acquiring database metadata, converting the database field names uniformly to uppercase, retaining the fields that have Chinese annotations, namely fields Z1, Z2, …, Zi, and establishing a mapping table between fields and Chinese annotations;

S1.2: where a field has multiple Chinese annotations, sorting the annotations by their number of occurrences, retaining the top three, and assigning them scores a1, a2, and a3 respectively, with a1 > a2 > a3;

S1.3: traversing all fields according to the rule of step S1.2, so that each field with Chinese annotations retains n Chinese meanings, where n ≤ 3, each with a corresponding score;

S1.4: obtaining, through the above steps, the Chinese meanings of the i fields and the score corresponding to each meaning, and using these data as the knowledge base; when a database contains a field without a Chinese annotation, matching the field name against the knowledge base and retrieving the corresponding Chinese meanings and their scores.

3. The method for inferring the meaning of a database field according to claim 2, wherein in step S2, judging whether the field name is English or English-like and, if so, taking its Chinese translation as the meaning of the field, the process of obtaining Chinese annotations and annotation scores for a field without a Chinese annotation in a database is defined as method B and specifically comprises:

S2.1: acquiring database metadata, converting the database field names uniformly to uppercase, namely fields Z1, Z2, …, Zi, determining programmatically whether each field name is an English word, and, if it is, translating it to obtain its Chinese meaning;

S2.2: where a field has multiple Chinese translations, sorting the n translations by their number of occurrences, retaining the top three, and assigning them scores b1, b2, and b3 respectively, with b1 > b2 > b3.

4. The method for inferring the meaning of a database field according to claim 3, wherein in step S3, judging the similarity between the field without a Chinese annotation and the other fields with Chinese annotations in the same database, fields with high similarity being taken to have the same meaning, the process of obtaining Chinese annotations and annotation scores for a field without a Chinese annotation in a database is defined as method C and specifically comprises:

S3.1: taking a field Z1 without a Chinese annotation and a field Z2 with a Chinese annotation, converting both to uppercase, and using the edit distance to obtain the number of edits needed to turn Z1 into Z2, wherein the more edits, the lower the similarity, the edit distance between the two being D1;

S3.2: calculating the edit distance between Z1 and every field with a Chinese annotation in the same database to obtain a set of edit distances {D1, D2, D3, …, Dn}, sorting the n edit distances from smallest to largest, retaining the fields with the three smallest edit distances together with their Chinese annotations, and assigning these annotations scores c1, c2, and c3 respectively, with c1 > c2 > c3.

5. The method for inferring the meaning of a database field according to claim 4, wherein step S4, obtaining the different Chinese annotations and their scores according to the above methods, specifically comprises:

S4.1: method A yields a1, a2, a3; method B yields b1, b2, b3; method C yields c1, c2, c3;

S4.2: according to the effectiveness of methods A, B, and C, assigning them weights x, y, z ∈ (0, 1), respectively;

S4.3: sorting the Chinese annotations obtained for the field by methods A, B, and C according to their scores, the order of the sorted annotations indicating how likely each is to be the meaning of the field, and recommending the annotations to the user in that order together with their scores.

6. The method for inferring the meaning of a database field according to claim 5, wherein step S5, the user selecting the true meaning of the field from the recommended results, or writing a custom annotation if none of the results is suitable, the user's labels affecting the accuracy of subsequent recommendations, specifically comprises:

S5.1: if, across multiple databases, a field of the same name is labeled with the same Chinese annotation, or given the same custom annotation, more than n times, adding the mapping between that annotation and the field to the knowledge base of method A;

S5.2: if a field is given a custom annotation, adding the custom annotation to the knowledge base of method A, with its score set to the lowest score among all Chinese annotations currently mapped to that field.