CN113032360A - Method for inferring database field meaning - Google Patents
- Publication number
- CN113032360A (application number CN202110239741.5A)
- Authority
- CN
- China
- Prior art keywords
- chinese
- field
- fields
- meaning
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/169—Annotation, e.g. comment data or footnotes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
Abstract
The invention provides a method for inferring the meaning of a database field. Based on the characteristics of the database fields and an accumulated knowledge base, it predicts the meaning of fields whose meaning is unknown. The method combines the knowledge base with machine learning technology and uses several methods to infer the missing Chinese annotation of a field, so that the real meaning of the unknown field can be obtained more reliably, which lays the groundwork for subsequent data management work.
Description
Technical Field
The invention relates to a method for inferring the meaning of a database field.
Background
In today's highly informatized society, enterprises and public institutions operate many information systems. For various reasons, the databases in these systems may suffer from problems such as missing comments on database fields, incomplete database files, and unused database fields. These problems make system upgrades, data governance, data analysis, and system use difficult. Many business systems can only be torn down and rebuilt because of these problems, which wastes a large amount of manpower, material resources, and social resources.
The method infers the Chinese meaning of unknown database fields using several methods and confirms the accuracy of each inference through the user's markings, thereby providing a new solution to the above problems, which is of considerable social significance.
Disclosure of Invention
The present invention is directed to a method for inferring the meaning of a database field, which can solve the above problems.
To this end, the technical scheme adopted by the invention is a method for inferring the meaning of a database field, comprising the following steps:
S1: build a knowledge base of common fields, obtaining for each common field name its three most common annotations and their scores;
S2: judge whether the field is English or English-like; if so, the meaning of the field is its Chinese translation;
S3: judge the similarity between a field without a Chinese annotation and the other fields with Chinese annotations in the same database; fields with high similarity have the same meaning;
S4: obtain the different Chinese annotations and their scores according to the above methods;
S5: the user selects the real meaning of the field from the recommended results; if none of the results is acceptable, the user writes a custom annotation; the results marked by the user affect the accuracy of subsequent recommendations.
The method for inferring the meaning of a database field has the following advantages:
The method combines the knowledge base with machine learning technology and uses several methods to infer the missing Chinese annotation of a field, so that the real meaning of the unknown field can be obtained more reliably, which lays the groundwork for subsequent data management work.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
Fig. 1 schematically shows a flow diagram according to an embodiment of the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings and specific embodiments.
In the following description, references to "one embodiment," "an embodiment," "one example," "an example," etc., indicate that the embodiment or example so described may include a particular feature, structure, characteristic, property, element, or limitation, but every embodiment or example does not necessarily include the particular feature, structure, characteristic, property, element, or limitation. Moreover, repeated use of the phrase "in accordance with an embodiment of the present application" although it may possibly refer to the same embodiment, does not necessarily refer to the same embodiment.
Certain features that are well known to those skilled in the art have been omitted from the following description for the sake of simplicity.
According to an embodiment of the present application, there is provided a method for inferring the meaning of a database field, as shown in fig. 1, including the following steps:
S1: Build the knowledge base of common fields and obtain, for each common field name, its three most common annotations and their scores.
S1.1: Acquire a batch of common database metadata, convert the database field names uniformly to uppercase, and retain the fields that have Chinese annotations; these comprise fields Z1, Z2, …, Zi. Establish a mapping table of fields and Chinese annotations, as shown in Table 1:
Field name (case-insensitive) | Chinese annotation |
---|---|
Z1 | Name (I) |
Z1 | Name (R) |
Z1 | Name of student |
Z2 | Sex |
Z3 | Sex |
Z4 | College of academic |
Z4 | School |
... | ... |
Table 1
S1.2: A field may have multiple Chinese annotations. For example, if field Zx has n Chinese annotations {x1, x2, x3, …, xn}, the annotations are ranked by number of occurrences, the top three are retained, and they are assigned scores a1, a2, and a3 respectively (a1 > a2 > a3).
S1.3: Traverse all fields according to the rule of S1.2, so that each field with Chinese annotations retains n (n ≤ 3) Chinese meanings, each with a corresponding score.
S1.4: The above steps yield the Chinese meanings of the i fields and the score corresponding to each meaning; this data serves as the knowledge base.
S1.5: Given a database that contains a field without a Chinese annotation, the field name can be matched against the knowledge base to find the corresponding n Chinese meanings and their scores. This inference method is denoted method A.
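Steps S1.1 to S1.5 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names `build_knowledge_base` and `method_a`, the score values in `RANK_SCORES` (standing in for a1 > a2 > a3, which the description does not fix), and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical values standing in for a1 > a2 > a3; the description does not fix them.
RANK_SCORES = (3.0, 2.0, 1.0)


def build_knowledge_base(rows, rank_scores=RANK_SCORES):
    """Map each upper-cased field name to its most frequent annotations and scores.

    rows is an iterable of (field_name, chinese_annotation) pairs. Pairs with an
    empty annotation are skipped, mirroring the filtering in step S1.1.
    """
    counts = defaultdict(Counter)
    for field, annotation in rows:
        if annotation:
            counts[field.upper()][annotation] += 1

    knowledge_base = {}
    for field, counter in counts.items():
        # Keep at most three annotations, ranked by number of occurrences (S1.2).
        top = counter.most_common(len(rank_scores))
        knowledge_base[field] = [
            (annotation, score) for (annotation, _), score in zip(top, rank_scores)
        ]
    return knowledge_base


def method_a(knowledge_base, field):
    """Look up an unannotated field by name (S1.5); empty list if the name is unknown."""
    return knowledge_base.get(field.upper(), [])


# Made-up metadata rows: the same field name appears in several databases.
rows = (
    [("name", "姓名")] * 3        # "full name"
    + [("NAME", "名称")] * 2      # "name / title"
    + [("Name", "学生姓名")]      # "student name"
    + [("sex", "性别"), ("remark", "")]
)
kb = build_knowledge_base(rows)
print(method_a(kb, "name"))
```

Because field names are upper-cased on both insertion and lookup, `name`, `NAME`, and `Name` share one knowledge-base entry, which matches the case-insensitive field names of Table 1.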
S2: Judge whether a field is English or English-like; if so, the meaning of the field may be its Chinese translation.
S2.1: Acquire a batch of common database metadata and convert the database field names uniformly to uppercase; these comprise fields Z1, Z2, …, Zi. Determine programmatically whether each field name is an English word; if it is, obtain its translated Chinese meaning through an online translation API.
S2.2: A field may have multiple Chinese translations. For example, if field Z1 has n Chinese translations {x1, x2, x3, …, xn}, the translations are ranked by number of occurrences, the top three are retained, and they are assigned scores b1, b2, and b3 (b1 > b2 > b3).
S2.3: Given a database that contains fields without Chinese annotations, the above steps obtain Chinese annotations and their scores for these fields. This method is denoted method B.
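A minimal sketch of method B is shown below. The description names neither the English-word test nor the translation API, so both are assumptions here: `is_english_word` checks the name against a caller-supplied vocabulary, and `translate` is any caller-supplied function returning candidate translations (a stub dictionary in the demo, not a real online service). The values in `RANK_SCORES` stand in for b1 > b2 > b3 and are likewise hypothetical.

```python
import re
from collections import Counter

# Hypothetical values standing in for b1 > b2 > b3.
RANK_SCORES = (3.0, 2.0, 1.0)


def is_english_word(field, vocabulary):
    """Assumed test: the name is purely alphabetic and appears in the vocabulary."""
    return bool(re.fullmatch(r"[A-Za-z]+", field)) and field.lower() in vocabulary


def method_b(field, vocabulary, translate, rank_scores=RANK_SCORES):
    """Return up to three (translation, score) pairs for an English field name.

    translate is a caller-supplied function taking a lowercase word and returning
    a list of candidate Chinese translations, possibly with repeats.
    """
    if not is_english_word(field, vocabulary):
        return []
    counts = Counter(translate(field.lower()))
    # Rank translations by number of occurrences and keep the top three (S2.2).
    top = counts.most_common(len(rank_scores))
    return [(text, score) for (text, _), score in zip(top, rank_scores)]


# Stub standing in for an online translation API; the data is made up.
STUB_TRANSLATIONS = {"name": ["名称", "名称", "名称", "姓名", "姓名", "名字"]}


def stub_translate(word):
    return STUB_TRANSLATIONS.get(word, [])


vocabulary = {"name", "gender"}
print(method_b("NAME", vocabulary, stub_translate))
print(method_b("XM", vocabulary, stub_translate))
```

Passing the translator in as a parameter keeps the sketch independent of any particular service; a field name that fails the English-word test simply yields no candidates from this method.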
S3: Judge the similarity between a field without a Chinese annotation and the other fields with Chinese annotations in the same database; fields with higher similarity may have the same meaning. For example, the fields id1 and id both represent the primary key.
S3.1: Take a field Z1 without a Chinese annotation and a field Z2 with a Chinese annotation, both converted to uppercase, and compute the number of edits between them using the edit distance; the more edits, the lower the similarity. The edit distance between the two is denoted D1 and is calculated as follows:
The Levenshtein distance of two strings a and b is denoted $\mathrm{lev}_{a,b}(|a|, |b|)$, where $|a|$ and $|b|$ are the lengths of a and b respectively. It can be described mathematically as follows:

$$
\mathrm{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1,j) + 1 \\
\mathrm{lev}_{a,b}(i,j-1) + 1 \\
\mathrm{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases}
& \text{otherwise.}
\end{cases}
$$

The formula is explained below:
Here $\mathrm{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b. For ease of understanding, i and j can be regarded as lengths. Character indices start from 1 (because the table used in the computation needs an extra row and column for index 0), so the final edit distance is the value at $i = |a|$, $j = |b|$: $\mathrm{lev}_{a,b}(|a|, |b|)$.
When $\min(i,j) = 0$, one of i and j is 0, which means that one of the two prefixes is an empty string. Converting one into the other then takes exactly $\max(i,j)$ single-character edit operations, so the edit distance between them is $\max(i,j)$, the larger of i and j.
Otherwise, the three terms of the minimum correspond to deleting a character, inserting a character, and substituting a character; the indicator $1_{(a_i \neq b_j)}$ is 0 when the i-th character of a equals the j-th character of b and 1 otherwise.
S3.2: Compute the edit distance between Z1 and every field with a Chinese annotation in the same database, obtaining a set of edit distances {D1, D2, D3, …, Dn}. Sort the n edit distances in ascending order, retain the fields with the three smallest edit distances together with their Chinese annotations, and assign them scores c1, c2, and c3 (c1 > c2 > c3).
S3.3: Given a database that contains fields without Chinese annotations, the above steps obtain Chinese annotations and their weights for these fields. This method is denoted method C.
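Method C can be sketched as follows. The `levenshtein` function fills the standard dynamic-programming table for the edit distance; `method_c`, the values in `RANK_SCORES` (standing in for c1 > c2 > c3), and the sample fields are hypothetical names and data chosen for illustration.

```python
# Hypothetical values standing in for c1 > c2 > c3.
RANK_SCORES = (3.0, 2.0, 1.0)


def levenshtein(a, b):
    """Edit distance between strings a and b via the dynamic-programming table."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] is the distance between the first i chars of a and first j chars of b.
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # min(i, j) == 0, so the distance is max(i, j) == i
    for j in range(cols):
        table[0][j] = j  # min(i, j) == 0, so the distance is max(i, j) == j
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                 # delete a character
                table[i][j - 1] + 1,                 # insert a character
                table[i - 1][j - 1] + substitution,  # substitute or keep
            )
    return table[-1][-1]


def method_c(field, annotated_fields, rank_scores=RANK_SCORES):
    """Rank annotated fields of the same database by edit distance to `field`.

    annotated_fields maps field name -> Chinese annotation. Returns up to three
    (field name, annotation, score) triples, nearest field first (S3.2).
    """
    target = field.upper()
    ranked = sorted(
        annotated_fields.items(),
        key=lambda item: levenshtein(target, item[0].upper()),
    )
    return [
        (name, annotation, score)
        for (name, annotation), score in zip(ranked, rank_scores)
    ]


# Made-up annotated fields from one database.
annotated = {"user_name": "用户名", "uid": "用户编号", "id": "主键"}
print(levenshtein("ID1", "ID"))
print(method_c("id1", annotated))
```

In the demo, `id1` is closest to `id`, so the annotation of `id` (primary key) receives the highest score, which mirrors the id1/id example in S3.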
S4: For a field Zx whose Chinese meaning is unknown, obtain the different Chinese annotations and their scores according to the methods above.
S4.1: Method A yields a1, a2, a3; method B yields b1, b2, b3; method C yields c1, c2, c3.
S4.2: According to the effectiveness of methods A, B, and C, assign them weights x, y, z ∈ (0, 1) respectively. The scores of method A are multiplied by its weight, giving a1x, a2x, a3x; the same applies to methods B and C.
S4.3: Sort the weighted scores of the Chinese annotations obtained for field Zx by methods A, B, and C. The order of the sorted annotations indicates how likely each Chinese meaning is, and the annotations are recommended to the user according to this order and the scores.
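The weighting and ranking of S4.2 and S4.3 can be sketched as below. The function name `recommend`, the weight values, and the candidate data are assumptions made for illustration. The description does not say whether identical annotations proposed by different methods are merged, so this sketch keeps them as separate entries.

```python
def recommend(candidates_by_method, weights):
    """Combine per-method scores into one ranked recommendation list.

    candidates_by_method maps a method name to a list of (annotation, score);
    weights maps the same method names to a weight in (0, 1).
    Returns (annotation, method, weighted score) triples, best first.
    """
    ranked = []
    for method, candidates in candidates_by_method.items():
        weight = weights[method]
        for annotation, score in candidates:
            # Each score is multiplied by the weight of its method (S4.2).
            ranked.append((annotation, method, score * weight))
    # Sort by weighted score, highest first (S4.3).
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


# Made-up candidates and weights for one unannotated field.
candidates = {
    "A": [("姓名", 3.0), ("名称", 2.0)],
    "B": [("名字", 3.0)],
    "C": [("用户名", 3.0)],
}
weights = {"A": 0.5, "B": 0.25, "C": 0.125}
for annotation, method, score in recommend(candidates, weights):
    print(annotation, method, score)
```

With these weights, a second-ranked annotation from method A can still outrank the top annotation of a less trusted method, which is the effect the per-method weights are meant to have.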
S5: The user selects the real meaning of the field from the recommended results; if none of the results is acceptable, the user can write a custom annotation. The results marked by the user affect the accuracy of subsequent recommendations.
S5.1: If, across multiple databases, a field of the same name is marked with the same Chinese annotation, or given the same custom Chinese annotation, more than n times, the mapping between that annotation and the field is added to the knowledge base of method A.
S5.2: If a field is given a custom name, the custom annotation is added to the knowledge base of method A, and its score is set to the lowest score among all Chinese annotations currently mapped to that field.
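The feedback loop of S5.1 and S5.2 might look like the sketch below. How the two steps interact is not fully specified, so this sketch makes an assumption: an annotation is promoted once it has been confirmed more than `threshold` times, and it enters the knowledge base with the lowest existing score for that field. The names `record_feedback` and `DEFAULT_SCORE`, and the fallback score for a field with no entries, are hypothetical.

```python
from collections import Counter

# Hypothetical score used when the field has no knowledge-base entry yet.
DEFAULT_SCORE = 1.0


def record_feedback(knowledge_base, feedback_counts, field, annotation, threshold):
    """Count one user-confirmed annotation and promote it when seen often enough.

    knowledge_base maps upper-cased field name -> list of (annotation, score).
    feedback_counts is a Counter keyed by (field name, annotation).
    Returns True if the annotation was added to the knowledge base by this call.
    """
    key = field.upper()
    feedback_counts[(key, annotation)] += 1
    entries = knowledge_base.setdefault(key, [])
    already_known = any(existing == annotation for existing, _ in entries)
    if feedback_counts[(key, annotation)] > threshold and not already_known:
        # New entries take the lowest score currently mapped to the field (S5.2).
        lowest = min((score for _, score in entries), default=DEFAULT_SCORE)
        entries.append((annotation, lowest))
        return True
    return False


# Made-up knowledge base and three confirmations of the same custom annotation.
kb = {"NAME": [("姓名", 3.0), ("名称", 2.0)]}
counts = Counter()
for _ in range(3):
    print(record_feedback(kb, counts, "name", "昵称", threshold=2))
print(kb["NAME"])
```

Giving a promoted annotation the lowest existing score keeps established annotations ranked first until further statistics justify a change.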
According to one embodiment of the application, the method for inferring the meaning of a database field comprehensively evaluates the correct meaning of an unknown field using several methods; the user's confirmation of the recommended Chinese meaning improves the accuracy of subsequent recommendations; and the edit distance is used to judge the similarity of database fields and thereby infer their Chinese meaning.
The above-mentioned embodiments only show some embodiments of the present invention, and the description thereof is more specific and detailed, but should not be construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the claims.
Claims (6)
1. A method for inferring the meaning of a database field, comprising the following steps:
S1: building a knowledge base of common fields, and obtaining for each common field name its three most common annotations and their scores;
S2: judging whether the field is English or English-like, and if so, taking the meaning of the field to be its Chinese translation;
S3: judging the similarity between a field without a Chinese annotation and the other fields with Chinese annotations in the same database, wherein fields with high similarity have the same meaning;
S4: obtaining the different Chinese annotations and their scores according to the above methods;
S5: the user selecting the real meaning of the field from the recommended results, and, if none of the results is acceptable, writing a custom annotation, wherein the results marked by the user affect the accuracy of subsequent recommendations.
2. The method for inferring the meaning of a database field according to claim 1, wherein step S1, building a knowledge base of common fields and obtaining for each common field name its three most common annotations and their scores, is defined as method A and specifically comprises:
S1.1: acquiring database metadata, converting the database field names uniformly to uppercase, retaining the fields that have Chinese annotations, comprising fields Z1, Z2, …, Zi, and establishing a mapping table of fields and Chinese annotations;
S1.2: where a field has multiple Chinese annotations, ranking the annotations by number of occurrences, retaining the top three, and assigning them scores a1, a2, and a3 respectively, where a1 > a2 > a3;
S1.3: traversing all fields according to the rule of step S1.2, so that each field with Chinese annotations retains n Chinese meanings, where n ≤ 3, each with a corresponding score;
S1.4: obtaining, through the above steps, the Chinese meanings of the i fields and the score corresponding to each meaning, and using this data as the knowledge base; and, given a database that contains a field without a Chinese annotation, matching the field name against the knowledge base to find the corresponding Chinese meanings and their scores.
3. The method for inferring the meaning of a database field according to claim 2, wherein in step S2, judging whether the field is English or English-like and, if so, taking the meaning of the field to be its Chinese translation, the process of obtaining Chinese annotations and their scores for a field without a Chinese annotation in a given database is defined as method B and specifically comprises:
S2.1: acquiring database metadata, converting the database field names uniformly to uppercase, comprising fields Z1, Z2, …, Zi, determining programmatically whether each field is an English word, and, if it is, translating it to obtain its Chinese meaning;
S2.2: where a field has n Chinese translations, ranking the translations by number of occurrences, retaining the top three, and assigning them scores b1, b2, and b3, where b1 > b2 > b3.
4. The method for inferring the meaning of a database field according to claim 3, wherein in step S3, judging the similarity between a field without a Chinese annotation and the other fields with Chinese annotations in the same database, fields with high similarity having the same meaning, the process of obtaining Chinese annotations and their weights for a field without a Chinese annotation in a given database is defined as method C and specifically comprises:
S3.1: taking a field Z1 without a Chinese annotation and a field Z2 with a Chinese annotation, both converted to uppercase, and computing the number of edits between them using the edit distance, wherein the more edits, the lower the similarity, the edit distance between the two being D1;
S3.2: computing the edit distance between Z1 and every field with a Chinese annotation in the same database to obtain a set of edit distances {D1, D2, D3, …, Dn}, sorting the n edit distances in ascending order, retaining the fields with the three smallest edit distances together with their Chinese annotations, and assigning them scores c1, c2, and c3, where c1 > c2 > c3.
5. The method for inferring the meaning of a database field according to claim 4, wherein step S4, obtaining the different Chinese annotations and their scores according to the above methods, specifically comprises:
S4.1: method A yields a1, a2, a3; method B yields b1, b2, b3; method C yields c1, c2, c3;
S4.2: according to the effectiveness of methods A, B, and C, assigning method A, method B, and method C weights x, y, z ∈ (0, 1) respectively;
S4.3: sorting the scores of the Chinese annotations obtained for the field by methods A, B, and C, wherein the order of the sorted annotations indicates how likely each Chinese meaning is, and recommending the annotations to the user according to this order and the scores.
6. The method for inferring the meaning of a database field according to claim 5, wherein step S5, in which the user selects the real meaning of the field from the recommended results and, if none of the results is acceptable, writes a custom annotation, the results marked by the user affecting the accuracy of subsequent recommendations, specifically comprises:
S5.1: if, across multiple databases, a field of the same name is marked with the same Chinese annotation, or given the same custom Chinese annotation, more than n times, adding the mapping between that annotation and the field to the knowledge base of method A;
S5.2: if a field is given a custom name, adding the custom annotation to the knowledge base of method A and setting its score to the lowest score among all Chinese annotations currently mapped to that field.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110239741.5A CN113032360A (en) | 2021-03-04 | 2021-03-04 | Method for inferring database field meaning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113032360A (en) | 2021-06-25 |
Family
ID=76467503
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110239741.5A Pending CN113032360A (en) | 2021-03-04 | 2021-03-04 | Method for inferring database field meaning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113032360A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150074507A1 (en) * | 2013-07-22 | 2015-03-12 | Recommind, Inc. | Information extraction and annotation systems and methods for documents |
CN104715032A (en) * | 2015-03-12 | 2015-06-17 | 福建工程学院 | Mapping system and method of Chinese and English table name and field name of report system |
CN108509199A (en) * | 2018-03-09 | 2018-09-07 | 平安科技(深圳)有限公司 | Automatically generate the method, apparatus, equipment and storage medium of Chinese annotation |
CN112181936A (en) * | 2019-07-03 | 2021-01-05 | 北京京东尚科信息技术有限公司 | Database detection method and device |
CN111078671A (en) * | 2019-12-19 | 2020-04-28 | 北京启迪区块链科技发展有限公司 | Method, device, equipment and medium for modifying data table field |
CN111061742A (en) * | 2019-12-25 | 2020-04-24 | 北京数起科技有限公司 | Method and device for marking data and service system thereof |
CN111831624A (en) * | 2020-07-14 | 2020-10-27 | 北京三快在线科技有限公司 | Data table creating method and device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
陈凯等: "源代码变量名的自动语义标注", 《计算机应用研究》, vol. 33, no. 11, pages 3349 - 3353 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||