CN113032360A - Method for inferring database field meaning - Google Patents

Info

Publication number
CN113032360A
Authority
CN
China
Prior art keywords
chinese
field
fields
meaning
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110239741.5A
Other languages
Chinese (zh)
Inventor
唐弋钧
聂敏
杨磊
李春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Hanku Zhishu Technology Co ltd
Original Assignee
Sichuan Hanku Zhishu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Hanku Zhishu Technology Co ltd
Priority to CN202110239741.5A
Publication of CN113032360A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases

Abstract

The invention provides a method for inferring the meaning of a database field. It predicts the meaning of database fields whose meaning is unknown, based on the characteristics of the fields and an accumulated knowledge base. The method combines a knowledge base with machine learning techniques and uses several methods to guess the missing Chinese annotation of a field, so that the real meaning of an unknown field can be recovered more reliably and the groundwork is laid for subsequent data governance work.

Description

Method for inferring database field meaning
Technical Field
The invention particularly relates to a method for inferring the meaning of a database field.
Background
In today's highly informatized society, enterprises and public institutions run many kinds of information systems. For various reasons, however, the databases behind these systems often suffer from problems such as missing field comments, incomplete database files, and unused database fields. These problems make system upgrades, data governance, data analysis, and everyday system use difficult. Many business systems end up being scrapped and rebuilt because of them, wasting a large amount of manpower, material resources, and social resources.
The method infers the Chinese meaning of unknown database fields by several methods and uses the user's markings to confirm how accurate the inferences are, thereby offering a new solution to the above problems; it has considerable social significance.
Disclosure of Invention
The present invention aims to provide a method for inferring the meaning of a database field that can solve the above problems.
To meet this requirement, the invention adopts the following technical scheme: a method for inferring the meaning of a database field, comprising the following steps:
S1: compiling a knowledge base of common fields, and obtaining the three most common annotations of each common field name together with their scores;
S2: determining whether the field is English or English-like; if so, the meaning of the field is its Chinese translation;
S3: determining the similarity between a field without a Chinese annotation and the other fields in the same database that have Chinese annotations, fields with high similarity having the same meaning;
S4: obtaining the different Chinese annotations and their scores from the above methods;
S5: the user selects the real meaning of the field from the recommended results; if none of the results is suitable, the user writes a custom annotation, and the user's markings affect the accuracy of subsequent recommendations.
The method for inferring the meaning of a database field has the following advantages:
The method combines a knowledge base with machine learning techniques and uses several methods to guess the missing Chinese annotation of a field, so that the real meaning of an unknown field can be recovered more reliably and the groundwork is laid for subsequent data governance work.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
Fig. 1 schematically shows a flow diagram according to an embodiment of the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings and specific embodiments.
In the following description, references to "one embodiment," "an embodiment," "one example," "an example," etc., indicate that the embodiment or example so described may include a particular feature, structure, characteristic, property, element, or limitation, but every embodiment or example does not necessarily include the particular feature, structure, characteristic, property, element, or limitation. Moreover, repeated use of the phrase "in accordance with an embodiment of the present application" although it may possibly refer to the same embodiment, does not necessarily refer to the same embodiment.
Certain features that are well known to those skilled in the art have been omitted from the following description for the sake of simplicity.
According to an embodiment of the present application, there is provided a method for inferring the meaning of a database field, as shown in Fig. 1, including the following steps:
S1: Compile a knowledge base of common fields, and obtain the three most common annotations of each common field name together with their scores.
S1.1: Acquire a batch of metadata from common databases, convert the database field names uniformly to uppercase, and retain the fields that have Chinese annotations; these comprise fields Z1, Z2, …, Zi. Build a mapping table of fields and their Chinese annotations, as shown in Table 1:
Field name (case-insensitive)    Chinese annotation
Z1                               Name (I)
Z1                               Name (R)
Z1                               Name of student
Z2                               Sex
Z3                               Sex
Z4                               College of academic
Z4                               School
....                             ....
Table 1
S1.2: A field may have several Chinese annotations. For example, if field Zx has n Chinese annotations {x1, x2, x3, …, xn}, they are ranked by number of occurrences, the top three are retained, and they are assigned scores a1, a2, and a3 respectively (a1 > a2 > a3).
S1.3: Traverse all fields according to the rule in S1.2, so that each field with Chinese annotations keeps n (n ≤ 3) Chinese meanings, each with a corresponding score.
S1.4: The above steps yield the Chinese meanings of the i fields and the score of each meaning; this data serves as the knowledge base.
S1.5: Given a database containing a field without a Chinese annotation, the field name is matched against the knowledge base to find the corresponding n Chinese meanings and their scores. This inference method is denoted method A.
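Steps S1.1 to S1.5 can be illustrated with a minimal Python sketch. The function names `build_knowledge_base` and `method_a` and the score values in `RANK_SCORES` are assumptions made for illustration; the description only requires a1 > a2 > a3 and does not fix concrete values.

```python
from collections import Counter, defaultdict

# Hypothetical score values; the description only requires a1 > a2 > a3.
RANK_SCORES = (1.0, 0.8, 0.6)


def build_knowledge_base(field_annotation_pairs):
    """Build {FIELD: [(annotation, score), ...]} keeping the top three annotations."""
    counts = defaultdict(Counter)
    for field, annotation in field_annotation_pairs:
        if annotation:  # S1.1: keep only fields that carry a Chinese annotation
            counts[field.upper()][annotation] += 1

    knowledge_base = {}
    for field, counter in counts.items():
        # S1.2: rank annotations by number of occurrences, keep at most three
        top_three = counter.most_common(3)
        knowledge_base[field] = [
            (annotation, RANK_SCORES[rank])
            for rank, (annotation, _count) in enumerate(top_three)
        ]
    return knowledge_base


def method_a(knowledge_base, field):
    """S1.5: look up an unannotated field by its upper-cased name."""
    return knowledge_base.get(field.upper(), [])
```

A field that never appears in the collected metadata simply yields an empty list, leaving the inference to the other methods.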
S2: Determine whether a field is English or English-like; if so, the meaning of the field may be its Chinese translation.
S2.1: Acquire a batch of metadata from common databases and convert the database field names uniformly to uppercase; these comprise fields Z1, Z2, …, Zi. Check programmatically whether each field is an English word and, if it is, obtain its Chinese meaning through an online translation API.
S2.2: A field may have several Chinese translations. For example, if field Z1 has n Chinese translations {x1, x2, x3, …, xn}, they are ranked by number of occurrences, the top three are retained, and they are assigned scores b1, b2, and b3 (b1 > b2 > b3).
S2.3: Given a database containing fields without Chinese annotations, the above procedure yields Chinese annotations and their scores for these fields; this method is denoted method B.
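Method B might be sketched as follows. The description does not name the translation service or the English-word check, so both are stand-ins here: `translate` is a hypothetical caller-supplied function that returns candidate Chinese translations, `vocabulary` is a hypothetical set of known English words, and the `RANK_SCORES` values are assumed (only b1 > b2 > b3 is required). Splitting on underscores and camel-case boundaries is also an assumption about what "English-like" covers.

```python
import re
from collections import Counter

# Hypothetical score values; the description only requires b1 > b2 > b3.
RANK_SCORES = (1.0, 0.8, 0.6)


def split_field_name(field):
    """Split names such as 'USER_NAME' or 'userName' into lower-case tokens."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", field)
    return [token.lower() for token in re.split(r"[^A-Za-z]+", spaced) if token]


def is_english_like(field, vocabulary):
    """Treat a field as English-like when every token is a known English word."""
    tokens = split_field_name(field)
    return bool(tokens) and all(token in vocabulary for token in tokens)


def method_b(field, vocabulary, translate):
    """Return up to three (translation, score) pairs for an English-like field."""
    if not is_english_like(field, vocabulary):
        return []
    phrase = " ".join(split_field_name(field))
    # S2.2: rank the candidate translations by how often they occur
    counts = Counter(translate(phrase))
    return [
        (annotation, RANK_SCORES[rank])
        for rank, (annotation, _count) in enumerate(counts.most_common(3))
    ]
```

Passing the translator in as a parameter keeps the sketch independent of any particular online API and makes it easy to test with a stub.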
S3: Determine the similarity between a field without a Chinese annotation and the other fields in the same database that have Chinese annotations; fields with higher similarity may have the same meaning. For example, the fields id1 and id both represent the primary key.
S3.1: Convert a field Z1 without a Chinese annotation and a field Z2 with a Chinese annotation to uppercase, and use the edit distance to obtain the number of edits needed to turn one into the other; the more edits, the lower the similarity. The edit distance between the two is D1, calculated as follows:
The Levenshtein distance between two strings a and b is written $\mathrm{lev}_{a,b}(|a|, |b|)$, where $|a|$ and $|b|$ are the lengths of a and b respectively. It can be described in the following mathematical language:

$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\[4pt]
\min\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1\\
\mathrm{lev}_{a,b}(i,j-1)+1\\
\mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$

The above formula is explained below:
$\mathrm{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b, so i and j can be read as prefix lengths. Character indices start from 1 (because index 0 is reserved for the empty prefix when the computation is carried out on a table), so the final edit distance is the value at $i=|a|$, $j=|b|$: $\mathrm{lev}_{a,b}(|a|,|b|)$.
When $\min(i,j)=0$, at least one of i and j is 0, which means one of the two prefixes is an empty string. Going from one to the other then takes exactly $\max(i,j)$ single-character edits, so the edit distance is $\max(i,j)$, the larger of i and j. Otherwise the distance is the smallest of three options: a deletion, an insertion, or a substitution, where the indicator $1_{(a_i\neq b_j)}$ is 0 when the two characters match and 1 when they differ.
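The edit distance described above is commonly computed with a dynamic-programming table. The sketch below is one standard way to do this in Python, not necessarily the implementation used by the applicant; the function name `levenshtein` is chosen for illustration.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the distance between the first i chars of a
    # and the first j chars of b.
    dist = [[0] * cols for _ in range(rows)]

    # Base case min(i, j) = 0: the distance is max(i, j).
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dist[-1][-1]
```

For the fields mentioned in S3, `levenshtein("ID1", "ID")` is 1, since a single deletion turns one name into the other.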
S3.2: Compute the edit distance between Z1 and every field with a Chinese annotation in the same database, giving a set of edit distances {D1, D2, D3, …, Dn}. Sort the n edit distances in ascending order, retain the fields with the three smallest distances together with their Chinese annotations, and assign them the values c1, c2, and c3 (c1 > c2 > c3).
S3.3: Given a database containing fields without Chinese annotations, the above procedure yields Chinese annotations and their weights (the values c1, c2, c3) for these fields; this method is denoted method C.
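The ranking in S3.2 can be sketched as below. The names `method_c` and `edit_distance` and the values in `RANK_SCORES` are assumptions (the description only requires c1 > c2 > c3), and `annotated_fields` is a hypothetical mapping from field name to Chinese annotation for one database. The helper uses a memory-saving two-row variant of the edit-distance computation so that the block is self-contained.

```python
# Hypothetical score values; the description only requires c1 > c2 > c3.
RANK_SCORES = (1.0, 0.8, 0.6)


def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the table in memory."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def method_c(unknown_field, annotated_fields):
    """Return up to three (field, annotation, score) tuples, nearest first."""
    target = unknown_field.upper()
    # Sort annotated fields by ascending edit distance to the unknown field.
    ranked = sorted(
        annotated_fields.items(),
        key=lambda item: edit_distance(target, item[0].upper()),
    )
    return [
        (name, annotation, RANK_SCORES[rank])
        for rank, (name, annotation) in enumerate(ranked[:3])
    ]
```

Because Python's sort is stable, fields at the same distance keep the order in which they appear in `annotated_fields`; the description does not say how ties should be broken.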
S4: For a field Zx whose Chinese meaning is unknown, obtain the different Chinese annotations and their scores from the methods above.
S4.1: Method A gives a1, a2, a3; method B gives b1, b2, b3; method C gives c1, c2, c3.
S4.2: Methods A, B, and C are assigned weights x, y, z ∈ (0, 1) respectively, according to how effective each method is. Each method's scores are multiplied by its weight, e.g. a1·x, a2·x, a3·x for method A; the same applies to methods B and C.
S4.3: The weighted scores of the Chinese annotations obtained for field Zx by methods A, B, and C are sorted; the resulting order expresses how likely each Chinese meaning is, and the annotations are recommended to the user according to this order and the scores.
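The weighting and sorting in S4 could look like the following sketch. The function name `recommend` and the dictionary layout are assumptions; the description also does not say whether identical annotations proposed by different methods should be merged, so this sketch keeps them as separate candidates.

```python
def recommend(candidates_by_method, weights):
    """Weight each method's scores and return candidates sorted best-first.

    candidates_by_method: {"A": [(annotation, score), ...], "B": [...], "C": [...]}
    weights: {"A": x, "B": y, "C": z} with each weight in (0, 1)
    """
    scored = []
    for method, candidates in candidates_by_method.items():
        for annotation, score in candidates:
            # S4.2: multiply the method's score by the method's weight
            scored.append((annotation, score * weights[method], method))
    # S4.3: the highest weighted score is recommended first
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

Keeping the originating method in each tuple makes it possible to show the user where a suggestion came from and to adjust the weights x, y, z later.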
S5: The user selects the real meaning of the field from the recommended results; if none of the results is suitable, the user can write a custom annotation. The user's markings affect the accuracy of subsequent recommendations.
S5.1: If, across multiple databases, fields with the same name are marked with the same Chinese annotation more than n times, or are given the same custom annotation, the mapping between that Chinese annotation and the field is added to the knowledge base of method A.
S5.2: If a field is given a custom name, the custom annotation is added to the knowledge base of method A, and its score is set to the lowest score among all Chinese annotations currently mapped to that field.
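The feedback loop in S5.1 and S5.2 might be sketched as follows. This is one possible reading that combines the two steps: an annotation is promoted once it has been confirmed more than `threshold` times (the n of S5.1) and enters the knowledge base with the field's current lowest score (S5.2). The class name `FeedbackRecorder` and the fallback `DEFAULT_SCORE` for fields with no existing entries are assumptions not found in the description.

```python
from collections import Counter

# Hypothetical score for a field that has no entries yet; not from the source.
DEFAULT_SCORE = 0.6


class FeedbackRecorder:
    """Count user confirmations and promote annotations into the knowledge base."""

    def __init__(self, knowledge_base, threshold):
        self.knowledge_base = knowledge_base  # {FIELD: [(annotation, score), ...]}
        self.threshold = threshold            # the "n" of step S5.1
        self.confirmations = Counter()

    def record(self, field, annotation):
        """Record one user marking; return True if the knowledge base changed."""
        key = (field.upper(), annotation)
        self.confirmations[key] += 1
        if self.confirmations[key] <= self.threshold:
            return False  # not yet marked more than n times

        entries = self.knowledge_base.setdefault(key[0], [])
        if any(existing == annotation for existing, _score in entries):
            return False  # already part of the knowledge base

        # S5.2: a new entry takes the lowest score currently mapped to the field
        lowest = min((score for _annotation, score in entries), default=DEFAULT_SCORE)
        entries.append((annotation, lowest))
        return True
```

Calling `record` each time a user confirms or types an annotation is enough to let later lookups by method A return the newly learned mapping.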
According to one embodiment of the application, the method for inferring the meaning of a database field evaluates the correct meaning of an unknown field by combining several methods. The user's confirmation of the recommended Chinese meaning improves the accuracy of subsequent recommendations, and the edit distance is used to judge the similarity of database fields, from which the Chinese meaning is inferred.
The above-mentioned embodiments only show some embodiments of the present invention, and the description thereof is more specific and detailed, but should not be construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the claims.

Claims (6)

1. A method of inferring a meaning of a database field, comprising the steps of:
S1: compiling a knowledge base of common fields, and obtaining the three most common annotations of each common field name together with their scores;
S2: determining whether the field is English or English-like; if so, the meaning of the field is its Chinese translation;
S3: determining the similarity between a field without a Chinese annotation and the other fields in the same database that have Chinese annotations, fields with high similarity having the same meaning;
S4: obtaining the different Chinese annotations and their scores from the above methods;
S5: the user selects the real meaning of the field from the recommended results; if none of the results is suitable, the user writes a custom annotation, and the user's markings affect the accuracy of subsequent recommendations.
2. The method for inferring the meaning of a database field according to claim 1, wherein step S1, compiling a knowledge base of common fields and obtaining the three most common annotations of each common field name together with their scores, is defined as method A and specifically comprises:
S1.1: acquiring database metadata, converting the database field names uniformly to uppercase, retaining the fields that have Chinese annotations, these comprising fields Z1, Z2, …, Zi, and building a mapping table of fields and Chinese annotations;
S1.2: where a field has several Chinese annotations, ranking them by number of occurrences, retaining the top three, and assigning them scores a1, a2, and a3 respectively, with a1 > a2 > a3;
S1.3: traversing all fields according to the rule of step S1.2, so that each field with Chinese annotations keeps n Chinese meanings, where n ≤ 3, each Chinese meaning having a corresponding score;
S1.4: obtaining, through the above steps, the Chinese meanings of the i fields and the score corresponding to each Chinese meaning, and using this data as a knowledge base; given a database containing a field without a Chinese annotation, matching the field name against the knowledge base and finding the corresponding Chinese meanings and their scores.
3. The method for inferring the meaning of a database field according to claim 2, wherein in step S2, determining whether the field is English or English-like and, if so, taking its Chinese translation as the meaning of the field, given a database containing a field without a Chinese annotation, the process of obtaining Chinese annotations and their scores for that field is defined as method B and specifically comprises:
S2.1: acquiring database metadata, converting the database field names uniformly to uppercase, these comprising fields Z1, Z2, …, Zi, checking programmatically whether the fields are English words, and, if a field is an English word, translating it and obtaining the translated Chinese meaning;
S2.2: where a field has n Chinese translations, ranking them by number of occurrences, retaining the top three, and assigning them scores b1, b2, and b3, with b1 > b2 > b3.
4. The method for inferring the meaning of a database field according to claim 3, wherein in step S3, determining the similarity between a field without a Chinese annotation and the other fields in the same database that have Chinese annotations, fields with high similarity having the same meaning, given a database containing a field without a Chinese annotation, the process of obtaining Chinese annotations and their weights for such fields is defined as method C and specifically comprises:
S3.1: converting a field Z1 without a Chinese annotation to uppercase and a field Z2 with a Chinese annotation to uppercase, and obtaining through the edit distance the number of edits between Z1 and Z2, wherein the more edits, the lower the similarity, the edit distance between the two being D1;
S3.2: computing the edit distance between Z1 and every field with a Chinese annotation in the same database, obtaining a set of edit distances {D1, D2, D3, …, Dn}, sorting the n edit distances in ascending order, retaining the fields with the three smallest edit distances together with their Chinese annotations, and assigning them the values c1, c2, and c3, with c1 > c2 > c3.
5. The method for inferring the meaning of a database field according to claim 4, wherein step S4, obtaining the different Chinese annotations and their scores from the above methods, specifically comprises:
S4.1: method A gives a1, a2, a3; method B gives b1, b2, b3; method C gives c1, c2, c3;
S4.2: according to the effectiveness of methods A, B, and C, assigning method A, method B, and method C the weights x, y, z ∈ (0, 1) respectively;
S4.3: sorting the scores of the Chinese annotations obtained for the field by methods A, B, and C, the sorted order of the Chinese annotations indicating how likely each Chinese meaning is, and recommending the annotations to the user according to this order and the scores.
6. The method for inferring the meaning of a database field according to claim 5, wherein step S5, in which the user selects the real meaning of the field from the recommended results, writes a custom annotation if none of the results is suitable, and the user's markings affect the accuracy of subsequent recommendations, specifically comprises:
S5.1: if, across multiple databases, fields with the same name are marked with the same Chinese annotation more than n times, or are given the same custom annotation, adding the mapping between that Chinese annotation and the field to the knowledge base of method A;
S5.2: if a field is given a custom name, adding the custom annotation to the knowledge base of method A, and setting its score to the lowest score among all Chinese annotations currently mapped to that field.
CN202110239741.5A 2021-03-04 2021-03-04 Method for inferring database field meaning Pending CN113032360A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239741.5A CN113032360A (en) 2021-03-04 2021-03-04 Method for inferring database field meaning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110239741.5A CN113032360A (en) 2021-03-04 2021-03-04 Method for inferring database field meaning

Publications (1)

Publication Number Publication Date
CN113032360A (en) 2021-06-25

Family

ID=76467503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110239741.5A Pending CN113032360A (en) 2021-03-04 2021-03-04 Method for inferring database field meaning

Country Status (1)

Country Link
CN (1) CN113032360A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150074507A1 (en) * 2013-07-22 2015-03-12 Recommind, Inc. Information extraction and annotation systems and methods for documents
CN104715032A (en) * 2015-03-12 2015-06-17 福建工程学院 Mapping system and method of Chinese and English table name and field name of report system
CN108509199A (en) * 2018-03-09 2018-09-07 平安科技(深圳)有限公司 Automatically generate the method, apparatus, equipment and storage medium of Chinese annotation
CN111061742A (en) * 2019-12-25 2020-04-24 北京数起科技有限公司 Method and device for marking data and service system thereof
CN111078671A (en) * 2019-12-19 2020-04-28 北京启迪区块链科技发展有限公司 Method, device, equipment and medium for modifying data table field
CN111831624A (en) * 2020-07-14 2020-10-27 北京三快在线科技有限公司 Data table creating method and device, computer equipment and storage medium
CN112181936A (en) * 2019-07-03 2021-01-05 北京京东尚科信息技术有限公司 Database detection method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Kai et al.: "Automatic semantic annotation of source code variable names", Application Research of Computers (《计算机应用研究》), vol. 33, no. 11, pages 3349 - 3353 *

Similar Documents

Publication Publication Date Title
CN109241251B (en) Conversation interaction method
CN111222305B (en) Information structuring method and device
Gweon et al. Three methods for occupation coding based on statistical learning
WO2020007028A1 (en) Medical consultation data recommendation method, device, computer apparatus, and storage medium
JP2020123318A (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program for determining text relevance
WO2021114810A1 (en) Graph structure-based official document recommendation method, apparatus, computer device, and medium
TW201841121A (en) A method of automatically generating semantic similar sentence samples
WO2021175005A1 (en) Vector-based document retrieval method and apparatus, computer device, and storage medium
CN112651236B (en) Method and device for extracting text information, computer equipment and storage medium
US20090234852A1 (en) Sub-linear approximate string match
CN108509490B (en) Network hot topic discovery method and system
CN110598787B (en) Software bug classification method based on self-defined step length learning
CN115659226A (en) Data processing system for acquiring APP label
Trabelsi et al. SeLaB: Semantic labeling with BERT
CN109189848B (en) Knowledge data extraction method, system, computer equipment and storage medium
KR101917139B1 (en) Server for providing job platform and operating method thereof
CN116955538B (en) Medical dictionary data matching method and device, electronic equipment and storage medium
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
CN112287657A (en) Information matching system based on text similarity
CN113032360A (en) Method for inferring database field meaning
CN116805044A (en) Label acquisition method, electronic equipment and storage medium
CN115017269B (en) Data processing system for determining similar texts
CN114969371A (en) Heat sorting method and device of combined knowledge graph
CN110019829A (en) Data attribute determines method, apparatus
CN110633363B (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination