CN111191430A

CN111191430A - Automatic table building method and device, computer equipment and storage medium

Info

Publication number: CN111191430A
Application number: CN201911371969.9A
Authority: CN
Inventors: 欧阳智; 李明轩
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2020-05-22
Anticipated expiration: 2039-12-27
Also published as: CN111191430B

Abstract

The application relates to the technical field of database table building, and in particular to an automatic table building method and device, computer equipment, and a storage medium. The method comprises the following steps: acquiring historical table building data, extracting data whose occurrence frequency is larger than a preset frequency threshold from the historical table building data as roots, and sorting all the roots to generate a root bank; receiving table building information input by a user, extracting keywords from the table building information, and performing positive direction matching between the keywords and the roots in the root bank to obtain a first matching result; performing inverse direction matching between the keywords and the roots in the root bank to obtain a second matching result; and comparing the first matching result with the second matching result and then building the table. Because the results of forward matching and reverse matching are compared, word segmentation can be obtained as accurately as possible, and the user's requirements on the table can be met quickly and accurately.

Description

Automatic table building method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of database table building technologies, and in particular, to an automatic table building method and apparatus, a computer device, and a storage medium.

Background

In a database, a table (TABLE) is an object used for storing data; it is a set of structured data and the basis of the entire database system. In an operating system, by contrast, building a table means that, in storage management, the system builds a mapping table for each process.

At present, a table building assistant is usually used as the table building tool. The table building assistant is a submodule embedded in a metadata platform; its purpose is to standardize data table building, and it matches the SQL required by a user with the basic roots maintained in root management as a support.

However, the existing table creation assistant cannot perform word segmentation accurately when creating a table, so that the requirements of a user on the table cannot be met quickly and accurately.

Disclosure of Invention

In view of this, to address the problem that the existing table building assistant cannot perform word segmentation accurately when building a table, and therefore cannot meet the user's requirements on the table quickly and accurately, an automatic table building method, an automatic table building device, computer equipment, and a storage medium are provided.

An automatic table building method, comprising the steps of:

acquiring historical tabulation data, extracting data with the occurrence frequency larger than a preset frequency threshold value from the historical tabulation data as a root word, and sequencing all the root words to generate a root word library;

receiving table building information input by a user, extracting keywords in the table building information, and performing positive direction matching on the keywords and roots in the root bank to obtain a first matching result;

performing inverse direction matching on the keywords and the roots in the root bank to obtain a second matching result;

giving different weights to each element in the first matching result, extracting a root word in the first matching result, and multiplying the root word in the first matching result by the corresponding weight to obtain a first score;

giving different weights to each element in the second matching result, extracting a root word in the second matching result, and multiplying the root word in the second matching result by the corresponding weight to obtain a second score;

and if the first score is consistent with the second score, the first matching result or the second matching result is used as a word segmentation scheme for building a table, otherwise, the matching result with a low score in the first score and the second score is used as the word segmentation scheme for building the table.

In one possible embodiment, the acquiring historical tabulation data, extracting data with the occurrence frequency larger than a preset frequency threshold value from the historical tabulation data as a root word, and sequencing all the root words to generate a root word library includes:

acquiring a plurality of historical tabular building data, and respectively cleaning the historical tabular building data;

sorting the data after data cleaning according to the occurrence times of preset words, and taking the preset word with the largest occurrence time as a root word;

and summarizing all roots, and sequencing according to the occurrence times to generate the root bank.

In one possible embodiment, the receiving table building information input by a user, extracting keywords in the table building information, and performing positive direction matching on the keywords and roots in the root bank to obtain a first matching result includes:

receiving a table building instruction input by a user, and acquiring table building information required by the user from the table building instruction, wherein the table building information comprises a table field and a table name;

extracting key characters in the table fields, and splicing the key characters and the table names to obtain the key words;

acquiring the number of characters of each root in the root bank, and taking the maximum value N of these numbers as a first number of characters; acquiring the number of characters of the keywords, and taking it as a second number of characters;

if the second number of characters is smaller than the first number of characters, matching along the positive direction of the root bank by taking the keywords as query conditions, and extracting the root with the first number of characters equal to the second number of characters as the first matching result;

if the second number of characters is not less than the first number of characters, intercepting the first N characters in the keywords as query conditions, matching along the positive direction of the root bank, and extracting the root of which the first number of characters is equal to the second number of characters as the first matching result.

In one possible embodiment, the obtaining a second matching result after performing inverse direction matching on the keyword and the root word in the root word library includes:

if the second number of characters is smaller than the first number of characters, matching along the reverse direction of the root bank by taking the keywords as query conditions, and extracting the root of which the first number of characters is equal to the second number of characters as a second matching result;

if the second number of characters is not less than the first number of characters, intercepting the last N characters in the keywords as query conditions, matching along the reverse direction of the root bank, and extracting the roots of which the first number of characters is equal to the second number of characters as the second matching result.

In one possible embodiment, the giving different weights to the elements in the first matching result includes:

dividing elements in the first matching result into single words, word roots or non-word roots according to word attributes;

obtaining the word segmentation quantity of the root word according to the table building information input by the user;

and giving different weights to the single words, the word segmentation quantity and the non-root words.

In one possible embodiment, after the performing the table-building with the first matching result or the second matching result as the word segmentation scheme, the method further includes:

respectively acquiring a first table main key for building a table according to the first matching result and a second table main key for building a table according to the second matching result;

and comparing the lengths of the first table main key and the second table main key, if the lengths are consistent, marking the table as successful table building, otherwise, re-scoring the first matching result and the second matching result until the table building is successful.

An automatic table building device comprises the following modules:

the root library module is used for acquiring the historical tabulation data, extracting data with the occurrence frequency larger than a preset frequency threshold value from the historical tabulation data as roots, and sequencing all the roots to generate a root library;

the first matching result module is used for receiving the table building information input by the user, extracting keywords in the table building information, and performing positive direction matching on the keywords and the roots in the root bank to obtain a first matching result;

the second matching result module is used for obtaining a second matching result after the keyword is subjected to inverse direction matching with the root word in the root word library;

the first scoring module is configured to give different weights to each element in the first matching result, extract a root word in the first matching result, and multiply the root word in the first matching result and the corresponding weight to obtain a first score;

the second scoring module is configured to give different weights to each element in the second matching result, extract a root word in the second matching result, and multiply the root word in the second matching result and the corresponding weight to obtain a second score;

and the table building module is set to build a table by taking the first matching result or the second matching result as a word segmentation scheme if the first score is consistent with the second score, or else, build a table by taking the matching result with the lower score in the first score and the second score as the word segmentation scheme.

In one possible embodiment, the first matching result generating module is further configured to:

A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the above-described automatic table building method.

A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above-described automatic table creation method.

Compared with the existing mechanism, the method carries out accurate word segmentation through the forward matching and reverse matching, and the specific steps are as follows: acquiring historical tabulation data, extracting data with the occurrence frequency larger than a preset frequency threshold value from the historical tabulation data as a root word, and sequencing all the root words to generate a root word library; receiving table building information input by a user, extracting keywords in the table building information, and performing positive direction matching on the keywords and roots in the root bank to obtain a first matching result; performing inverse direction matching on the keywords and the roots in the root bank to obtain a second matching result; giving different weights to each element in the first matching result, extracting a root word in the first matching result, and multiplying the root word in the first matching result by the corresponding weight to obtain a first score; giving different weights to each element in the second matching result, extracting a root word in the second matching result, and multiplying the root word in the second matching result by the corresponding weight to obtain a second score; and if the first score is consistent with the second score, the first matching result or the second matching result is used as a word segmentation scheme for building a table, otherwise, the matching result with a low score in the first score and the second score is used as the word segmentation scheme for building the table. The method and the device can accurately obtain the word segmentation to a greater extent, and quickly and accurately meet the requirements of the user on the table.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application.

FIG. 1 is a general flow diagram of an automatic table creation method in one embodiment of the present application;

FIG. 2 is a schematic diagram illustrating a root-of-word repository generation process in an automatic table creation method according to an embodiment of the present application;

FIG. 3 is a diagram illustrating a first matching result generation process in an automatic table creation method according to an embodiment of the present application;

FIG. 4 is a block diagram of an automatic table creation device in one embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Fig. 1 is an overall flowchart of an automatic table building method according to an embodiment of the present application, and the automatic table building method includes the following steps:

S1, obtaining historical tabulation data, extracting data with the occurrence frequency larger than a preset frequency threshold value from the historical tabulation data as a root word, and sequencing all the root words to generate a root word library;

Specifically, the source of the historical table building data may be an internal database or the internet. If the data comes from the internal database, a plurality of previously created tables can be retrieved from the internal database by keyword query. If the data source is the internet, technologies such as web crawlers can be used to crawl relevant table building data from cloud servers on the internet.

The frequency threshold in this step may be determined according to the type of table; common table types include BDB, HEAP, ISAM, MERGE, MyISAM, InnoDB, and the like. Different table types correspond to different frequency thresholds, for example, 10 for InnoDB and 5 for HEAP.

S2, receiving table building information input by a user, extracting keywords in the table building information, and performing positive direction matching on the keywords and the roots in the root bank to obtain a first matching result;

Specifically, the keywords mainly refer to the table name, the table type, and characteristic characters appearing in the fields of the table to be built, such as numerical values and special symbols. Positive direction matching means matching once from the left starting point to the right end point of the streaming data of the root bank. If the roots in the root bank are not arranged as streaming data, the data in the root bank is first converted into a streaming data arrangement.

S3, performing inverse direction matching on the keywords and the roots in the root bank to obtain a second matching result;

Specifically, inverse direction matching uses the same matching method as positive direction matching; the only difference is the direction, i.e., one proceeds from left to right and the other from right to left.
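One way to read the right-to-left pass is as standard reverse maximum matching over the keyword string. The sketch below assumes that reading; the function name `reverse_max_match` and the sample roots are hypothetical and do not come from the application.

```python
def reverse_max_match(text, roots):
    """Segment text from right to left, preferring the longest root each time.

    Hypothetical helper: a sketch of reverse maximum matching, not the
    application's own implementation.
    """
    root_set = set(roots)
    # N: the character count of the longest root in the root bank.
    max_len = max((len(r) for r in root_set), default=1)
    segments = []
    end = len(text)
    while end > 0:
        # Take at most the last N unmatched characters as the field to match.
        size = min(max_len, end)
        # Drop the leftmost character until a root matches or one is left.
        while size > 1 and text[end - size:end] not in root_set:
            size -= 1
        segments.append(text[end - size:end])
        end -= size
    segments.reverse()  # restore left-to-right order
    return segments


sample_roots = ["policy", "policyholder", "holdername", "name"]
print(reverse_max_match("policyholdername", sample_roots))
```

With these sample roots the right-to-left pass cuts off "holdername" first, so the remaining prefix is matched as "policy".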

S4, giving different weights to each element in the first matching result, extracting a root word in the first matching result, and multiplying the root word in the first matching result by the corresponding weight to obtain a first score;

Specifically, the first matching result mainly includes three kinds of elements: single words, roots, and non-roots. When weighting is performed, the number of word segments obtained for the roots needs to be used as a parameter. For example, according to the actual scene, the weights may be 0.3 for single words, 0.4 for the number of word segments, and 0.3 for non-root words. The score can then be calculated with the formula: score = 0.3 × (number of single words) + 0.4 × n + 0.3 × (number of non-root words), where n is the number of word segments in the scheme.
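The weighted score can be sketched as follows, assuming the formula is read as 0.3 × single-word count + 0.4 × segment count + 0.3 × non-root count. How a segment is classified is also an assumption here: a one-character segment is counted as a single word, and a longer segment missing from the root bank is counted as a non-root. The names `score_segmentation` and the `WEIGHT_*` constants are hypothetical.

```python
# Example weights from the text; other scenes may use different values.
WEIGHT_SINGLE = 0.3
WEIGHT_SEGMENTS = 0.4
WEIGHT_NON_ROOT = 0.3


def score_segmentation(segments, roots):
    """Score one segmentation scheme; a lower score means a better match."""
    root_set = set(roots)
    # Assumption: any one-character segment counts as a single word.
    singles = sum(1 for s in segments if len(s) == 1)
    # Assumption: longer segments not found in the root bank are non-roots.
    non_roots = sum(1 for s in segments if len(s) > 1 and s not in root_set)
    return (WEIGHT_SINGLE * singles
            + WEIGHT_SEGMENTS * len(segments)
            + WEIGHT_NON_ROOT * non_roots)


print(score_segmentation(["policyholder", "name"], ["policyholder", "name"]))
```

Under this reading, a scheme made of a few long roots scores lower than one that falls back to many single characters.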

S5, giving different weights to each element in the second matching result, extracting a root word in the second matching result, and multiplying the root word in the second matching result by the corresponding weight to obtain a second score;

The weighting in this step is the same as in step S4.

S6, if the first score is consistent with the second score, the first matching result or the second matching result is used as a word segmentation scheme for building a table, otherwise, the matching result with the low score in the first score and the second score is used as the word segmentation scheme for building the table.

Specifically, the matching result with the lower score is preferred as the word segmentation scheme; that is, a lower score indicates a higher degree of matching. The two scores are generally not equal; if they are, the situation is first checked, and if any error is reported, either matching result is taken as the word segmentation scheme for building the table.
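The selection rule in step S6 can be sketched as below. The function name `choose_segmentation` and the floating-point tolerance are assumptions for illustration; when the scores are equal the text allows either result, and this sketch simply returns the first one.

```python
def choose_segmentation(first, first_score, second, second_score,
                        tolerance=1e-9):
    """Pick the segmentation scheme used for building the table.

    Hypothetical helper: equal scores return the first result, otherwise
    the result with the lower score is returned.
    """
    if abs(first_score - second_score) <= tolerance:
        return first  # either result is acceptable when scores agree
    return first if first_score < second_score else second


forward = ["policyholder", "name"]
backward = ["policy", "holdername"]
print(choose_segmentation(forward, 0.8, backward, 1.1))
```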

In the embodiment, the matching results are compared in a forward matching mode and a reverse matching mode, so that the word segmentation can be accurately obtained to a greater extent, and the requirement of a user on the table can be quickly and accurately met.

Fig. 2 is a schematic diagram of the root bank generation process in an automatic table building method in an embodiment of the present application. As shown in the figure, step S1 of obtaining historical tabulation data, extracting data whose occurrence frequency is larger than a preset frequency threshold from the historical tabulation data as roots, and sorting all the roots to generate the root bank includes:

S11, obtaining a plurality of historical tabular data, and respectively cleaning the historical tabular data;

Data cleaning mainly handles four kinds of data in the table building data: missing values, abnormal values (outliers), duplicate data, and noise data. The present application mainly cleans abnormal values in the table building data, and the main methods are as follows:

1. Simple statistical analysis. This is done during exploratory data analysis (EDA) and can be implemented with the describe method of pandas alone; descriptive statistics of the data set reveal whether unreasonable values, namely abnormal values, exist.

2. The 3σ principle, i.e., outlier detection based on the normal distribution. Under the 3σ principle, an outlier is a value in a set of measurements that deviates from the mean by more than 3 standard deviations. If the data obeys a normal distribution, the probability of a value lying more than 3σ from the mean is P(|x − μ| > 3σ) ≤ 0.003, which is a very rare small-probability event. If the data does not follow a normal distribution, an outlier can still be described in terms of how many standard deviations it lies from the mean.

3. Model-based detection. A data model is first established, and anomalies are the objects that the model cannot fit well; if the model is a collection of clusters, an anomaly is an object that does not clearly belong to any cluster; when a regression model is used, anomalies are objects relatively far from the predicted values.
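The rule of flagging values that lie more than 3 standard deviations from the mean can be sketched with the Python standard library. The function name `three_sigma_outliers` and the sample data are hypothetical, and using the population standard deviation is an assumption, since the text does not say which estimator is intended.

```python
import statistics


def three_sigma_outliers(values):
    """Return the values lying more than 3 standard deviations from the mean."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation; the text does not specify.
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [v for v in values if abs(v - mean) > 3 * std]


# Hypothetical column of field lengths with one implausible entry.
field_lengths = [10] * 20 + [100]
print(three_sigma_outliers(field_lengths))
```

Values flagged this way would be removed or corrected before the occurrence counts in the next step are computed.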

S12, sorting the data after data cleaning according to the occurrence times of preset words, and taking the preset word with the largest occurrence time as a root word;

Specifically, the preset words are mainly business terms. For a table in the insurance field, for example, the business terms may be insurance-related terms such as premium and applicant. The roots are the basis of word segmentation in the table building process, and table building data can be effectively classified according to different roots.

And S13, summarizing all roots, sorting according to the occurrence frequency from most to least, and generating the root bank.

Sorting is generally performed from the highest occurrence count to the lowest, that is, the root with the most occurrences is placed first. For two or more roots with the same occurrence count, the initial letters or stroke counts of the roots are obtained, and those roots are sorted by initial letter or stroke count.

According to the embodiment, the data of the past table building data are cleaned, so that the effectiveness of the root word in the root word library is guaranteed.
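Steps S12 and S13 can be sketched as a count, filter, and sort over already cleaned terms. The function name `build_root_bank` and the sample terms are hypothetical; ties are broken alphabetically here as an assumed stand-in for sorting by initial letter or stroke count.

```python
from collections import Counter


def build_root_bank(terms, threshold):
    """Return roots ordered by occurrence count, most frequent first.

    Hypothetical helper: `terms` is assumed to be the cleaned list of
    preset words taken from historical table building data.
    """
    counts = Counter(terms)
    # Keep only terms that occur more often than the preset threshold.
    frequent = [(term, n) for term, n in counts.items() if n > threshold]
    # Sort by count descending; equal counts fall back to alphabetical order.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in frequent]


cleaned_terms = ["premium"] * 4 + ["insured"] * 3 + ["applicant"] * 3 + ["misc"]
print(build_root_bank(cleaned_terms, threshold=2))
```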

Fig. 3 is a schematic diagram illustrating a process of generating a first matching result in an automatic table creation method according to an embodiment of the present application, where as shown in the drawing, the step S2 of receiving table creation information input by a user, extracting a keyword in the table creation information, and performing positive direction matching between the keyword and a root in the root bank to obtain a first matching result includes:

S21, receiving a table building instruction input by a user, and acquiring table building information required by the user from the table building instruction, wherein the table building information comprises a table field and a table name;

Specifically, when a user needs to build a table, the user can send a table building instruction through a pre-generated table building operation interface. The interface comprises a first input box, a first option and a second option. The first input box is used by the user to enter the fields, table name, table type and the like of the table according to the type and purpose of the table to be built. A field of a table is a column name used to store data. For example, suppose a class has 3 students (Xiao Ye, Xiao Li and Xiao Wang). If the class is taken as the table name, each student's name is a field, so the table has 3 fields, and each field name is followed by its corresponding attribute. If, say, each student is given several apples, then the apples and their number correspond to the attribute of the field, that is, to the attribute of the data stored in the field. Table names can be defined by the developer according to the needs of the development project, and commonly used table types include page tables, segment tables, file allocation tables, tables in the file storage space, and the like.

S22, extracting key characters in the table fields, and splicing the key characters and the table names to obtain the key words;

Specifically, the key characters mainly refer to the characters that reflect the core content of the table, such as the students' names and the class name in the above example. Two splicing modes can be used when splicing the key characters and the table name: the key characters first and the table name after, or the table name first and the key characters after.
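The two splicing modes amount to choosing the concatenation order. In the minimal sketch below, the function name `splice_keyword`, the `name_first` flag, and the sample strings are all hypothetical.

```python
def splice_keyword(key_characters, table_name, name_first=True):
    """Join the key characters and the table name into one keyword string."""
    if name_first:
        return table_name + key_characters  # table name first
    return key_characters + table_name  # key characters first


print(splice_keyword("holder", "policy"))
print(splice_keyword("holder", "policy", name_first=False))
```

Whichever order is chosen, the same order would need to be used for both matching passes so that their results are comparable.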

S23, acquiring the number of characters of each root in the root bank, and taking the maximum value N of these numbers as a first number of characters; acquiring the number of characters of the keywords, and taking it as a second number of characters;

Specifically, the number of characters of the longest root in the root bank is found to be N, and the number of input Chinese characters is assumed to be L. If L is larger than N, the first N Chinese characters are intercepted as the field to be matched for root matching; otherwise, the whole Chinese character sequence is taken as the field to be matched and matched directly in the root bank. If the field to be matched exists among the roots, the matching succeeds. If it cannot be found in the root bank, the matching fails; the last character of the field to be matched is removed, and the remaining Chinese character sequence is used as the new field to be matched and matched in the root bank again. This continues until the matching succeeds, that is, a word is cut out, or until the length of the remaining string is 1, that is, a single word.
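The procedure described above corresponds to forward maximum matching, which can be sketched as follows. The function name `forward_max_match` and the sample roots are hypothetical, and repeating the procedure on the rest of the string after each cut is an assumption consistent with segmenting the whole keyword.

```python
def forward_max_match(text, roots):
    """Segment text from left to right, preferring the longest root each time."""
    root_set = set(roots)
    # N: the character count of the longest root in the root bank.
    max_len = max((len(r) for r in root_set), default=1)
    segments = []
    pos = 0
    while pos < len(text):
        # Take at most the next N characters as the field to be matched.
        size = min(max_len, len(text) - pos)
        # Remove the last character until a root matches or one is left.
        while size > 1 and text[pos:pos + size] not in root_set:
            size -= 1
        segments.append(text[pos:pos + size])
        pos += size
    return segments


sample_roots = ["policy", "policyholder", "holdername", "name"]
print(forward_max_match("policyholdername", sample_roots))
```

With these sample roots the left-to-right pass cuts off "policyholder" first, leaving "name" as the second segment.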

In the embodiment, the first matching result of the forward matching is obtained through character number comparison, so that the accuracy of the comparison of the matching results is ensured.

In an embodiment, the S3, performing inverse direction matching on the keyword and the root in the root bank to obtain a second matching result, including:

In one embodiment, said giving different weights to elements in said first matching result comprises:

In one embodiment, after the performing the table-building with the first matching result or the second matching result as a word segmentation scheme, the method further comprises:

According to the embodiment, whether the scoring is consistent or not can be checked, and the error of the word segmentation scheme caused by parameter error can be effectively prevented through the verification of the table building main key, so that the error of table building is greatly avoided.

The technical features mentioned in any of the above corresponding embodiments or implementations are also applicable to the embodiment corresponding to fig. 4 in the present application, and the details of the subsequent similarities are not repeated.

The above description is directed to an automatic table building method, and the following description is directed to an apparatus for performing the automatic table building.

Fig. 4 is a block diagram of an automatic table creation apparatus, which can be applied to automatic table creation. The automatic table creation apparatus in the embodiment of the present application can implement the steps corresponding to the automatic table creation method executed in the embodiment corresponding to fig. 1. The function realized by the automatic table building device can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions, which may be software and/or hardware.

In one embodiment, an automatic table building apparatus is provided, as shown in fig. 4, including the following modules:

the root library module 10 is configured to acquire the historical tabulation data, extract data with the occurrence frequency larger than a preset frequency threshold value from the historical tabulation data as roots, sort all the roots and generate a root library;

the first matching result module 20 is configured to receive table building information input by a user, extract keywords in the table building information, and perform positive direction matching on the keywords and roots in the root bank to obtain a first matching result;

a second matching result module 30, configured to perform inverse direction matching on the keyword and the root in the root bank to obtain a second matching result;

a first scoring module 40 configured to give different weights to each element in the first matching result, extract a root word in the first matching result, and multiply the root word in the first matching result with the corresponding weight to obtain a first score;

a second scoring module 50 configured to assign different weights to each element in the second matching result, extract a root word in the second matching result, and multiply the root word in the second matching result with the corresponding weight to obtain a second score;

and the table building module 60 is configured to build a table by using the first matching result or the second matching result as a word segmentation scheme if the first score is consistent with the second score, and otherwise, build a table by using a matching result with a low score in the first score and the second score as a word segmentation scheme.

In one embodiment, the first matching result module is further configured to:

In one embodiment, a computer device is provided, the computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions, which, when executed by the processor, cause the processor to perform the steps of the automatic table building method in the above embodiments.

In one embodiment, a storage medium is provided that stores computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the steps of the automatic table creation method in the above embodiments. Wherein the storage medium may be a non-volatile storage medium.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-described embodiments illustrate only some embodiments of the present application. Although they are described in some detail, they are not to be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An automatic table building method, comprising:

2. The automatic table building method according to claim 1, wherein the obtaining of the past table building data, extracting data with occurrence frequency greater than a preset frequency threshold from the past table building data as a root word, and generating a root word library after sorting all the root words comprises:

3. The automatic table building method of claim 1, wherein the receiving table building information input by a user, extracting keywords in the table building information, and performing positive direction matching between the keywords and the roots in the root bank to obtain a first matching result comprises:

4. The automatic table building method according to claim 3, wherein said obtaining a second matching result after performing inverse direction matching on said keyword and a root in said root bank comprises:

5. The method of claim 1, wherein said assigning different weights to elements in said first match result comprises:

6. The automatic table creation method according to any one of claims 1 to 5, wherein after the table creation is performed with the first matching result or the second matching result as a word segmentation scheme, the method further comprises:

7. An automatic table building device is characterized by comprising the following modules:

8. The automatic table building apparatus of claim 7, wherein the first matching result generation module is further configured to:

9. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, cause the processor to perform the steps of the method of automatically creating tables according to any one of claims 1 to 6.

10. A storage medium having computer-readable instructions stored thereon which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of automatically creating tables of any of claims 1 to 6.