CN113761185A

CN113761185A - Primary key extraction method, device and storage medium

Info

Publication number: CN113761185A
Application number: CN202110012338.9A
Authority: CN
Inventors: 陈伯梁
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-01-06
Filing date: 2021-01-06
Publication date: 2021-12-07

Abstract

The embodiments of the present application provide a primary key extraction method, device and storage medium, applied to the field of big data. The method comprises: obtaining a target table in response to receiving a primary key extraction request from a client, wherein the primary key extraction request comprises identification information of the target table; performing line number reduction processing on the target table to obtain a first sub-table of the target table; combining the fields in the first sub-table, and if the data size of the first sub-table after de-duplication based on a field combination is equal to the data size of the target table after de-duplication, determining the field combination as a primary key of the target table; and sending the primary key to the client. In this way, the primary key of a table can be extracted quickly, which improves the efficiency of determining the primary key.

Description

Primary key extraction method, device and storage medium

Technical Field

The present application relates to the field of big data, and in particular, to a method, device, and storage medium for extracting a primary key.

Background

In general, a primary key is not set when a table is created in a data warehouse. However, when the table is used, for example when the content of a single table needs to be understood (e.g., for business analysis) or when an association between multiple tables needs to be found, a data developer analyzes the content of the single table or of the multiple tables based on the primary key. Therefore, how to determine the primary key of a table is particularly important.

In the process of implementing the present application, the inventor found that the prior art has at least the following problem: determining whether certain fields in a table form a primary key requires manual verification by de-duplication. A data developer who does not know the table cannot determine the fields from experience and has to guess and verify repeatedly, which makes determining the primary key inefficient.

Disclosure of Invention

The embodiment of the application provides a method, equipment and a storage medium for extracting a primary key of a table.

In a first aspect, an embodiment of the present application provides a primary key extraction method, including: obtaining a target table in response to receiving a primary key extraction request from a client, wherein the primary key extraction request comprises identification information of the target table; performing line number reduction processing on the target table to obtain a first sub-table of the target table; combining the fields in the first sub-table, and if the data size of the first sub-table after de-duplication based on a field combination is equal to the data size of the target table after de-duplication, determining the field combination as a primary key of the target table; and sending the primary key to the client.

In a possible implementation, performing line number reduction processing on the target table to obtain the first sub-table of the target table may include: determining the information entropy of each row of data in the target table; performing row clustering on the rows of data in the target table by using a first clustering algorithm; and extracting, from each row cluster, the row data whose information entropy satisfies a preset condition, to obtain the first sub-table.

In a possible implementation, extracting, from each row cluster, the row data whose information entropy satisfies the preset condition to obtain the first sub-table may include: extracting, from each row cluster, the row data whose information entropy is greater than a preset information entropy, to obtain the first sub-table; or sorting the row data in each row cluster by information entropy from high to low and extracting the row data whose information entropy ranks within the top preset number, to obtain the first sub-table, wherein the preset number is determined according to the data size of the target table and the number of row cluster categories.

In a possible implementation manner, the larger the difference between the data size of the target table and the number of categories of the row cluster is, the larger the preset number is.

In a possible embodiment, combining fields in the first sub-table may include: performing column clustering on each column of data in the first sub-table by adopting a second clustering algorithm; and combining fields among the categories in each column of clusters.

In a possible implementation, the performing the number-of-lines reduction process on the target table to obtain the first sub-table of the target table may include: determining a dimension field contained in the target table according to the field information of each field in the target table; extracting data correspondingly stored in the dimension field in the target table to obtain a second sub-table; and performing line number reduction processing on the second sub-table to obtain a first sub-table of the target table.

In a possible implementation manner, determining the dimension field included in the target table according to the field information of each field in the target table may include: traversing each field in the target table, and obtaining a classification result of whether the field is a dimension field according to field information of the field and a classification model, wherein the classification model is used for identifying whether the field is the dimension field; and extracting the fields of which the classification results are dimension fields to obtain the dimension fields contained in the target table.

In one possible embodiment, obtaining a classification result of whether a field is a dimension field according to field information of the field and a classification model may include: mapping field information of the fields into word vectors by adopting a preset mapping algorithm; and inputting the word vector into a classification model to obtain a classification result of whether the field is a dimension field.

In a possible implementation manner, mapping field information of a field to a word vector by using a preset mapping algorithm may include at least one of the following:

if the field information contains a field name, mapping the field name into a first vector by using a term frequency-inverse document frequency (TF-IDF) algorithm;

if the field information contains a field type, mapping the field type into a second vector by using the TF-IDF algorithm;

and if the field information contains a field description, mapping the field description into a third vector by using a word2vec algorithm.

The word vector is obtained by splicing at least one of the first vector, the second vector and the third vector. The preset mapping algorithm may include the TF-IDF algorithm and/or the word2vec algorithm.
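As a rough illustration of the mapping described above, the following Python sketch computes TF-IDF vectors for field names and field types and splices them into one vector per field. It is only a sketch under assumptions that the application does not state: field names are tokenized on underscores, the unsmoothed weight tf * log(N / df) is used, and the word2vec vector for the field description is left out because it needs a trained model. The helper name tfidf_vectors is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Map each token list to a TF-IDF vector over the shared vocabulary."""
    vocab = sorted({tok for doc in documents for tok in doc})
    n_docs = len(documents)
    # Document frequency: in how many token lists each token occurs.
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append(
            [(tf[tok] / len(doc)) * math.log(n_docs / doc_freq[tok]) for tok in vocab]
        )
    return vocab, vectors

# Field names and types taken from a few fields of the order table.
names = ["sale_ord_id", "item_sku_id", "gmv"]
types = ["string", "string", "double"]

# Assumption: underscores separate the "words" of a field name.
name_vocab, name_vecs = tfidf_vectors([n.split("_") for n in names])
type_vocab, type_vecs = tfidf_vectors([[t] for t in types])

# Splice the first vector (name) and the second vector (type) per field.
word_vectors = [nv + tv for nv, tv in zip(name_vecs, type_vecs)]
print(len(word_vectors), len(word_vectors[0]))
```

The spliced vectors could then be passed to whatever classification model is used to decide whether a field is a dimension field.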

In a second aspect, an embodiment of the present application provides a primary key extraction device, including:

the acquisition module is used for responding to a received primary key extraction request from a client to acquire a target table, wherein the primary key extraction request comprises identification information of the target table;

the processing module is used for performing line number reduction processing on the target table to obtain a first sub-table of the target table;

the verification module is used for combining the fields in the first sub-table, and if the data volume of the first sub-table after de-duplication based on a field combination is equal to the data volume of the target table after de-duplication, determining the field combination as a primary key of the target table;

and the sending module is used for sending the primary key to the client.

In one possible implementation, the processing module may include:

the first determining submodule is used for determining the information entropy of each row of data in the target table;

the clustering submodule is used for carrying out row clustering on data of each row in the target table by adopting a first clustering algorithm;

and the first extraction submodule is used for extracting the row data of each row cluster, wherein the corresponding information entropy of each row data meets the preset condition, so that a first sub-table is obtained.

In a possible implementation, the first extraction submodule may be specifically configured to: extract, from each row cluster, the row data whose information entropy is greater than a preset information entropy, to obtain the first sub-table; or sort the row data in each row cluster by information entropy from high to low and extract the row data whose information entropy ranks within the top preset number, to obtain the first sub-table, wherein the preset number is determined according to the data size of the target table and the number of row cluster categories.

In a possible implementation manner, when the verification module combines fields in the first sub-table, the verification module may specifically be configured to: performing column clustering on each column of data in the first sub-table by adopting a second clustering algorithm; and combining fields among the categories in each column of clusters.

In one possible implementation, the processing module may include:

the second determining submodule is used for determining the dimension fields contained in the target table according to the field information of each field in the target table;

the second extraction submodule is used for extracting data correspondingly stored in the dimension field in the target table to obtain a second sub-table;

and the processing sub-module is used for performing line number reduction processing on the second sub-table to obtain a first sub-table of the target table.

In a possible implementation, the second determining submodule may be specifically configured to: traversing each field in the target table, and obtaining a classification result of whether the field is a dimension field according to field information of the field and a classification model, wherein the classification model is used for identifying whether the field is the dimension field; and extracting the fields of which the classification results are dimension fields to obtain the dimension fields contained in the target table.

In a possible implementation manner, when the second determining sub-module is configured to obtain, according to the field information of the field and the classification model, a classification result of whether the field is a dimension field, the second determining sub-module may be specifically configured to: mapping field information of the fields into word vectors by adopting a preset mapping algorithm; and inputting the word vector into a classification model to obtain a classification result of whether the field is a dimension field.

In a possible implementation manner, when the second determining sub-module is configured to map the field information of the field into the word vector by using a preset mapping algorithm, the second determining sub-module may be specifically configured to at least one of:

In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the processor implements the method according to any one of the first aspect when executing the computer program.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on an electronic device, the electronic device is caused to perform the method according to any one of the first aspect.

In a fifth aspect, the present application provides a computer program product, which includes a computer program, when the computer program runs on an electronic device, causes the electronic device to execute the method according to any one of the first aspect.

According to the primary key extraction method, device and storage medium provided by the present application, a target table is obtained in response to receiving a primary key extraction request from a client, wherein the primary key extraction request comprises identification information of the target table; line number reduction processing is performed on the target table to obtain a first sub-table of the target table; the fields in the first sub-table are combined, and if the data size of the first sub-table after de-duplication based on a field combination is equal to the data size of the target table after de-duplication, the field combination is determined as a primary key of the target table; and the primary key is sent to the client. Performing line number reduction processing on the target table greatly reduces the amount of computation in the primary key extraction process, so that the primary key of the table can be extracted quickly, which improves the efficiency of determining the primary key and, in turn, the working efficiency of the relevant personnel.

These and other aspects of the present application will be more readily apparent from the following description of the embodiment(s).

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

Fig. 1 is a schematic diagram of an application scenario of a primary key extraction method provided in the present application;

Fig. 2 is a schematic flowchart of a primary key extraction method according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of line number reduction processing according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of field combination according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of a primary key extraction method according to another embodiment of the present application;

Fig. 6 is a schematic structural diagram of a primary key extraction device according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a primary key extraction device according to another embodiment of the present application;

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to explain the inventive concept to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

First, for the sake of understanding, terms are explained in this application:

A data warehouse (Data Warehouse, abbreviated DW or DWH) is a structured data environment that serves as the data source for decision support systems and online analytical applications. Data warehouses study and solve the problem of obtaining information from databases. In today's environment of information technology and data intelligence, the data warehouse offers many economical and efficient computing resources across software and hardware, Internet and intranet solutions, and databases; it can store very large amounts of data for analysis and allows multiple data access technologies to be used.

A primary key is one or more fields in a table whose values uniquely identify a record in the table. In a relationship between two tables, the primary key is used to reference a particular record of one table from the other table. The primary key is mainly used for foreign key association with other tables and for modifying and deleting records.

A table is, for example, a Hive table. Hive is a Hadoop-based data warehouse tool used for data extraction, transformation and loading; it provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop. In the embodiments of the present application, a table may have one or more primary keys. For example, the tables mentioned in the embodiments may be tables in a big data scenario, such as an order table, an after-sales table, a user table, a merchant table and other business tables.

A dimension field describes a certain characteristic of a thing or phenomenon. Non-numerical fields such as gender, region and time are generally selected as dimension fields, and they are generally the fields used to associate tables.

Information entropy is the average amount of information that remains after redundancy has been removed from the information.

In general, databases such as MySQL have primary keys set, but in a data warehouse a primary key is mostly not set when a table is created. During table association, manual de-duplication is then needed to verify whether a field is a primary key, which is very inconvenient for data developers.

When the primary key is determined manually, a developer who is unfamiliar with the table cannot pick the fields from experience and has to guess and verify repeatedly. For example, if multiple tables need to be associated or a single table needs to be analyzed during data development, a large amount of time is wasted searching for the primary key. If a table contains n dimension fields and no experience is available, every non-empty combination of those fields (2^n - 1 combinations in total) may have to be verified, and each verification runs over the full data, which greatly reduces the working efficiency of data developers.
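The cost of exhaustive guessing can be made concrete: n fields admit 2^n - 1 non-empty field combinations, and each of them would need its own de-duplication pass over the full table. The sketch below simply enumerates the combinations to confirm that count; the function name is hypothetical.

```python
from itertools import combinations

def count_field_combinations(n):
    """Count the non-empty combinations of n fields by enumerating them."""
    return sum(
        1
        for size in range(1, n + 1)
        for _ in combinations(range(n), size)
    )

for n in (4, 8, 12):
    # The enumerated count always matches the closed form 2**n - 1.
    print(n, count_field_combinations(n), 2**n - 1)
```

Because the number of candidates grows exponentially with the number of fields, reducing the rows touched by each check (and, where possible, the fields considered) is what keeps the search tractable.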

In view of the foregoing problems, the present application provides a method, an apparatus, and a storage medium for extracting a primary key, which improve the calculation efficiency of primary key extraction by reducing the amount of data included in a table.

Exemplarily, fig. 1 is a schematic view of an application scenario of the primary key extraction method provided in the present application. As shown in fig. 1, the application scenario may include: at least one client (three clients are shown in fig. 1, client 111, client 112, client 113), network 12, and server 13. Wherein each client and server 13 may communicate over network 12.

Illustratively, in practical applications, when a user such as a data developer triggers extraction of the primary key of a table through the client 111, the client 111 sends a primary key extraction request to the server 13 through the network 12, wherein the primary key extraction request contains identification information of a target table. Correspondingly, the server 13 receives the primary key extraction request, acquires the target table according to the identification information carried in the request, and then performs line number reduction processing on the target table to obtain a first sub-table that contains less data than the target table. Finally, the server combines the fields in the first sub-table; if the data volume of the first sub-table after de-duplication based on a field combination is equal to the data volume of the target table after de-duplication, the field combination is determined to be a primary key of the target table. The primary key is fed back to the client 111 through the network 12, and the client 111 displays the relevant information of the primary key to the user, so that the user can analyze the table or associate tables based on the primary key.

It should be noted that Fig. 1 is only a schematic diagram of an application scenario provided by the embodiments of the present application. The embodiments do not limit the devices included in Fig. 1, nor the positional relationship between those devices. For example, the application scenario illustrated in Fig. 1 may also include a data storage device, which may be an external memory with respect to the server 13 or an internal memory integrated in the server 13. The server 13 may be an independent server, a service cluster, or the like.

The technical solution of the present application will be described in detail below with reference to specific examples. It should be noted that the following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.

Fig. 2 is a schematic flowchart of a primary key extraction method according to an embodiment of the present application. The embodiment provides a primary key extraction method applied to a primary key extraction device, which can be implemented in software and/or hardware. Optionally, in the scenario shown in Fig. 1, the primary key extraction device may be integrated in a server, for example as a chip or a circuit in the server; alternatively, the primary key extraction device is the server itself. The description below takes the server as the execution subject.

As shown in fig. 2, the primary key extraction method includes the following steps:

S201, in response to receiving a primary key extraction request from a client, acquiring a target table, wherein the primary key extraction request comprises identification information of the target table.

In practical application, when a data developer and other related personnel have the requirement of acquiring the primary key of the table, corresponding operation is executed at a client, the client responds to the operation and sends a primary key extraction request to a server, and the server is triggered to execute the primary key extraction process.

Accordingly, the server receives the primary key extraction request and responds thereto. Specifically, the server acquires the identification information of the target table contained in the primary key extraction request by analyzing the primary key extraction request; based on the identification information, the server may retrieve the target table with the identification information from an internal storage medium or an external data source. The target table, i.e. the table whose primary key is to be determined, is an object of the primary key extraction performed by the server in the embodiment of the present application.
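A minimal sketch of this step is shown below. The request format (a dictionary with a "table_id" key), the in-memory table_store standing in for the internal storage medium or external data source, and the function name acquire_target_table are all assumptions made for illustration; the application does not prescribe them.

```python
def acquire_target_table(request, table_store):
    """Parse the request and fetch the table named by its identification information."""
    table_id = request.get("table_id")
    if table_id is None:
        raise ValueError("request carries no table identification information")
    if table_id not in table_store:
        raise LookupError(f"unknown table: {table_id}")
    return table_store[table_id]

# Hypothetical stand-in for the storage that holds the tables.
table_store = {
    "order_table": {
        "fields": ["sale_ord_id", "item_sku_id", "gmv"],
        "rows": [(1, 1, 834.5), (1, 1, 454.3), (5, 5, 234.0)],
    }
}

target = acquire_target_table({"table_id": "order_table"}, table_store)
print(target["fields"])
```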

It will be appreciated that the target table typically contains a plurality of fields, each corresponding to its own field information and corresponding stored data. In general, the field information may include field name, field type, field length, and field description, and in this embodiment, the field information may be at least one of the above information.

As an example, the embodiments of the present application are described with the order table shown in Table 1 as the target table. Referring to Table 1, the order table includes 8 fields, whose field names are: sale_ord_id, item_sku_id, dept_id_1, dept_id_2, cate_1, cate_2, gmv and ord_num. Each field has its own stored data; for example, the column with the field name sale_ord_id stores the data for that field, such as 1, 5 and 9. It should be understood that Table 1 is only used to illustrate the meaning of each part, and the fields contained in the order table and the corresponding stored data are not limited to Table 1.

TABLE 1 Order table

sale_ord_id	item_sku_id	dept_id_1	dept_id_2	cate_1	cate_2	gmv	ord_num
1	1	1	2	2	2	834.5	453
1	1	2	2	2	3	454.3	65
1	1	1	3	2	2	45	876
5	5	6	6	7	7	234	76
5	5	5	6	7	7	3	54
5	6	5	6	7	6	76	4
5	5	5	6	7	7	65	32
9	8	9	9	8	8	87	76
9	9	8	9	9	8	987	87
9	9	8	8	8	8	765	98
9	9	8	9	8	8	5443	98

The field information of each field in table 1 may be as shown in table 2:

TABLE 2

Field name	Field type	Field description
sale_ord_id	string	Order number
item_sku_id	string	Commodity number
dept_id_1	string	First-level department number
dept_id_2	string	Second-level department number
cate_1	string	First-level category number
cate_2	string	Second-level category number
gmv	double	Sales amount
ord_num	int	Order quantity

S202, line number reduction processing is carried out on the target table to obtain a first sub-table of the target table.

In a big data scenario, a table contains a large amount of data, so this step aims to reduce the amount of data in the table and obtain a sub-table of the target table. To distinguish it from other sub-tables, it is called the "first sub-table" here; "first" says nothing about how much data the sub-table contains, but the data amount of the first sub-table is smaller than that of the target table.

With regard to the line number reduction processing, it can be understood that the effect of reducing the amount of data contained in the target table is achieved by deleting a line in the target table. Reference may be made to the following embodiments for specific implementations of the line number reduction process.

S203, combining the fields in the first sub-table, and if the data size of the first sub-table after de-duplication based on a field combination is equal to the data size of the target table after de-duplication, determining the field combination as a primary key of the target table.

The fields in the first sub-table are combined to obtain a plurality of field combinations. Illustratively, if the first sub-table contains field a, field b, field c and field d, combining these four fields yields the following field combinations:

(a), (b), (c), (d);

(a, b), (a, c), (a, d), (b, c), (b, d), (c, d);

(a, b, c), (a, b, d), (a, c, d), (b, c, d);

(a, b, c, d).

Each field combination is then verified as to whether it is a primary key. For example, the data size after de-duplication on the field combination (a) is compared with the data size of the first sub-table after de-duplication; if the two are the same, the field combination (a) is determined to be a primary key. As another example, the data size after de-duplication on the field combination (a, b) is compared with the data size of the first sub-table after de-duplication; if the two are the same, the field combination (a, b) is determined to be a primary key, namely a joint primary key.
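The enumeration and check described here can be sketched in Python as follows. This is a minimal illustration rather than the claimed implementation: the sub-table is assumed to be a list of tuples, the four fields a, b, c and d carry made-up values, and the function name candidate_keys is hypothetical. Every combination whose de-duplicated size matches that of the de-duplicated sub-table is reported, with no attempt to keep only minimal keys.

```python
from itertools import combinations

def candidate_keys(fields, rows):
    """Return every field combination that identifies the distinct rows uniquely."""
    total = len(set(rows))  # size of the sub-table after full-row de-duplication
    keys = []
    for size in range(1, len(fields) + 1):
        for combo in combinations(range(len(fields)), size):
            # De-duplicate the sub-table projected onto this field combination.
            projected = {tuple(row[i] for i in combo) for row in rows}
            if len(projected) == total:
                keys.append(tuple(fields[i] for i in combo))
    return keys

fields = ["a", "b", "c", "d"]
rows = [  # hypothetical first sub-table
    (1, 1, "x", 10),
    (1, 2, "x", 10),
    (2, 1, "y", 10),
    (2, 2, "y", 10),
]

keys = candidate_keys(fields, rows)
print(keys)
```

In this example no single field is a key, while (a, b) and (b, c) are joint keys, as is every larger combination that contains one of them.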

S204, sending the primary key to the client.

Specifically, the server sends the primary key to the client. The primary key may be represented by the field names of the fields it contains, by the positions of those fields in the target table, and so on, as set according to actual requirements.

The client displays the primary key for reference by the relevant personnel. The embodiments of the present application do not limit the display mode used by the client. For example, the client may display the primary key in a color that distinguishes it from the other fields.

In the primary key extraction method provided by the embodiments of the present application, a target table is obtained in response to receiving a primary key extraction request from a client, wherein the primary key extraction request comprises identification information of the target table; line number reduction processing is performed on the target table to obtain a first sub-table of the target table; the fields in the first sub-table are combined, and if the data size of the first sub-table after de-duplication based on a field combination is equal to the data size of the target table after de-duplication, the field combination is determined as a primary key of the target table; and the primary key is sent to the client. Performing line number reduction processing on the target table greatly reduces the data volume handled during primary key extraction, so that the primary key of the table can be extracted quickly, which improves the efficiency of primary key extraction and the working efficiency of the relevant personnel.

Next, how to perform the line number reduction processing is explained by the following embodiment.

As shown in fig. 3, in S202 shown in fig. 2, performing the number-of-lines reduction process on the target table to obtain a first sub-table of the target table, may further include:

S301, determining the information entropy of each row of data in the target table.

In the target table, the data of each instance is stored as a row. Taking Table 1 again as an example, the first row stores the data [1,1,1,2,2,2,834.5,453] of the fields and represents one instance.

Each row of data in the target table is traversed, and the information entropy of that row is calculated.
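The application does not spell out how the information entropy of a row is computed. One plausible reading, assumed here purely for illustration, is the Shannon entropy (in bits) of the distribution of values that occur within the row. The function name row_entropy is hypothetical.

```python
import math
from collections import Counter

def row_entropy(row):
    """Shannon entropy, in bits, of the empirical distribution of values in one row."""
    counts = Counter(row)
    n = len(row)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# First row of Table 1.
first_row = [1, 1, 1, 2, 2, 2, 834.5, 453]
print(round(row_entropy(first_row), 4))

# A row whose values are all identical carries no information under this reading.
print(row_entropy([7, 7, 7, 7]) == 0)
```

Under this assumption, rows with more varied values score higher and are therefore more likely to be kept when rows are later selected by entropy.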

S302, carrying out row clustering on data of each row in the target table by adopting a first clustering algorithm.

The present application does not limit the specific type of the first clustering algorithm. For example, the first clustering algorithm may include, but is not limited to, a density-based clustering algorithm such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), or a partition-based clustering algorithm such as the K-Means clustering algorithm (K-Means for short). Whereas K-Means is only suitable for clustering convex sample sets, DBSCAN is suitable for both convex and non-convex sample sets.

Through row clustering, similar rows of data in the target table are grouped together, so that the number of rows can be reduced by executing S303.
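To show what row clustering produces, the sketch below runs a bare-bones K-Means over numeric rows. It is an illustrative stand-in only: the application leaves the choice of algorithm open, and the Euclidean distance, the use of the first k rows as initial centroids, the fixed iteration count and the sample rows are all assumptions.

```python
def kmeans(points, k, iterations=10):
    """Plain K-Means on numeric rows; the first k rows seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each row to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its member rows.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical rows: two visibly separate groups.
rows = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = kmeans(rows, k=2)
print(labels)
```

Each label identifies the row cluster a row belongs to; rows that share a label are treated as similar when the number of rows is reduced.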

It should be noted that the execution order of S301 and S302 is not limited in the embodiments of the present application. That is, the server may execute these two steps sequentially, for example, execute S301 and then execute S302, or execute S302 and then execute S301; alternatively, the server may execute S301 and S302 in parallel.

S303, extracting the row data of which the corresponding information entropy meets the preset condition in each row cluster to obtain a first sub-table.

Since the row data in each row cluster are similar, the data amount of the target table can be reduced by removing similar row data. Therefore, in the target table, for each row cluster, the first sub-table is obtained by extracting the row data whose information entropy satisfies a preset condition. Equivalently, for each row cluster in the target table, the row data whose information entropy does not satisfy the preset condition are deleted, or the row data whose information entropy satisfies the preset condition are kept, to obtain the first sub-table.

This embodiment obtains a first sub-table with a smaller data size by stratified sampling.

Optionally, the manner of obtaining the first sub-table differs depending on the preset condition. Specifically:

In one implementation, the preset condition is that the information entropy is greater than a preset information entropy. In this case, extracting the row data in each row cluster whose corresponding information entropy satisfies the preset condition to obtain the first sub-table may include: extracting the row data in each row cluster whose corresponding information entropy is greater than the preset information entropy to obtain the first sub-table. The value of the preset information entropy can be set according to actual requirements.

In another implementation, the preset condition is that the information entropy ranks among the top preset number of information entropies. In this case, extracting the row data in each row cluster whose corresponding information entropy satisfies the preset condition to obtain the first sub-table may include: sorting the row data in each row cluster in descending order of the corresponding information entropy, and extracting the row data corresponding to the top preset number of information entropies to obtain the first sub-table. The preset number is determined according to the data size of the target table and the number of categories of the row clusters.

Optionally, the larger the difference between the data size of the target table and the number of categories of the row cluster is, the larger the preset number is. For example, when the difference is larger than a preset value, the preset number may be set larger; when the difference is smaller than the preset value, the preset number may be set smaller.
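The two preset conditions described above can be sketched as follows. This is an illustrative sketch only: the function name `select_rows`, its parameters `threshold` and `top_n`, and the toy entropy values are assumptions, and the description does not say how ties between equal entropies are broken (here the earlier row wins).

```python
from collections import defaultdict


def select_rows(entropies, clusters, threshold=None, top_n=None):
    """Return the sorted indices of the rows kept for the first sub-table.

    Pass exactly one of:
      threshold: keep rows whose entropy is greater than this value;
      top_n:     keep the top_n highest-entropy rows of every row cluster.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("pass exactly one of threshold or top_n")

    # Group row indices by their row-cluster label.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(clusters):
        by_cluster[label].append(idx)

    kept = []
    for members in by_cluster.values():
        if threshold is not None:
            kept.extend(i for i in members if entropies[i] > threshold)
        else:
            # sorted() is stable, so equal entropies keep their row order.
            ranked = sorted(members, key=lambda i: entropies[i], reverse=True)
            kept.extend(ranked[:top_n])
    return sorted(kept)


entropies = [1.0, 1.46, 1.46, 1.58, 1.46, 0.92]
clusters = [1, 1, 1, 2, 2, 2]
print(select_rows(entropies, clusters, threshold=1.2))  # [1, 2, 3, 4]
print(select_rows(entropies, clusters, top_n=1))        # [1, 3]
```

Because the selection is made per row cluster, every cluster contributes rows whenever the top-n condition is used, whereas the threshold condition can leave a low-entropy cluster without any representative.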

On the basis of the above embodiment, further, the number of columns of the table can be reduced, so as to further reduce the data volume in the primary key extraction process.

In some embodiments, as shown in fig. 4, combining fields in the first sub-table may include:

S401, performing column clustering on each column of data in the first sub-table by adopting a second clustering algorithm.

The second clustering algorithm and the first clustering algorithm may be the same clustering algorithm, or the two may be different clustering algorithms. For example, the first clustering algorithm and the second clustering algorithm may both be DBSCAN.

By the column clustering process, similar column data in the first sub-table can be clustered together to perform reduction of the number of columns of the table by performing S402.

S402, combining fields among the categories in each column of clusters.

In this step, each field combination is formed by combining fields selected from the categories according to the combination principle, and combinations made up only of fields within a single category are excluded.

This embodiment considers that fields within the same category are relatively similar, so a combination of such fields is less likely to be a primary key, whereas fields from different categories differ more, so a combination of such fields is more likely to be a primary key. Therefore, after the field combinations within a category are removed, the amount of calculation is greatly reduced.

For example, taking the example that the first sub-table contains four fields, namely, a field a, a field b, a field c, and a field d, two categories are obtained after column clustering is performed on each column of data in the first sub-table: a category containing field a, field b, and field c, and a category containing field d. And combining fields among all categories to obtain the following field combinations:

(a), (b), (c), (d);

(a,d), (b,d), (c,d);

(a,b,d), (a,c,d), (b,c,d);

(a,b,c,d).

It is easy to see that the above field combinations do not include the combinations inside a category, namely (a,b), (b,c), (a,c) and (a,b,c).
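The enumeration in the example above can be sketched as follows. The rule implemented here is inferred from the example rather than stated as a formula in the description: single fields are kept, and a combination of two or more fields is kept only if its fields come from at least two categories. The function name `cross_category_combinations` is hypothetical.

```python
from itertools import combinations


def cross_category_combinations(categories):
    """Enumerate field combinations, skipping multi-field combinations
    whose fields all belong to one column-cluster category."""
    category_of = {
        field: k for k, fields in enumerate(categories) for field in fields
    }
    all_fields = [field for fields in categories for field in fields]

    result = []
    for size in range(1, len(all_fields) + 1):
        for combo in combinations(all_fields, size):
            spanned = {category_of[field] for field in combo}
            # Keep single fields; otherwise require at least two categories.
            if size == 1 or len(spanned) > 1:
                result.append(combo)
    return result


combos = cross_category_combinations([["a", "b", "c"], ["d"]])
for combo in combos:
    print(combo)
```

For the four fields of the example this yields 11 candidate combinations instead of the 15 non-empty subsets, and the saving grows quickly as categories contain more fields.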

In this application, joint primary keys are combined by means of dimension clustering during primary key extraction, and invalid combinations of dimensions within the same category are removed, which greatly improves the extraction efficiency of joint primary keys.

Fig. 5 is a schematic flowchart of a primary key extraction method according to another embodiment of the present application. In this embodiment, the primary key extraction method may include:

S500, receiving a primary key extraction request from a client.

Wherein, the primary key extraction request comprises the identification information of the target table.

S501, acquiring a target table.

The specific implementation of S500 and S501 is similar to S201, and is not described here again.

S502, determining the dimension fields contained in the target table according to the field information of each field in the target table.

Optionally, the step may comprise: traversing each field in the target table, and obtaining a classification result of whether the field is a dimension field according to field information of the field and a classification model, wherein the classification model is used for identifying whether the field is the dimension field; and extracting the fields of which the classification results are dimension fields to obtain the dimension fields contained in the target table.

After the target table is obtained, the fields it contains are known, and the dimension fields contained in the target table are determined according to the field information of those fields. The classification model is a model that has been trained in advance and satisfies a convergence condition. During training, each field is manually labeled as to whether it is a dimension field; for example, 1 indicates that the field is a dimension field, and 0 indicates that it is not a dimension field (a non-dimension field). Accordingly, if the output of the classification model is 1, the corresponding field is determined to be a dimension field; if the output of the classification model is 0, the corresponding field is determined to be a non-dimension field.

Usually, before classifying a field using a classification model, a parameterization process is required for the field information, i.e. converting the field information into a numerical value. Therefore, as a possible implementation manner, obtaining a classification result of whether a field is a dimension field according to the field information of the field and the classification model may include: mapping field information of the fields into word vectors by adopting a preset mapping algorithm; and inputting the word vector into a classification model to obtain a classification result of whether the field is a dimension field.

The mapping of the field information of the field into the word vector by using a preset mapping algorithm may include at least one of the following:

1) If the field information contains the field name, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is adopted to map the field name into a first vector.

The TF-IDF algorithm is a commonly used weighting technique in information retrieval and data mining for evaluating the importance of a word to a document in a document set or corpus.

2) If the field information contains the field type, the TF-IDF algorithm is adopted to map the field type into a second vector.

3) If the field information contains the field description, the word2vec algorithm is adopted to map the field description into a third vector.

The word2vec algorithm is a group of related models for generating word vectors; these models may be, for example, neural networks.
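The TF-IDF mapping in items 1) and 2) can be illustrated with a small sketch. The description does not specify the TF-IDF variant or how a field name is tokenized, so the following assumes the plain form tf × ln(N / df) and splits field names on underscores; the function name `tfidf_vectors` and both choices are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each tokenized document to a TF-IDF vector over a shared vocabulary.

    tf  = occurrences of the token in the document / document length
    idf = ln(number of documents / number of documents containing the token)
    """
    vocabulary = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[token] / len(doc)) * math.log(n_docs / doc_freq[token])
            for token in vocabulary
        ])
    return vocabulary, vectors


# Assumed tokenization: split each field name on "_".
names = ["sale_ord_id", "item_sku_id", "gmv"]
vocab, vecs = tfidf_vectors([name.split("_") for name in names])
print(vocab)
print(vecs[0])
```

A token such as `id` that appears in several field names receives a lower weight than a token that appears in only one, which is the intended effect of the inverse document frequency term.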

The word vector is obtained by splicing at least one of the first vector, the second vector and the third vector. For example, when the field information only includes a field name, a field type, or a field description, the finally obtained word vector is the corresponding vector, for example, when the field information only includes a field name, the word vector is the first vector; when the field information contains at least two of the field name, the field type and the field description, the finally obtained word vector is the concatenation of the corresponding vectors, for example, if the field information contains the field name and the field type, the word vector is the concatenation of the first vector and the second vector, or if the field information contains the field name, the field type and the field description, the word vector is the concatenation of the first vector, the second vector and the third vector. Those skilled in the art can understand that, for the acquisition of the word vector, the acquisition manners corresponding to the application phase and the training phase are the same, for example, the word vector in the training phase is obtained by splicing the first vector and the second vector, and the word vector in the application phase is also obtained by splicing the first vector and the second vector of the corresponding field.

For example, for table 1 as described above, when the fields in table 1 are classified, the vectors and classification results involved therein may be as shown in table 3:

TABLE 3

Field name	Field type	Field description	tfidf	word2vec	embedding	label
sale_ord_id	string	Order number	[3, 2, 1, 43, 422]	[13, 32, 41, 343, 422]	[3, 2, 1, 43, 422, 13, 32, 41, 343, 422]	1
item_sku_id	string	Commodity number	[4, 24, 22, 43, 22]	[24, 324, 242, 433, 242]	[4, 24, 22, 43, 22, 24, 324, 242, 433, 242]	1
dept_id_1	string	First-level department number	[65, 3, 554, 657, 43]	[32, 4, 54, 33, 33]	[65, 3, 554, 657, 43, 32, 4, 54, 33, 33]	1
dept_id_2	string	Second-level department number	[3221, 32, 45, 432]	[21, 32, 45, 432]	[3221, 32, 45, 432, 21, 32, 45, 432]	1
cate_1	string	First-level category number	[8231, 532, 15, 362]	[31, 532, 15, 362]	[8231, 532, 15, 362, 31, 532, 15, 362]	1
cate_2	string	Second-level category number	[731, 332, 235, 1562]	[931, 332, 235, 1562]	[731, 332, 235, 1562, 931, 332, 235, 1562]	1
gmv	double	Sales amount	[5431, 1232, 35, 562]	[531, 123, 35, 562]	[5431, 1232, 35, 562, 531, 123, 35, 562]	0
ord_num	int	Order quantity	[7331, 1232, 235, 562]	[733, 12, 235, 562]	[7331, 1232, 235, 562, 733, 12, 235, 562]	0

In table 3, embedding denotes the word vector used as the feature input of the classification model, and label denotes the classification result of whether a field is a dimension field: 1 indicates that the field is a dimension field, and 0 indicates that the field is not a dimension field (a non-dimension field).
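The splicing that produces the embedding column of table 3 can be sketched as a plain list concatenation. The helper name `concat_vectors` is hypothetical, and the input values are copied from the sale_ord_id row of table 3 rather than computed by real TF-IDF or word2vec models.

```python
def concat_vectors(*vectors):
    """Concatenate the available per-attribute vectors into one word vector.

    Vectors passed as None (attribute missing from the field information)
    are skipped; at least one vector must be present.
    """
    parts = [vec for vec in vectors if vec is not None]
    if not parts:
        raise ValueError("at least one vector is required")
    return [value for part in parts for value in part]


# Values taken from the sale_ord_id row of table 3.
tfidf_vec = [3, 2, 1, 43, 422]
word2vec_vec = [13, 32, 41, 343, 422]
embedding = concat_vectors(tfidf_vec, word2vec_vec)
print(embedding)  # [3, 2, 1, 43, 422, 13, 32, 41, 343, 422]
```

The same concatenation order has to be used in the training phase and the application phase so that each position of the word vector keeps the same meaning for the classification model.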

S503, extracting data correspondingly stored in the dimension field in the target table to obtain a second sub-table.

Non-dimension fields, such as the field named gmv in table 3, are generally not used for association with other tables and are not used as primary keys of tables, so they can be eliminated when the primary key is extracted.

Illustratively, data stored in correspondence with the dimension field in the target table shown in table 1 is extracted, and a second sub-table shown in table 4 is obtained:

TABLE 4

sale_ord_id	item_sku_id	dept_id_1	dept_id_2	cate_1	cate_2
1	1	1	2	2	2
1	1	2	2	2	3
1	1	1	3	2	2
5	5	6	6	7	7
5	5	5	6	7	7
5	6	5	6	7	6
5	5	5	6	7	7
9	8	9	9	8	8
9	9	8	9	9	8
9	9	8	8	8	8
9	9	8	9	8	8
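Steps S502 and S503 amount to a column projection driven by the classifier output. The sketch below assumes the classification results are already available as a mapping from field name to label (copied from table 3) and assumes that the columns of table 1 are ordered as the fields are listed in table 3; the function name `extract_dimension_columns` is hypothetical.

```python
def extract_dimension_columns(header, rows, labels):
    """Keep only the columns whose field is labeled 1 (dimension field)."""
    keep = [i for i, name in enumerate(header) if labels[name] == 1]
    sub_header = [header[i] for i in keep]
    sub_rows = [[row[i] for i in keep] for row in rows]
    return sub_header, sub_rows


# Assumed column order of table 1; labels copied from table 3.
header = ["sale_ord_id", "item_sku_id", "dept_id_1", "dept_id_2",
          "cate_1", "cate_2", "gmv", "ord_num"]
labels = {"sale_ord_id": 1, "item_sku_id": 1, "dept_id_1": 1, "dept_id_2": 1,
          "cate_1": 1, "cate_2": 1, "gmv": 0, "ord_num": 0}
rows = [[1, 1, 1, 2, 2, 2, 834.5, 453]]  # first row of table 1

sub_header, sub_rows = extract_dimension_columns(header, rows, labels)
print(sub_header)
print(sub_rows)  # [[1, 1, 1, 2, 2, 2]]
```

The projected row matches the first row of table 4, with the gmv and ord_num columns removed.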

S504, performing row number reduction processing on the second sub-table to obtain a first sub-table of the target table.

It is understood that S502 to S504 are further refinements of S202 in the steps shown in fig. 2.

For the specific implementation of S504, reference may be made to the implementation of "performing row number reduction processing on the target table to obtain a first sub-table of the target table", with "target table" replaced by "second sub-table". For example, the information entropy (entropy) of each row of data in the second sub-table shown in table 4 is determined, resulting in table 5:

TABLE 5

sale_ord_id	item_sku_id	dept_id_1	dept_id_2	cate_1	cate_2	entropy
1	1	1	2	2	2	1
1	1	2	2	2	3	1.459148
1	1	1	3	2	2	1.459148
5	5	6	6	7	7	1.584963
5	5	5	6	7	7	1.459148
5	6	5	6	7	6	1.459148
5	5	5	6	7	7	1.459148
9	8	9	9	8	8	1
9	9	8	9	9	8	0.918296
9	9	8	9	9	8	0.918296
9	9	8	9	8	8	1

Row clustering is then performed on each row of data in the second sub-table by adopting the first clustering algorithm, to obtain table 6:

TABLE 6

sale_ord_id	item_sku_id	dept_id_1	dept_id_2	cate_1	cate_2	entropy	cluster
1	1	1	2	2	2	1	1
1	1	2	2	2	3	1.459148	1
1	1	1	3	2	2	1.459148	1
5	5	6	6	7	7	1.584963	2
5	5	5	6	7	7	1.459148	2
5	6	5	6	7	6	1.459148	2
5	5	5	6	7	7	1.459148	2
9	8	9	9	8	8	1	3
9	9	8	9	9	8	0.918296	3
9	9	8	9	9	8	0.918296	3
9	9	8	9	8	8	1	3

In table 6, "cluster" is a cluster identifier, and there are three rows of clusters: 1,2,3.
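The cluster column of table 6 can be reproduced with a minimal pure-Python DBSCAN, shown here as a sketch only. The description does not give the distance measure or the parameters, so Euclidean distance, `eps=2.0` and `min_pts=2` are assumptions chosen for this example; a production system would more likely rely on an existing clustering library.

```python
import math
from collections import deque


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (1, 2, ... or -1 for noise)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, is not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# The six dimension columns of the rows in table 6.
rows = [
    [1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 2, 3], [1, 1, 1, 3, 2, 2],
    [5, 5, 6, 6, 7, 7], [5, 5, 5, 6, 7, 7], [5, 6, 5, 6, 7, 6],
    [5, 5, 5, 6, 7, 7],
    [9, 8, 9, 9, 8, 8], [9, 9, 8, 9, 9, 8], [9, 9, 8, 9, 9, 8],
    [9, 9, 8, 9, 8, 8],
]
labels = dbscan(rows, eps=2.0, min_pts=2)
print(labels)
```

With these assumed parameters the rows fall into three clusters in the same grouping as the cluster column of table 6.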

Then, the row data whose corresponding information entropy satisfies the preset condition are extracted from each row cluster to obtain the first sub-table. For example, the n rows of data with the largest entropy in each row cluster of table 7 are extracted, which correspond to the rows with zero-based indices 1, 2, 3, 7, and 10, to obtain table 8:

TABLE 7

sale_ord_id	item_sku_id	dept_id_1	dept_id_2	cate_1	cate_2	entropy	cluster
1	1	1	2	2	2	1	1
1	1	2	2	2	3	1.459148	1
1	1	1	3	2	2	1.459148	1
5	5	6	6	7	7	1.584963	2
5	5	5	6	7	7	1.459148	2
5	6	5	6	7	6	1.459148	2
5	5	5	6	7	7	1.459148	2
9	8	9	9	8	8	1	3
9	9	8	9	9	8	0.918296	3
9	9	8	8	8	8	0.918296	3
9	9	8	9	8	8	1	3

TABLE 8

sale_ord_id	item_sku_id	dept_id_1	dept_id_2	cate_1	cate_2	entropy	cluster
1	1	2	2	2	3	1.459148	1
1	1	1	3	2	2	1.459148	1
5	5	6	6	7	7	1.584963	2
9	8	9	9	8	8	1	3
9	9	8	9	8	8	1	3
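The step from table 7 to table 8 can be checked with a short sketch. The text speaks of the n rows with the largest entropy per cluster without giving n; the sketch assumes that every row tied for the maximum entropy of its cluster is kept, which is the reading that reproduces the five rows of table 8. The function name `max_entropy_rows` is hypothetical, and the two input lists are the entropy and cluster columns of table 7.

```python
def max_entropy_rows(entropy, cluster):
    """Zero-based indices of the rows whose entropy equals the maximum
    entropy of their own row cluster (ties are all kept)."""
    best = {}
    for value, label in zip(entropy, cluster):
        best[label] = max(value, best.get(label, value))
    return [
        i for i, (value, label) in enumerate(zip(entropy, cluster))
        if value == best[label]
    ]


# entropy and cluster columns of table 7, in row order
entropy = [1, 1.459148, 1.459148, 1.584963, 1.459148, 1.459148,
           1.459148, 1, 0.918296, 0.918296, 1]
cluster = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

print(max_entropy_rows(entropy, cluster))  # [1, 2, 3, 7, 10]
```

The returned positions are zero-based, so they correspond to the second, third, fourth, eighth and eleventh data rows of table 7, which are exactly the rows listed in table 8.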

S505, combining fields in the first sub-table, and if the data size of the first sub-table after de-duplication based on a field combination is equal to the data size of the target table after de-duplication, determining the field combination as the primary key of the target table.

Illustratively, a second clustering algorithm is adopted to perform column clustering on each column of data in the first sub-table shown in table 8, so as to obtain table 9:

TABLE 9

sale_ord_id	item_sku_id	dept_id_1	dept_id_2	cate_1	cate_2	entropy	cluster
1	1	2	2	2	3	1.459148	1
1	1	1	3	2	2	1.459148	1
5	5	6	6	7	7	1.584963	2
9	8	9	9	8	8	1	3
9	9	8	9	8	8	1	3

Referring to table 9, the column clustered categories include:

(sale_ord_id, item_sku_id);

(dept_id_1, dept_id_2);

(cate_1, cate_2).

Then, fields among the categories in the column clusters are combined to obtain:

(sale_ord_id, item_sku_id, dept_id_1, dept_id_2);

(sale_ord_id, item_sku_id, dept_id_1);

(sale_ord_id, item_sku_id, dept_id_2);

(sale_ord_id, dept_id_1, dept_id_2);

Finally, the primary key is verified: after de-duplication based on a field combination, if the resulting data size is the same as the data size after de-duplication based on all the fields, the field combination is considered to be a joint primary key.

The de-duplication calculation shows that (sale_ord_id, item_sku_id, dept_id_1, dept_id_2) and (item_sku_id, dept_id_1, dept_id_2) meet the requirement, and these two field combinations are determined as primary keys of the target table.
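The verification step can be sketched as a comparison of distinct-row counts. For simplicity this sketch takes both counts over the same small table (the dimension columns of table 8), following the sentence above about de-duplication based on all the fields; in the claimed method the reference count is the de-duplicated data size of the target table. The function name `is_candidate_key` is hypothetical.

```python
def is_candidate_key(header, rows, combo):
    """True if de-duplicating on `combo` keeps as many rows as
    de-duplicating on all fields."""
    idx = [header.index(name) for name in combo]
    distinct_projected = {tuple(row[i] for i in idx) for row in rows}
    distinct_full = {tuple(row) for row in rows}
    return len(distinct_projected) == len(distinct_full)


header = ["sale_ord_id", "item_sku_id", "dept_id_1", "dept_id_2",
          "cate_1", "cate_2"]
rows = [  # dimension columns of table 8
    [1, 1, 2, 2, 2, 3],
    [1, 1, 1, 3, 2, 2],
    [5, 5, 6, 6, 7, 7],
    [9, 8, 9, 9, 8, 8],
    [9, 9, 8, 9, 8, 8],
]

print(is_candidate_key(header, rows, ["item_sku_id", "dept_id_1", "dept_id_2"]))
print(is_candidate_key(header, rows, ["sale_ord_id"]))
```

On this data the combination (item_sku_id, dept_id_1, dept_id_2) keeps all five distinct rows, while sale_ord_id alone collapses them to three values and is therefore rejected.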

S506, sending the primary key to the client.

Optionally, if no field combination currently satisfies the requirement, the row clustering in S504 and the column clustering in S505 are repeated until the primary key of the target table is extracted.

The embodiment of the application has at least the following beneficial effects:

(1) The number of rows and the number of columns of the target table are reduced by means such as the classification model, the calculation of the information entropy of the row data of the table, and clustering, so that the amount of table data involved in the calculation process is greatly reduced and the calculation efficiency of primary key extraction is improved.

(2) In the process of extracting the primary key, joint primary keys are combined by means of clustering, and invalid combinations of fields within the same category are removed, which greatly improves the primary key extraction efficiency.

The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.

Fig. 6 is a schematic structural diagram of a primary key extraction device according to an embodiment of the present application. The embodiment of the application provides a primary key extraction device, which can be integrated on electronic equipment such as a server. As shown in fig. 6, the primary key extraction device 60 includes: an acquisition module 61, a processing module 62, a verification module 63 and a sending module 64. Wherein:

the acquiring module 61 is configured to acquire a target table in response to receiving a primary key extraction request from a client, where the primary key extraction request includes identification information of the target table;

the processing module 62 is configured to perform line number reduction processing on the target table to obtain a first sub-table of the target table;

the verification module 63 is configured to perform field combination on each field in the first sub-table, and determine that the field combination is a primary key of the target table if the data size of the first sub-table after the deduplication based on the field combination is equal to the data size of the target table after the deduplication;

and a sending module 64, configured to send the primary key to the client.

The apparatus provided in the embodiment of the present application may be used to execute the method in the embodiment shown in fig. 2, and the implementation principle and the technical effect are similar, which are not described herein again.

Referring to fig. 7, further to the structure shown in fig. 6, the processing module 62 may include:

the first determining submodule 621 is configured to determine information entropy of each row of data in the target table;

a clustering submodule 622, configured to perform row clustering on data in each row in the target table by using a first clustering algorithm;

the first extraction sub-module 623 is configured to extract line data in each line cluster, where corresponding information entropy meets a preset condition, to obtain a first sub-table.

In some embodiments, the first extraction sub-module 623 may be specifically configured to: extract the row data in each row cluster whose corresponding information entropy is greater than the preset information entropy to obtain the first sub-table; or, sort the row data in each row cluster in descending order of the corresponding information entropy and extract the row data corresponding to the top preset number of information entropies to obtain the first sub-table, where the preset number is determined according to the data size of the target table and the number of categories of the row clusters.

Optionally, the larger the difference between the data size of the target table and the number of categories of the row cluster is, the larger the preset number is.

In some embodiments, when the verification module 63 combines fields in the first sub-table, it may specifically be configured to: performing column clustering on each column of data in the first sub-table by adopting a second clustering algorithm; and combining fields among the categories in each column of clusters.

Further, the processing module 62 may include:

the second determining submodule 624 is configured to determine, according to field information of each field in the target table, a dimension field included in the target table;

the second extraction submodule 625 is configured to extract data stored in the target table corresponding to the dimension field, so as to obtain a second sub-table;

the processing sub-module 626 is configured to perform line number reduction processing on the second sub-table to obtain a first sub-table of the target table.

Optionally, the second determining submodule 624 may be specifically configured to: traversing each field in the target table, and obtaining a classification result of whether the field is a dimension field according to field information of the field and a classification model, wherein the classification model is used for identifying whether the field is the dimension field; and extracting the fields of which the classification results are dimension fields to obtain the dimension fields contained in the target table.

In some embodiments, when the second determining sub-module 624 is configured to obtain a classification result of whether a field is a dimension field according to the field information of the field and the classification model, the second determining sub-module may be specifically configured to: mapping field information of the fields into word vectors by adopting a preset mapping algorithm; and inputting the word vector into a classification model to obtain a classification result of whether the field is a dimension field.

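Further, when the second determining submodule 624 is configured to map the field information of the field into a word vector by using a preset mapping algorithm, it may be specifically configured to perform at least one of the following: if the field information contains the field name, mapping the field name into a first vector by adopting the TF-IDF algorithm; if the field information contains the field type, mapping the field type into a second vector by adopting the TF-IDF algorithm; if the field information contains the field description, mapping the field description into a third vector by adopting the word2vec algorithm; where the word vector is obtained by splicing at least one of the first vector, the second vector and the third vector.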

It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can be realized in the form of software called by processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the processing module may be a processing element separately set up, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a function of the processing module may be called and executed by a processing element of the apparatus. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.

For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a System-On-a-Chip (SOC).

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device may include: a processor 81, a memory 82, a communication interface 83, and a system bus 84. Wherein, the memory 82 and the communication interface 83 are connected to the processor 81 through the system bus 84 and complete mutual communication, the memory 82 is used for storing instructions, the communication interface 83 is used for communicating with other devices, and the processor 81 is used for calling the instructions in the memory to execute the scheme as described in the above embodiment of the primary key extraction method.

The system bus 84 mentioned in fig. 8 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus 84 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface 83 is used to enable communication between the database access device and other devices (e.g., clients, read-write libraries, and read-only libraries).

The Memory 82 may include a Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.

The Processor 81 may be a general-purpose Processor, including a central processing unit, a Network Processor (NP), and the like; but also a digital signal processor DSP, an application specific integrated circuit ASIC, a field programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program runs on an electronic device, the electronic device is enabled to execute the primary key extraction method according to any of the above method embodiments.

The embodiment of the present application further provides a chip for executing the instruction, where the chip is used to execute the primary key extraction method in any of the above method embodiments.

Embodiments of the present application further provide a computer program product, which includes a computer program stored in a computer-readable storage medium, from which the computer program can be read by at least one processor, and the at least one processor can implement the primary key extraction method according to any one of the above method embodiments when executing the computer program.

In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship; in a formula, the character "/" indicates that the associated objects before and after it are in a "division" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.

It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of the present application. In the embodiment of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A primary key extraction method is characterized by comprising:

obtaining a target table in response to receiving a primary key extraction request from a client, wherein the primary key extraction request comprises identification information of the target table;

performing line number reduction processing on the target table to obtain a first sub-table of the target table;

combining fields in the first sub-table, and if the data size of the first sub-table after de-duplication based on the field combination is equal to the data size of the target table after de-duplication, determining the field combination as a primary key of the target table;

and sending the primary key to the client.
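
By way of illustration only, the field-combination check of claim 1 might be sketched in Python as follows. The function names `distinct_count` and `find_primary_key`, the sample table, the use of a slice as a stand-in for row number reduction, and the choice to try smaller combinations first are all assumptions made for this sketch and are not taken from the application.

```python
from itertools import combinations

def distinct_count(rows, fields):
    """Number of distinct value tuples when rows are projected onto `fields`."""
    return len({tuple(row[f] for f in fields) for row in rows})

def find_primary_key(sub_table, target_table, fields):
    """Return the first field combination that passes the claimed check, else None.

    Assumption: combinations are tried from smallest to largest.
    """
    # Data size of the target table after whole-row de-duplication.
    target_size = distinct_count(target_table, fields)
    for k in range(1, len(fields) + 1):
        for combo in combinations(fields, k):
            # Data size of the sub-table after de-duplication on this combination.
            if distinct_count(sub_table, combo) == target_size:
                return combo
    return None

# Hypothetical sample data; the last row is an exact duplicate.
target = [
    {"order_id": 1, "sku": "A", "city": "Beijing"},
    {"order_id": 1, "sku": "B", "city": "Beijing"},
    {"order_id": 2, "sku": "A", "city": "Shanghai"},
    {"order_id": 2, "sku": "A", "city": "Shanghai"},
]
sub = target[:3]  # stand-in for the row number reduction step
primary_key = find_primary_key(sub, target, ["order_id", "sku", "city"])
print(primary_key)
```

In this sample no single field distinguishes all three distinct rows, so the sketch moves on to two-field combinations.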

2. The primary key extraction method according to claim 1, wherein the performing row number reduction processing on the target table to obtain a first sub-table of the target table comprises:

determining the information entropy of each row of data in the target table;

adopting a first clustering algorithm to perform row clustering on data of each row in the target table;

and extracting, from each row cluster, the row data whose corresponding information entropy satisfies a preset condition, to obtain the first sub-table.
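
The three steps of claim 2 might be sketched as below. The claim does not define how the information entropy of a row is computed or which clustering algorithm is used, so this sketch assumes Shannon entropy over the values inside a row and substitutes a simple leader clustering on Hamming distance for the first clustering algorithm. The preset condition is assumed here to be "highest entropy within the cluster". All function names and sample rows are hypothetical.

```python
import math
from collections import Counter

def row_entropy(row):
    """Shannon entropy (in bits) of the value distribution inside one row (assumed definition)."""
    counts = Counter(row)
    n = len(row)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def hamming(a, b):
    """Number of positions at which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))

def leader_cluster(rows, max_dist):
    """Stand-in clustering: join the first cluster whose leader is within max_dist, else open a new one."""
    leaders, labels = [], []
    for row in rows:
        for idx, leader in enumerate(leaders):
            if hamming(row, leader) <= max_dist:
                labels.append(idx)
                break
        else:
            leaders.append(row)
            labels.append(len(leaders) - 1)
    return labels

def reduce_rows(rows, max_dist):
    """Keep, for each row cluster, the row with the highest entropy (assumed preset condition)."""
    labels = leader_cluster(rows, max_dist)
    best = {}  # cluster label -> (entropy, row index) of the best row so far
    for i, (row, label) in enumerate(zip(rows, labels)):
        h = row_entropy(row)
        if label not in best or h > best[label][0]:
            best[label] = (h, i)
    return [rows[i] for _, i in sorted(best.values(), key=lambda t: t[1])]

# Hypothetical sample rows.
rows = [
    ("u1", "Beijing", "Beijing", "gold"),
    ("u1", "Beijing", "Haidian", "gold"),
    ("u2", "Shanghai", "Pudong", "silver"),
    ("u2", "Shanghai", "Shanghai", "silver"),
]
entropies = [row_entropy(r) for r in rows]
labels = leader_cluster(rows, max_dist=1)
first_sub_table = reduce_rows(rows, max_dist=1)
```

Under these assumptions a row with repeated values scores lower than a row whose values are all different, so the retained rows are the ones carrying the most varied content in each cluster.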

3. The primary key extraction method according to claim 2, wherein the extracting, from each row cluster, the row data whose corresponding information entropy satisfies a preset condition, to obtain the first sub-table comprises:

extracting, from each row cluster, the row data whose corresponding information entropy is greater than a preset information entropy, to obtain the first sub-table;

or, sorting the row data in each row cluster in descending order of corresponding information entropy, and extracting the row data corresponding to a preset number of top-ranked information entropies, to obtain the first sub-table, wherein the preset number is determined according to the data size of the target table and the number of row clusters.

4. The primary key extraction method according to claim 3, wherein the larger the difference between the data size of the target table and the number of row clusters, the larger the preset number.
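
The two alternatives of claim 3 and the monotonic relation of claim 4 might be sketched as follows, taking per-row entropies and cluster labels as given inputs. The claims state only that the preset number grows with the difference between the data size and the number of row clusters; the linear rule and the `scale` parameter in `preset_number` are hypothetical, as are all names and sample values.

```python
from collections import defaultdict

def select_above_threshold(entropies, threshold):
    """Alternative 1: indices of rows whose entropy exceeds a preset entropy.

    The same threshold is applied in every cluster, so labels are not needed here.
    """
    return [i for i, h in enumerate(entropies) if h > threshold]

def preset_number(table_size, num_clusters, scale=0.25):
    """Hypothetical rule: grows with (table_size - num_clusters), never below 1."""
    return max(1, int(scale * (table_size - num_clusters)))

def select_top_n(entropies, labels, n):
    """Alternative 2: per cluster, indices of the n rows with the highest entropy."""
    by_cluster = defaultdict(list)
    for i, label in enumerate(labels):
        by_cluster[label].append(i)
    kept = []
    for members in by_cluster.values():
        ranked = sorted(members, key=lambda i: entropies[i], reverse=True)
        kept.extend(ranked[:n])
    return sorted(kept)

# Hypothetical inputs: one entropy and one cluster label per row.
entropies = [1.5, 2.0, 2.0, 1.5, 1.0, 1.9]
labels = [0, 0, 1, 1, 1, 0]

above = select_above_threshold(entropies, threshold=1.8)
n = preset_number(table_size=len(entropies), num_clusters=2)
top = select_top_n(entropies, labels, n)
```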

5. The method of claim 1, wherein the combining fields in the first sub-table comprises:

performing column clustering on each column of data in the first sub-table by adopting a second clustering algorithm;

and combining fields across the categories obtained by the column clustering.
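
One possible reading of claim 5 is that fields falling in the same column cluster are treated as interchangeable, so candidate combinations are formed by drawing at most one field from each cluster. The sketch below follows that reading. It replaces the unspecified second clustering algorithm with a crude stand-in that groups columns inducing the same partition of the rows; the function names and sample data are hypothetical.

```python
from itertools import combinations, product

def partition_signature(rows, field):
    """Label each row by the order in which its value in `field` first appeared."""
    seen = {}
    return tuple(seen.setdefault(row[field], len(seen)) for row in rows)

def cluster_columns(rows, fields):
    """Stand-in column clustering: columns that split the rows identically share a cluster."""
    clusters = {}
    for f in fields:
        clusters.setdefault(partition_signature(rows, f), []).append(f)
    return list(clusters.values())

def cross_cluster_combinations(clusters, size):
    """Yield field combinations that take one field from each of `size` different clusters."""
    for chosen in combinations(clusters, size):
        yield from product(*chosen)

# Hypothetical first sub-table.
rows = [
    {"city_id": 1, "city_name": "Beijing", "sku": "A"},
    {"city_id": 1, "city_name": "Beijing", "sku": "B"},
    {"city_id": 2, "city_name": "Shanghai", "sku": "A"},
]
clusters = cluster_columns(rows, ["city_id", "city_name", "sku"])
pairs = list(cross_cluster_combinations(clusters, 2))
```

With this grouping the redundant pair (`city_id`, `city_name`) is never proposed as a candidate, which reduces the number of combinations to check.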

6. The primary key extraction method according to any one of claims 1 to 5, wherein the performing row number reduction processing on the target table to obtain a first sub-table of the target table comprises:

determining dimension fields contained in the target table according to field information of each field in the target table;

extracting the data stored under the dimension fields in the target table to obtain a second sub-table;

and performing row number reduction processing on the second sub-table to obtain the first sub-table of the target table.
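
The column-then-row reduction of claim 6 might be sketched as follows. The set `MEASURE_FIELDS` is a hypothetical stand-in for the dimension-field determination, and plain de-duplication is a hypothetical stand-in for the row number reduction processing; neither is taken from the application.

```python
MEASURE_FIELDS = {"amount"}  # hypothetical: every other field is treated as a dimension field

def is_dimension(field):
    """Stand-in for the dimension-field determination of the claim."""
    return field not in MEASURE_FIELDS

def extract_dimension_columns(rows, predicate):
    """Build the second sub-table: keep only the fields flagged as dimension fields."""
    fields = [f for f in rows[0] if predicate(f)]
    return [{f: row[f] for f in fields} for row in rows]

def drop_duplicate_rows(rows):
    """Stand-in row number reduction: keep the first copy of each distinct row."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Hypothetical target table; the first two rows are identical.
target = [
    {"order_id": 1, "sku": "A", "amount": 9.5},
    {"order_id": 1, "sku": "A", "amount": 9.5},
    {"order_id": 2, "sku": "B", "amount": 3.0},
]
second_sub_table = extract_dimension_columns(target, is_dimension)
first_sub_table = drop_duplicate_rows(second_sub_table)
```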

7. The method of claim 6, wherein the determining the dimension fields included in the target table according to the field information of each field in the target table comprises:

traversing each field in the target table, and obtaining a classification result of whether the field is a dimension field according to field information of the field and a classification model, wherein the classification model is used for identifying whether the field is the dimension field;

and extracting fields of which the classification results are dimension fields to obtain the dimension fields contained in the target table.

8. The method of claim 7, wherein obtaining a classification result of whether the field is a dimension field according to the field information of the field and a classification model comprises:

mapping the field information of the field into a word vector by adopting a preset mapping algorithm;

and inputting the word vector into the classification model to obtain a classification result of whether the field is a dimension field.

9. The method for extracting a primary key according to claim 8, wherein the mapping the field information of the field into a word vector by using a preset mapping algorithm includes at least one of:

if the field information contains a field name, mapping the field name into a first vector by adopting a term frequency-inverse document frequency (TF-IDF) algorithm, wherein the preset mapping algorithm comprises the TF-IDF algorithm;

if the field information contains a field type, mapping the field type into a second vector by adopting the TF-IDF algorithm, wherein the preset mapping algorithm comprises the TF-IDF algorithm;

if the field information contains a field description, mapping the field description into a third vector by adopting a word2vec algorithm, wherein the preset mapping algorithm comprises the word2vec algorithm;

the word vector is obtained by splicing at least one of the first vector, the second vector and the third vector.
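
The mapping and splicing described in claims 8 and 9 might be sketched as below. The TF-IDF variant (term frequency as count over document length, inverse document frequency as ln(N/df)) and the underscore tokenization are assumptions. No word2vec model is trained here: `EMBEDDINGS` is a hypothetical table of made-up 2-dimensional vectors standing in for word2vec output, and averaging them is likewise an assumption. The classification model itself is outside the scope of this sketch.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Plain TF-IDF over tokenized documents; returns (vocabulary, one vector per document)."""
    vocab = sorted({tok for doc in documents for tok in doc})
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        size = max(len(doc), 1)  # guard against empty documents
        vectors.append([counts[t] / size * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def mean_embedding(tokens, table, dim):
    """Average the embeddings of known tokens; all zeros if none are known."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Hypothetical embedding table standing in for a trained word2vec model.
EMBEDDINGS = {"user": [1.0, 0.0], "order": [0.0, 1.0], "identifier": [0.5, 0.5]}

# Hypothetical field information.
fields = [
    {"name": "user_id", "type": "bigint", "description": "user identifier"},
    {"name": "order_amount", "type": "double", "description": "order amount"},
    {"name": "order_id", "type": "bigint", "description": "order identifier"},
]

_, name_vecs = tfidf_vectors([f["name"].split("_") for f in fields])   # first vectors
_, type_vecs = tfidf_vectors([[f["type"]] for f in fields])            # second vectors
desc_vecs = [mean_embedding(f["description"].split(), EMBEDDINGS, 2) for f in fields]  # third vectors

# Splice (concatenate) the three vectors of each field into one word vector.
word_vectors = [n + t + d for n, t, d in zip(name_vecs, type_vecs, desc_vecs)]
```

Each resulting word vector would then be the input to the classification model of claim 7.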

10. A primary key extraction device, characterized by comprising:

an acquisition module, a storage module, and a processing module, wherein the acquisition module is configured to obtain a target table in response to a primary key extraction request received from a client, and the primary key extraction request comprises identification information of the target table;

and a sending module, configured to send the primary key to the client.

11. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 9 when executing the computer program.

12. A computer-readable storage medium, in which a computer program is stored which, when run on an electronic device, causes the electronic device to perform the method of any one of claims 1 to 9.

13. A computer program product comprising a computer program, characterized in that the computer program, when run on an electronic device, causes the electronic device to perform the method according to any of claims 1 to 9.