CN113379004B - Data table classification method and device, electronic equipment and storage medium

Info

Publication number: CN113379004B
Application number: CN202110844039.1A
Authority: CN
Inventors: 张霖云
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2021-07-26
Filing date: 2021-07-26
Publication date: 2023-04-14
Anticipated expiration: 2041-07-26
Also published as: CN113379004A

Abstract

Embodiments of the application provide a data table classification method and device, an electronic device, and a storage medium, which relate to the technical field of data processing and are used to improve the accuracy of data table classification. The method comprises: acquiring characteristic information of a data table to be classified, where the characteristic information comprises field information, table name information and source information, and the source information represents the source system corresponding to the data table to be classified; vectorizing the characteristic information to obtain a feature vector; inputting the feature vector into a trained classification model; classifying the feature vector based on the trained classification model to obtain at least one correlation matching degree between the data table to be classified and a set data table type; and determining the data table type corresponding to the data table to be classified according to the correlation matching degree. Because the data table type is determined from the field information, the table name information and the source information of the data table, the efficiency and accuracy of data table classification are improved.

Description

Data table classification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for classifying a data table, an electronic device, and a storage medium.

Background

With the continuous development of informatization, a large amount of data is generated and stored in different tables in information systems. Before the data can be fully utilized, it needs to be governed, and classifying the metadata tables stored in the information system is an indispensable step in that process.

At present, the existing technical scheme for classifying data tables is to manually classify metadata tables in a system. This method of classifying metadata tables is basically dependent on human subjective experience and is inefficient.

Disclosure of Invention

The embodiment of the application provides a data table classification method and device, electronic equipment and a storage medium, and the efficiency of data table classification can be improved.

In a first aspect, an embodiment of the present application provides a data table classification method, where the method includes:

acquiring characteristic information in a data table to be classified; the characteristic information comprises field information, table name information and source information, the table name information is used for representing the name of the data table to be classified, and the source information is used for representing a source system corresponding to the data table to be classified;

vectorizing the characteristic information to obtain a characteristic vector;

inputting the feature vectors into a trained classification model;

classifying the feature vectors based on the trained classification model to obtain the correlation matching degree of at least one to-be-classified data table and a set data table type;

and determining the data table type corresponding to the data table to be classified according to the correlation matching degree.

The feature vector is obtained based on the field information, the table name information and the source information of the data table to be classified, so that the feature vector can completely represent the characteristics of the data table to be classified; after the obtained feature vector is input into the trained classification model, at least one correlation matching degree is obtained based on the trained classification model, and the data table type corresponding to the data table to be classified can be accurately determined according to the correlation matching degree, so that the efficiency and the accuracy of determining the data table type corresponding to the data table to be classified are improved.

An optional implementation manner is that, the determining, according to the correlation matching degree, a data table type corresponding to the data table to be classified includes:

if only one correlation matching degree is obtained, taking the data table type corresponding to the correlation matching degree as the data table type corresponding to the data table to be classified; or,

if multiple correlation matching degrees are obtained, taking the data table type corresponding to the correlation matching degree meeting the set condition as the data table type corresponding to the data table to be classified.

In an optional embodiment, after outputting the correlation matching degree of at least one of the data tables to be classified and the set data table type, the method further includes:

if only one correlation matching degree is obtained, displaying the data table to be classified and the data table type corresponding to the correlation matching degree in a display interface, so that a user can determine whether the data table type corresponding to the data table to be classified is correct;

and if no determining instruction triggered by the user is received within a set time period, taking the data table type corresponding to the correlation matching degree as the data table type corresponding to the data table to be classified.

According to the method and the device, when only one correlation matching degree is obtained, the data table to be classified and the data table type corresponding to the correlation matching degree are displayed in the display interface, so that the user can confirm whether that data table type is correct for the data table to be classified. Allowing the data table type to be confirmed manually improves the accuracy of classifying the data table to be classified.

In an optional embodiment, after determining the correlation matching degree of the data table to be classified and the set data table type, the method further includes:

if multiple correlation matching degrees are obtained, displaying the data table to be classified, the set data table types, and the correlation matching degree between the data table to be classified and each set data table type in a display interface, so that a user can select the data table type corresponding to the data table to be classified from the set data table types;

and if the user does not determine the data table type corresponding to the data table to be classified within the set time period, taking the data table type corresponding to the correlation matching degree meeting the set condition as the data table type corresponding to the data table to be classified.

According to the method and the device, when multiple correlation matching degrees are obtained, the data table to be classified, the set data table types, and the correlation matching degree between the data table to be classified and each set data table type are displayed in the display interface, so that the user can select the data table type corresponding to the data table to be classified from the set data table types. Allowing the data table type to be determined manually improves the accuracy of classifying the data table to be classified.

In an alternative embodiment, the classification model is trained by:

acquiring a sample data set; the sample data set comprises characteristic information of a plurality of sample data tables and a data table type corresponding to each sample data table;

according to the sample data set, performing loop iterative training on the classification model, and outputting the trained classification model after the training is finished; wherein, the iterative training process in one loop comprises the following operations:

inputting the characteristic information of the sample data table into the classification model, and obtaining the reference data table type of the sample data table based on the classification model;

and determining a loss value based on the reference data table type of the sample data table and the data table type corresponding to the sample data table in the sample data set, and performing parameter adjustment on the classification model according to the loss value.
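The loop-iterative training described above can be sketched as follows. This is a minimal illustration only: it swaps in a small linear softmax classifier with a cross-entropy loss in place of whatever model the embodiment actually uses, and the toy feature vectors, label indices, learning rate and epoch count are all assumptions made for the example.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples: List[Tuple[List[float], int]], n_classes: int,
          epochs: int = 200, lr: float = 0.5) -> List[List[float]]:
    """Iteratively fit a linear softmax classifier; returns one weight row per class."""
    n_features = len(samples[0][0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):                 # loop iterative training
        for x, label in samples:
            # Forward pass: predicted ("reference") type probabilities for the sample.
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
            # The cross-entropy loss is -log(probs[label]); its gradient with respect
            # to the class scores is (probs - one_hot(label)).
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == label else 0.0)
                for j in range(n_features):
                    weights[c][j] -= lr * grad * x[j]   # parameter adjustment
    return weights


def predict(weights: List[List[float]], x: List[float]) -> int:
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return scores.index(max(scores))


# Toy feature vectors (hypothetical): label 0 = "personnel", label 1 = "vehicles".
samples = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
weights = train(samples, n_classes=2)
print([predict(weights, x) for x, _ in samples])
```

Each pass computes a predicted type for a sample, derives a loss against the labeled type, and adjusts the parameters, which mirrors the two operations listed for one training iteration.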

In an optional embodiment, the sample data set is obtained by:

obtaining a metadata table in a database;

performing feature extraction on the metadata table to obtain feature information of the metadata table;

performing clustering analysis on the characteristic information to obtain a clustering result;

matching the clustering result with the set data table type, and taking the data table type corresponding to the clustering result as the data table type corresponding to the metadata table corresponding to the clustering result;

and taking the characteristic information of the metadata table and the data table type corresponding to the metadata table as the characteristic information of the sample data table and the data table type corresponding to the sample data table to obtain the sample data set.

According to the embodiment of the application, cluster analysis is performed on the metadata tables to determine the data table type corresponding to each metadata table, and the characteristic information of each metadata table together with its data table type is used as the characteristic information and data table type of a sample data table to obtain the sample data set. This provides a large number of samples for training the classification model, avoids both a shortage of training samples and the need to obtain samples manually, and thereby improves the efficiency of obtaining a sample data set containing a large number of samples.

An optional implementation manner is that, the matching the clustering result with the set data table type, and taking the data table type corresponding to the clustering result as the data table type corresponding to the metadata table corresponding to the clustering result, includes:

if the clustering result is successfully matched with the set data table type, taking the data table type corresponding to the clustering result as the data table type corresponding to the metadata table corresponding to the clustering result; or,

and if the clustering result is unsuccessfully matched with the set data table type, adjusting the characteristic information until the clustering result obtained based on the adjusted characteristic information is successfully matched with the set data table type.

According to the embodiment of the application, after the clustering result obtained based on the characteristic information is determined to be failed to be matched with the set data table type, the characteristic information is adjusted until the clustering result obtained based on the adjusted characteristic information is successfully matched with the set data table type, so that the effective extraction of the characteristic information is ensured, and the accuracy of data table classification is improved.
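As a rough sketch of how a clustering result might be turned into labeled samples, the example below matches each cluster to a set data table type and labels every table in the cluster with that type. The text does not say how matching is performed, so the keyword-overlap rule, the `TYPE_KEYWORDS` lists and the function names are assumptions introduced purely for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword lists for two of the set data table types.
TYPE_KEYWORDS = {
    "personnel": {"xm", "nl"},
    "vehicles": {"cph"},
}


def match_cluster(cluster_tables: List[List[str]]) -> Optional[str]:
    """Match one cluster to the set type whose keywords overlap its fields most."""
    words = [w for table in cluster_tables for w in table]
    scores = {t: sum(w in kws for w in words) for t, kws in TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # None means matching failed


def build_samples(clusters: Dict[int, List[List[str]]]) -> List[Tuple[List[str], str]]:
    """Label every table in a matched cluster with the cluster's data table type."""
    samples = []
    for tables in clusters.values():
        table_type = match_cluster(tables)
        if table_type is None:
            # Matching failed: per the text, the feature information would be
            # adjusted and clustering repeated; this sketch simply skips the cluster.
            continue
        samples.extend((table, table_type) for table in tables)
    return samples


# Clustering result given as cluster id -> field lists of the tables in that cluster.
clusters = {0: [["xm", "nl"], ["xm"]], 1: [["cph", "color"]], 2: [["foo"]]}
samples = build_samples(clusters)
print(samples)
```

Cluster 2 shares no keyword with any set type, so it stands in for the failed-match branch in which the feature information would be adjusted.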

An optional implementation manner is that the vectorizing the feature information to obtain a feature vector includes:

preprocessing the characteristic information, and eliminating null data contained in the characteristic information to obtain processed characteristic information;

vectorizing the processed feature information to obtain a feature vector.

According to the embodiment of the application, null data in the feature information needs to be removed before vectorization of the feature information, so that the accuracy of the data is guaranteed, and the accuracy of data table classification is improved.

In a second aspect, an embodiment of the present application provides a data table classification apparatus, where the apparatus includes:

the acquisition unit is used for acquiring the characteristic information in the data table to be classified; the characteristic information comprises field information, table name information and source information, the table name information is used for representing the name of the data table to be classified, and the source information is used for representing a source system corresponding to the data table to be classified;

the processing unit is used for vectorizing the characteristic information to obtain a characteristic vector;

a determination unit for inputting the feature vectors to a trained classification model; classifying the feature vectors based on the trained classification model to obtain the correlation matching degree of at least one to-be-classified data table and a set data table type; and determining the data table type corresponding to the data table to be classified according to the correlation matching degree.

An optional implementation manner is that the determining unit is specifically configured to:

In an optional embodiment, after outputting the relevant matching degree of at least one of the data tables to be classified and the set data table type, the device further comprises a display unit;

the display unit is specifically configured to:

the determination unit is further configured to:

and if the determining instruction triggered by the user is not received within a set time period, taking the data table type corresponding to the correlation matching degree as the data table type corresponding to the data table to be classified.

In an optional embodiment, after outputting the relevant matching degree of at least one to-be-classified data table and the set data table type, the device further comprises a display unit;

the display unit is specifically configured to:

the determination unit is further configured to:

In an alternative embodiment, the apparatus further comprises a training unit; the training unit is specifically configured to:

In an optional embodiment, the apparatus further comprises a generating unit; the generating unit is specifically configured to:

obtaining a metadata table in a database;

An optional implementation manner is that the generating unit is specifically configured to:

An optional implementation manner is that the processing unit is specifically configured to:

vectorizing the processed feature information to obtain a feature vector.

In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program that is executable on the processor, and when the computer program is executed by the processor, the method for classifying a data table of any one of the above first aspects is implemented.

In a fourth aspect, the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the data table classification method in any one of the first aspect are implemented.

For technical effects brought by any one implementation manner in the second aspect to the fourth aspect, reference may be made to technical effects brought by a corresponding implementation manner in the first aspect, and details are not described here.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings may be obtained according to the drawings without inventive labor.

FIG. 1 is a schematic flowchart of a data table classifying method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a display interface provided in an embodiment of the present application;

FIG. 3 is a schematic view of another display interface provided in an embodiment of the present application;

FIG. 4 is a schematic view of a complete flow chart of a data table classification method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a data table classification apparatus according to an embodiment of the present application;

FIG. 6 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the accompanying drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.

Some terms appearing herein are explained below:

(1) Clustering analysis: refers to an analytical process that groups a collection of physical or abstract objects into classes composed of similar objects. It is an important human activity. The goal of cluster analysis is to group data on the basis of similarity. Clustering draws on many fields, including mathematics, computer science, statistics, biology and economics. Many clustering techniques have been developed in different application fields; they are used to describe data, measure the similarity between different data sources, and classify data sources into different clusters. Clustering methods include non-hierarchical clustering and density-based clustering.

(2) K-means method: a non-hierarchical clustering method that first creates k partitions, where k is the number of partitions to be created, and then uses an iterative relocation technique to improve partition quality by moving objects from one partition to another.

(3) DBSCAN (Density-Based Spatial Clustering of Applications with Noise): a density-based spatial clustering algorithm. The algorithm divides regions of sufficient density into clusters and can find clusters of arbitrary shape in a noisy spatial database; it defines a cluster as the largest set of density-connected points.

(4) TF-IDF (Term Frequency-Inverse Document Frequency): a commonly used weighting technique for information retrieval and data mining. TF is the term frequency, and IDF is the inverse document frequency. TF-IDF is a statistical method for evaluating how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between a document and a user query.
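To make the K-means description in item (2) concrete, the following is a minimal sketch of the assign-and-relocate loop on two-dimensional points. The fixed initial centers, the iteration count and the sample points are assumptions chosen so the example is deterministic; a practical implementation would usually choose the initial centers differently and stop on convergence.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: List[Point], centers: List[Point], iterations: int = 10) -> List[int]:
    """Return the partition index of every point after iterative relocation."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: move each object to the partition with the nearest center.
        labels = [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: recompute each center as the mean of its members.
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:   # keep the old center if a partition became empty
                centers[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels


points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels = k_means(points, centers=[points[0], points[2]])   # k = 2, fixed initial centers
print(labels)  # [0, 0, 1, 1]
```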

In the embodiment of the present application, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate that A exists alone, that A and B exist simultaneously, or that B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

At present, the existing technical scheme for data table classification is to manually classify metadata tables in a system. This method of classifying metadata tables is basically dependent on human subjective experience and is inefficient.

Based on the above problems, embodiments of the present application provide a data table classification method and apparatus, an electronic device, and a storage medium. The data table classification method can be applied to a terminal, such as a computer, and can also be applied to a server.

As shown in fig. 1, an embodiment of the present application provides a data table classification method, including the following steps:

Step S101, acquiring characteristic information in a data table to be classified.

Note that the feature information includes field information, table name information, and source information. The table name information is used for representing the name of the data table to be classified; the source information is used for representing a source system corresponding to the data table to be classified.

According to the method and the device, the data table to be classified can be obtained from the database, and the feature information of the obtained data table to be classified is extracted.

In some embodiments, after the metadata table acquired from the database is used as the data table to be classified, table name information, field information and source information in the data table to be classified are acquired, and the acquired information is used as feature information of the data table to be classified.

In some embodiments, the field information in the embodiments of the present application includes, but is not limited to, Chinese field information and English field information. For example, the Chinese field information may be name, age, license plate number, color, case number, etc.; the English field information may be xm, nl, cph, ajbh, etc. The table name information includes, but is not limited to, Chinese table name information and English table name information. For example, the Chinese table name information may be permanent population, temporary population, vehicle information, KTV information, and the like; the English table name information may be czrk, clxx, and the like.

For example, the embodiment of the application may use the metadata table obtained from the database as the data table to be classified, and obtain the feature information of the data table to be classified; the feature information includes field information (table data, such as name, age, license plate number, case number), table name information (such as temporary population, permanent population, vehicle information), and source information (such as system A and system B).
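One possible way to hold the three kinds of feature information together is sketched below. The `TableFeatures` class, its attribute names and the `as_text` helper are hypothetical and are not named in the text; the sample values reuse the abbreviations given above, and `system_A` is a placeholder source name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableFeatures:
    """Feature information of one data table to be classified (hypothetical container)."""
    table_name: str      # table name information, e.g. "czrk"
    fields: List[str]    # field information, e.g. ["xm", "nl"]
    source: str          # source information: the source system of the table

    def as_text(self) -> str:
        # Join all three kinds of feature information into one whitespace-separated
        # string so that it can be tokenized and vectorized later.
        return " ".join([self.table_name, *self.fields, self.source])


features = TableFeatures(table_name="czrk", fields=["xm", "nl", "cph"], source="system_A")
print(features.as_text())  # czrk xm nl cph system_A
```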

Step S102, vectorizing the characteristic information to obtain a characteristic vector.

After the characteristic information of the data table to be classified is obtained, the characteristic information is preprocessed, null value data contained in the characteristic information is removed, and the processed characteristic information is obtained; and vectorizing the processed feature information to obtain a feature vector.

In specific implementation, after the characteristic information is obtained, the characteristic information needs to be preprocessed, and some null value data in the characteristic information are removed to obtain the processed characteristic information; and performing vectorization processing on the processed feature information through a TF-IDF algorithm to obtain a feature vector.
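The preprocessing step can be illustrated with a short sketch. The text does not define what counts as null value data, so the `NULL_MARKERS` set and the `remove_nulls` helper below are assumptions for the example.

```python
from typing import Iterable, List, Optional

# Assumption: these placeholder strings count as "null value data".
NULL_MARKERS = {"", "null", "none", "nan"}


def remove_nulls(values: Iterable[Optional[str]]) -> List[str]:
    """Return the values with None and null-like strings removed, whitespace trimmed."""
    cleaned = []
    for value in values:
        if value is None:
            continue
        text = value.strip()
        if text.lower() in NULL_MARKERS:
            continue
        cleaned.append(text)
    return cleaned


raw_fields = ["xm", None, " nl ", "NULL", "", "cph"]
print(remove_nulls(raw_fields))  # ['xm', 'nl', 'cph']
```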

In implementation, the embodiment of the present application may obtain the feature vector in the following manner.

According to the method and the device, the importance of the words in the feature information is determined according to the TF-IDF algorithm, and the feature vector is obtained based on the obtained importance.

In some embodiments, the method includes performing word segmentation processing on feature information to obtain a plurality of words; and determining the total number of words included in the characteristic information and the number of times each word appears in the characteristic information.

In specific implementation, the word frequency corresponding to each word is determined according to the number of times that each word appears in the feature information and the total number of words.

In some embodiments, the word frequency may be determined by the following formula:

TF = x_i / X

wherein TF represents the word frequency; x_i represents the number of times the i-th word appears in the feature information; and X represents the total number of words.
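As a small worked example of the word frequency, the sketch below divides each word's count by the total number of words. Splitting on whitespace stands in for the word segmentation step and is an assumption, as is the helper name `term_frequencies`.

```python
from collections import Counter
from typing import Dict, List


def term_frequencies(words: List[str]) -> Dict[str, float]:
    """TF for each word: occurrences of the word divided by the total word count."""
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}


words = "czrk xm nl xm".split()   # whitespace tokenization is an assumption
tf = term_frequencies(words)
print(tf["xm"])  # 0.5
```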

After the word frequency is determined, the inverse document frequency corresponding to each word is determined according to the total number of documents stored in the preset corpus and the number of documents containing the word in the corpus.

In some embodiments, the present application embodiments may determine the inverse document frequency by the following formula:

wherein IDF represents the inverse document frequency corresponding to the word; N represents the total number of documents in the corpus; and N(x) represents the number of documents containing the word x.
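The formula itself did not survive in this text. As a hedged reconstruction, assuming the common unsmoothed definition, with N the total number of documents in the corpus and N(x) the number of documents containing the word x, the inverse document frequency would read:

```latex
\mathrm{IDF}(x) = \log \frac{N}{N(x)}
```

A smoothed variant such as \log\frac{N}{N(x)+1} is also widely used to avoid division by zero; which form the original formula takes cannot be determined from the surrounding text.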

After the word frequency and the inverse document frequency corresponding to each word are obtained, the weight corresponding to each word is determined based on the obtained word frequency and the inverse document frequency. The embodiment of the application obtains the feature vector based on the obtained weight corresponding to each word.

In some embodiments, the importance corresponding to a word may be determined by the following method:

TF-IDF = TF(x) * IDF(x)

wherein TF-IDF represents the importance corresponding to the word; TF(x) represents the word frequency corresponding to the word; and IDF(x) represents the inverse document frequency corresponding to the word.

In specific implementation, the matrix formed by the obtained importance degrees corresponding to the words is used as the feature vector in the embodiment of the present application.
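Putting the pieces together, a feature vector of TF-IDF weights might be computed as in the sketch below. It assumes an unsmoothed IDF of log(N / N(x)), treats each table's tokenized feature information as one document, and uses a tiny made-up corpus; the table name `ajxx` and the function name `tf_idf_vector` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tf_idf_vector(doc: List[str], corpus: List[List[str]], vocab: List[str]) -> List[float]:
    """TF-IDF weight of every vocabulary word for one tokenized document."""
    counts = Counter(doc)
    total = len(doc)
    n_docs = len(corpus)
    vector = []
    for word in vocab:
        tf = counts[word] / total if total else 0.0
        df = sum(1 for other in corpus if word in other)   # N(x)
        # Assumption: unsmoothed IDF = log(N / N(x)); words absent from the corpus get 0.
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(tf * idf)
    return vector


corpus = [
    ["czrk", "xm", "nl"],
    ["clxx", "cph", "xm"],
    ["ajxx", "ajbh", "xm"],
]
vocab = sorted({w for doc in corpus for w in doc})
vec = tf_idf_vector(corpus[0], corpus, vocab)
print(dict(zip(vocab, [round(v, 3) for v in vec])))
```

In this toy corpus the word `xm` occurs in every document, so its weight is zero, while `czrk` occurs in only one document and receives a positive weight. This is the behavior the definition of TF-IDF describes: words common to the whole corpus contribute little to distinguishing one table from another.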

Step S103, inputting the feature vector into the trained classification model.

It should be noted that the classification model in the embodiment of the present application may be a BERT (Bidirectional Encoder Representations from Transformers) model.

Step S104, classifying the feature vectors based on the trained classification model to obtain the correlation matching degree of at least one to-be-classified data table and the set data table type.

It should be noted that the set types of data tables include, but are not limited to: personnel, off-network places, on-network places, vehicles, on-network terminal equipment, front-end acquisition equipment, events, information, social units and natural organizations.

In some embodiments, the feature vectors may be classified based on a trained BERT model to obtain one correlation matching degree between the data table to be classified and a set data table type.

In other embodiments, the feature vectors may be classified based on the trained BERT model to obtain multiple correlation matching degrees, each between the data table to be classified and one of the set data table types.

Step S105, determining the data table type corresponding to the data table to be classified according to the correlation matching degree.

It should be noted that the data table type corresponding to the data table to be classified represents the subject to which the data table to be classified belongs.

In some embodiments, after one correlation matching degree is obtained based on the trained classification model, the data table type corresponding to the correlation matching degree is used as the data table type corresponding to the data table to be classified.

In other embodiments, after multiple correlation matching degrees are obtained based on the trained classification model, the data table type corresponding to the correlation matching degree meeting the set condition is used as the data table type corresponding to the data table to be classified.

The set condition may be either of the following:

the correlation matching degree is the maximum of the obtained correlation matching degrees; or

the correlation matching degree is greater than a set threshold.
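The two conditions above can be sketched as two small selection functions. The function names, the tie-breaking rule when several types exceed the threshold, and the `None` fallback are assumptions; the text only states the two conditions.

```python
from typing import Dict, Optional


def pick_by_max(degrees: Dict[str, float]) -> str:
    """Condition 1: the type with the largest correlation matching degree."""
    return max(degrees, key=degrees.get)


def pick_by_threshold(degrees: Dict[str, float], threshold: float) -> Optional[str]:
    """Condition 2: a type whose degree exceeds the threshold.

    Assumption: if several types exceed the threshold, the largest one wins;
    if none does, None is returned so the caller can fall back to manual review.
    """
    above = {t: d for t, d in degrees.items() if d > threshold}
    return max(above, key=above.get) if above else None


degrees = {"personnel": 0.7, "vehicles": 0.2, "events": 0.1}
print(pick_by_max(degrees))               # personnel
print(pick_by_threshold(degrees, 0.5))    # personnel
print(pick_by_threshold(degrees, 0.9))    # None
```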

According to the embodiment of the application, the feature vector is obtained based on the field information, the table name information and the source information of the data table to be classified, so that the feature vector can completely represent the characteristics of the data table to be classified; after the obtained feature vector is input into the trained classification model, at least one correlation matching degree is obtained based on the trained classification model, and the data table type corresponding to the data table to be classified can be accurately determined according to the correlation matching degree, so that the efficiency and the accuracy of determining the data table type corresponding to the data table to be classified are improved.

In some embodiments, after the obtained feature vectors are input into the trained classification model, the feature vectors are classified based on the trained classification model to obtain at least one correlation matching degree between the data table to be classified and the set data table type; the data table type corresponding to the data table to be classified is then determined according to the obtained one or more correlation matching degrees.

The following describes determining the data table type corresponding to the data table to be classified according to the obtained number of the relevant matching degrees.

Case one: a single correlation matching degree is obtained based on the trained classification model.

In some embodiments, the feature vectors are classified based on a trained classification model, and after a correlation matching degree between a data table to be classified and a set data table type is obtained, the data table type corresponding to the obtained correlation matching degree is used as the data table type corresponding to the data table to be classified.

For example, in the embodiment of the present application, if the trained classification model yields a correlation matching degree of 0.7 between the data table to be classified and the set data table type "personnel", then "personnel", which corresponds to the correlation matching degree of 0.7, is used as the data table type corresponding to the data table to be classified.

In other embodiments, after one correlation matching degree is obtained based on the trained classification model, the data table to be classified and the data table type corresponding to the correlation matching degree are displayed in the display interface, so that the user can determine whether the data table type corresponding to the data table to be classified is correct.

For example, as shown in fig. 2, after a correlation matching degree is obtained based on the trained classification model, the data table to be classified and the data table type corresponding to the correlation matching degree are displayed in the display interface. If the user triggers a determining instruction by pressing the 'yes' key in the display interface, the data table type determined by the trained classification model for the data table to be classified is determined to be correct. If the user triggers a determining instruction by pressing the 'no' key in the display interface, the data table type determined by the trained classification model is determined to be wrong, and the set data table types are displayed in the display interface so that the user can select, from the set data table types, the data table type corresponding to the data table to be classified according to the displayed content of the data table to be classified.

In some embodiments, if a determining instruction triggered by the user is not received within a set time period, the data table type corresponding to the correlation matching degree is used as the data table type corresponding to the data table to be classified.

For example, the set time period is 30 seconds. After the data table to be classified and the data table type corresponding to the correlation matching degree are displayed in the display interface, it is determined whether a determining instruction triggered by the user is received within 30 seconds; if so, the data table type corresponding to the data table to be classified is determined according to the determining instruction; if not, the data table type corresponding to the obtained correlation matching degree is used as the data table type corresponding to the data table to be classified.
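The timeout fallback described above can be sketched as follows. This is a minimal illustration only: the function name `confirm_or_default`, the use of a queue to carry the user's answer, and the convention that `None` means the 'yes' key are all assumptions made for the example and do not come from the embodiment.

```python
import queue


def confirm_or_default(predicted_type, answers, timeout_s=30.0):
    """Wait up to timeout_s for the user's determining instruction.

    answers is a queue.Queue fed by the display interface (hypothetical).
    An item of None stands for the 'yes' key; any other item is the
    data table type the user selected after pressing the 'no' key.
    """
    try:
        answer = answers.get(timeout=timeout_s)
    except queue.Empty:
        # No determining instruction within the set time period:
        # fall back to the type proposed by the classification model.
        return predicted_type
    return predicted_type if answer is None else answer
```

With an empty queue the call returns the model's proposal once the timeout elapses; if the interface has queued a corrected type, that type is returned instead.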

Case two: multiple correlation matching degrees are obtained based on the trained classification model.

In some embodiments, after the feature vector is classified based on the trained classification model to obtain multiple correlation matching degrees between the data table to be classified and the set data table types, the data table type corresponding to the correlation matching degree meeting the set condition is used as the data table type corresponding to the data table to be classified.

Specifically, in some embodiments, the data table type corresponding to the obtained maximum value of the correlation matching degree is used as the data table type corresponding to the data table to be classified.

For example, the set data table types are personnel, vehicles and natural organizations, and the set condition is that the correlation matching degree is the maximum of the multiple correlation matching degrees. The feature vector is input into the trained classification model (such as a BERT model) to obtain the correlation matching degrees between the data table to be classified and personnel, vehicles and natural organizations, which are 0.8, 0.5 and 0.3 respectively. The maximum correlation matching degree obtained is 0.8, so personnel, the type corresponding to 0.8, is used as the data table type corresponding to the data table to be classified.
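The maximum-value condition in this example can be sketched as follows. The function name `pick_by_max` and the dictionary representation of the matching degrees are assumptions made for illustration.

```python
def pick_by_max(degrees):
    """Return the data table type with the largest correlation matching degree.

    degrees maps each set data table type to its matching degree.
    """
    return max(degrees, key=degrees.get)


degrees = {"personnel": 0.8, "vehicles": 0.5, "natural organizations": 0.3}
chosen = pick_by_max(degrees)  # "personnel"
```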

In other embodiments, the data table type corresponding to the correlation matching degree greater than the set threshold is used as the data table type corresponding to the data table to be classified.

For example, the set data table types are personnel, vehicles and natural organizations, and the set threshold is 0.8. The feature vector is input into the trained classification model (such as a BERT model) to obtain the correlation matching degrees between the data table to be classified and personnel, vehicles and natural organizations, which are 0.9, 0.5 and 0.3 respectively. Personnel, the type corresponding to the correlation matching degree of 0.9, which is greater than the set threshold of 0.8, is used as the data table type corresponding to the data table to be classified.
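The threshold condition can be sketched in the same style. The function name `types_above_threshold` is hypothetical, and returning a list (which may be empty or hold several types) is a choice made for the sketch; the embodiment does not state how those situations are handled.

```python
def types_above_threshold(degrees, threshold):
    """Return every data table type whose matching degree is strictly
    greater than the set threshold, in the order given."""
    return [name for name, degree in degrees.items() if degree > threshold]


degrees = {"personnel": 0.9, "vehicles": 0.5, "natural organizations": 0.3}
selected = types_above_threshold(degrees, 0.8)  # ["personnel"]
```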

In other embodiments, after the feature vector is classified based on the trained classification model to obtain multiple correlation matching degrees between the data table to be classified and the set data table types, the data table to be classified, the set data table types and the correlation matching degrees between the data table to be classified and the set data table types are displayed in the display interface, so that the user can determine the data table type corresponding to the data table to be classified from the set data table types.

In some embodiments, when the set data table types and the correlation matching degrees are displayed in the display interface, the set data table types are sorted in descending order of correlation matching degree.
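The display ordering can be sketched as a plain descending sort. The function name `rank_types` and the list-of-pairs output are assumptions for illustration; how the interface renders the list is not specified here.

```python
def rank_types(degrees):
    """Order (type, matching degree) pairs from the largest degree to the smallest."""
    return sorted(degrees.items(), key=lambda item: item[1], reverse=True)


rows = rank_types({"vehicles": 0.5, "personnel": 0.8, "natural organizations": 0.3})
```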

For example, as shown in fig. 3, in the embodiment of the present application, a to-be-classified data table, a set data table type, and a correlation matching degree between the to-be-classified data table and the set data table type are displayed in a display interface; the user can select the data table type corresponding to the data table to be classified from the set data table types according to the information of the data table to be classified displayed in the display interface.

In some embodiments, if the user does not determine the data table type corresponding to the data table to be classified within the set time period, the data table type corresponding to the correlation matching degree meeting the set condition is used as the data table type corresponding to the data table to be classified.

For example, the set time period is 30 seconds. After the data table to be classified, the set data table types and the correlation matching degrees between the data table to be classified and the set data table types are displayed in the display interface, if it is determined that the user has not selected a data table type for the data table to be classified in the display interface within 30 seconds, the data table type corresponding to the correlation matching degree meeting the set condition is used as the data table type corresponding to the data table to be classified.

In some embodiments, before the trained classification model is used to obtain the correlation matching degree between the data table to be classified and the set data table types, the classification model is trained to obtain the trained classification model.

In a specific implementation, the classification model is trained as follows:

The embodiment of the application acquires a sample data set from a database.

The sample data set includes feature information of a plurality of sample data tables and a data table type corresponding to each sample data table.

In some embodiments, the sample data set in the embodiments of the present application is obtained by:

A metadata table in a database is obtained, and feature extraction is performed on the metadata table to obtain the feature information of the metadata table.

In some embodiments, after the feature information of the metadata table is obtained, the feature information is subjected to clustering analysis to obtain a clustering result.

In specific implementation, the embodiment of the application can perform clustering analysis on the characteristic information through a K-means algorithm to obtain a clustering result.

In implementation, k pieces of feature information are selected from the feature information as initial cluster centers, and the similarity between each remaining piece of feature information and each cluster center is determined; each remaining piece of feature information is assigned to the cluster with the highest similarity to obtain updated clusters, and the cluster centers are recalculated; these steps are repeated until the mean square error converges, yielding the clustering result.
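The K-means steps above can be sketched in plain Python as follows. Several points are assumptions made only for the example: the feature information is taken to be already encoded as numeric tuples, the first k vectors serve as the initial cluster centers (the text only says that k are selected), and "highest similarity" is interpreted as smallest squared Euclidean distance.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


def kmeans(points, k, max_iter=100, tol=1e-9):
    """Cluster numeric tuples; returns (labels, centers)."""
    centers = list(points[:k])  # assumption: first k vectors as initial centers
    previous_mse = float("inf")
    labels = []
    for _ in range(max_iter):
        # Assign every vector to the nearest (most similar) cluster center.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        # Recompute each center as the mean of its members (keep it if empty).
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centers[i] = centroid(members)
        # Stop once the mean square error no longer decreases.
        mse = sum(
            squared_distance(p, centers[label]) for p, label in zip(points, labels)
        ) / len(points)
        if previous_mse - mse <= tol:
            break
        previous_mse = mse
    return labels, centers
```

Production code would more likely rely on an existing library implementation with a better initialization strategy; the loop is written out here only to mirror the steps in the text.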

In other embodiments, the embodiment of the present application may further perform cluster analysis on the feature information through a DBSCAN algorithm to obtain a clustering result.

In specific implementation, the ε-neighborhood set of each piece of feature information is determined using a distance metric; if the number of pieces of feature information in the neighborhood set is greater than or equal to a set number threshold, that piece of feature information is determined to be a core object; the distance from each piece of feature information other than the core objects to a core object is determined, and when the distance is less than ε, that piece of feature information is determined to be in the same cluster as the corresponding core object; this continues until all core objects have been processed, thereby obtaining the clustering result.
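A compact sketch of DBSCAN is given below for orientation. It follows the commonly described form of the algorithm, which also merges core objects that lie within each other's neighborhoods and uses distance less than or equal to ε; it is not claimed to match the embodiment step for step. The numeric-tuple encoding, the Euclidean metric and the label -1 for noise are assumptions of the example.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    # Epsilon-neighborhood of every point (each point counts itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # not a core object; may later join a cluster as a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # j is itself a core object: keep expanding
    return labels
```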

After the clustering result is obtained, the clustering result is matched with the set data table type, and the data table type corresponding to the clustering result is used as the data table type corresponding to the metadata table corresponding to the clustering result.

In some embodiments, if the clustering result is successfully matched with the set data table type, the data table type corresponding to the clustering result is used as the data table type corresponding to the metadata table corresponding to the clustering result.

In specific implementation, the embodiment of the application groups feature information with high similarity into one class through cluster analysis to obtain a clustering result; after determining that the clustering result is successfully matched with the set data table types, that is, that the clustering result is consistent with the expected clustering effect, the data table type corresponding to the clustering result is determined according to the correspondence between the clustering result and the set data table types.

After the data table type corresponding to the clustering result is determined, the corresponding relationship between the feature information and the set data table type can be obtained, as shown in table 1:

Table 1: Correspondence between feature information and set data table types

Set data table type	Feature information
Personnel	Name, sex, identity card number, height, weight, age, etc.
Vehicle	License plate number, color, engine, vehicle type, vehicle brand, etc.
Case	Case number, case time, case condition, etc.
Front-end acquisition equipment	Camera, Internet of Things sensor, etc.
Off-net location	Internet bar, KTV, bank, government agency, key unit, etc.
……	……

After the data table type corresponding to the clustering result is determined, the data table type corresponding to the clustering result is used as the data table type corresponding to the metadata table corresponding to the clustering result.
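One way the matching between a clustering result and the set data table types might be carried out is sketched below. The embodiment does not specify the matching rule, so everything here is an assumption for illustration: the name `match_cluster`, the use of Jaccard overlap between field-name sets, the reference sets (abridged from Table 1) and the 0.5 cut-off.

```python
# Reference field names per set data table type (abridged, hypothetical encoding).
TYPE_FIELDS = {
    "personnel": {"name", "sex", "identity card number", "height", "weight", "age"},
    "vehicle": {"license plate number", "color", "engine", "vehicle type", "vehicle brand"},
    "case": {"case number", "case time", "case condition"},
}


def match_cluster(cluster_fields, type_fields=TYPE_FIELDS, min_overlap=0.5):
    """Return the best-matching data table type, or None when matching fails."""
    best_type, best_score = None, 0.0
    for type_name, reference in type_fields.items():
        # Jaccard overlap: shared fields divided by all fields seen in either set.
        score = len(cluster_fields & reference) / len(cluster_fields | reference)
        if score > best_score:
            best_type, best_score = type_name, score
    return best_type if best_score >= min_overlap else None
```

A `None` result corresponds to a failed match, after which the feature information would be adjusted and clustered again.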

In other embodiments, if the clustering result fails to match the set data table type, the feature information is adjusted until the clustering result obtained based on the adjusted feature information is successfully matched with the set data table type.

In specific implementation, the embodiment of the application groups feature information with high similarity into one class through cluster analysis to obtain a clustering result, and adjusts the feature information after determining that the clustering result fails to match the set data table types; cluster analysis is then performed on the adjusted feature information, and once the resulting clustering result is determined to match the set data table types successfully, that is, once the clustering result has been adjusted to the expected clustering effect, the data table type corresponding to the clustering result is determined according to the correspondence between the clustering result and the set data table types.

In some embodiments, the methods for adjusting the feature information in the embodiments of the present application include, but are not limited to, screening and adding feature information.

According to the embodiment of the application, the feature information is iteratively optimized, so that the clustering result obtained by cluster analysis of the feature information is consistent with the expected clustering effect.

The data table type corresponding to the determined clustering result is used as the data table type corresponding to the metadata table corresponding to the clustering result.

In other embodiments, the data table type corresponding to the metadata table may also be determined by a clustering model in the embodiments of the present application.

In specific implementation, the metadata table is input into a trained clustering model, the feature information of the metadata table is extracted based on the trained clustering model, clustering analysis is performed on the feature information, and the data table type corresponding to the metadata table is determined.

In some embodiments, the trained clustering model is obtained in the following manner.

In specific implementation, the metadata table is obtained from a database, and feature extraction is performed on the metadata table to obtain feature information of the metadata table; inputting the characteristic information of the metadata table into the clustering model as training data, and training the clustering model through a K-means algorithm to obtain a clustering result; and determining a loss value according to the clustering result and the clustering effect, and adjusting parameters of the clustering model according to the loss value until the clustering result achieves the expected effect to obtain the trained clustering model.

After the data table type corresponding to the metadata table is obtained, the characteristic information of the metadata table and the data table type corresponding to the metadata table are used as the characteristic information of the sample data table and the data table type corresponding to the sample data table to obtain a sample data set, and the obtained sample data set is stored in a database.

According to the sample data set, the embodiment of the application performs iterative training in a loop on the classification model, and outputs the trained classification model when the training is finished; wherein one iteration of the training loop comprises the following operations:

inputting the characteristic information of the sample data table into a classification model, and obtaining a reference data table type of the sample data table based on the classification model;

For example, the classification model is a BERT model, and the BERT model can be trained by executing a training script in the embodiment of the present application; the content of the training script is as follows:

python run_pretraining.py \
  --input_file=./records/*.tfrecord \
  --output_dir=./bert-dwd \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./bert-mini/bert_config.json \
  --train_batch_size=128 \
  --eval_batch_size=128 \
  --max_seq_length=256 \
  --max_predictions_per_seq=32 \
  --learning_rate=1e-4

As shown in fig. 4, an embodiment of the present application provides a complete flow diagram of a data table classification method, including the following steps:

Step S401, obtaining feature information in the data table to be classified.

Step S402, vectorization processing is carried out on the feature information to obtain a feature vector.

In specific implementation, the embodiment of the application may preprocess the feature information to remove null data contained in it, obtaining processed feature information; the processed feature information is then vectorized using the TF-IDF algorithm to obtain the feature vector.
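The preprocessing and TF-IDF step can be sketched as follows. The assumptions are these: each data table's feature information is represented as a list of strings (field names, table name, source system), null data means `None` or blank strings, and the smoothed IDF variant `log((1 + N) / (1 + df)) + 1` is used; the embodiment does not state which TF-IDF variant it applies.

```python
import math
from collections import Counter


def clean(tokens):
    """Drop null entries (None or blank strings) and normalize the rest."""
    return [t.strip().lower() for t in tokens if t is not None and t.strip()]


def tfidf_vectors(tables):
    """Return (vocabulary, one TF-IDF vector per table)."""
    docs = [clean(tokens) for tokens in tables]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many tables each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty table
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical feature information for two data tables.
tables = [
    ["name", "sex", None, "", "hr_system"],
    ["license plate number", "color", "traffic_system"],
]
vocab, vectors = tfidf_vectors(tables)
```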

Step S403, inputting the feature vector into the trained classification model.

It should be noted that the trained classification model may be a BERT model.

Step S404, classifying the feature vector based on the trained classification model to obtain at least one correlation matching degree between the data table to be classified and the set data table types.

Step S405, determining whether exactly one correlation matching degree is obtained; if yes, go to step S406; if not, go to step S409.

Step S406, taking the data table type corresponding to the correlation matching degree as the data table type corresponding to the data table to be classified.

Step S407, displaying the data table to be classified and the data table type corresponding to the correlation matching degree in the display interface, so that the user can determine whether the data table type corresponding to the data table to be classified is correct.

Step S408, if a determining instruction triggered by the user is not received within the set time period, taking the data table type corresponding to the correlation matching degree as the data table type corresponding to the data table to be classified.

Step S409, taking the data table type corresponding to the correlation matching degree meeting the set condition as the data table type corresponding to the data table to be classified.

The set condition may be:

the correlation matching degree is the maximum value of the multiple correlation matching degrees; or,

the correlation matching degree is larger than a set threshold value.

Step S410, displaying the data table to be classified, the set data table types and the correlation matching degrees between the data table to be classified and the set data table types in the display interface, so that the user can determine the data table type corresponding to the data table to be classified from the set data table types.

Step S411, if it is determined that the user does not determine the data table type corresponding to the data table to be classified within the set time period, the data table type corresponding to the correlation matching degree meeting the set condition is taken as the data table type corresponding to the data table to be classified.
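The branching in steps S405 to S411 can be summarized in a short sketch. The function name `decide_type` and its parameters are hypothetical, `user_choice=None` stands for the case where no user input arrives within the set time period, and only the maximum-value condition is shown for step S409.

```python
def decide_type(degrees, user_choice=None):
    """Pick the data table type from the matching degrees of step S404.

    degrees maps each candidate data table type to its matching degree.
    """
    if not degrees:
        raise ValueError("at least one correlation matching degree is required")
    if user_choice is not None:
        # S407 / S410: the user answered in the display interface in time.
        return user_choice
    if len(degrees) == 1:
        # S405 "yes" branch, then S406 / S408: the single candidate is used.
        return next(iter(degrees))
    # S409 / S411: apply the set condition (here: the maximum matching degree).
    return max(degrees, key=degrees.get)
```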

Based on the same inventive concept as the data table classification method, an embodiment of the application also provides a data table classification device. Because the principle by which the device solves the problem is similar to that of the method, the implementation of the device may refer to the method embodiment, and repeated parts are not described again.

As shown in fig. 5, an embodiment of the present application provides a block diagram of a data table classifying device, where the device includes:

an obtaining unit 501, configured to obtain feature information in a data table to be classified; the characteristic information comprises field information, table name information and source information, wherein the table name information is used for representing the name of the data table to be classified, and the source information is used for representing a source system corresponding to the data table to be classified;

the processing unit 502 is configured to perform vectorization on the feature information to obtain a feature vector;

a determining unit 503, configured to input the feature vector into the trained classification model; classifying the feature vectors based on the trained classification model to obtain the correlation matching degree of at least one data table to be classified and the set data table type; and determining the data table type corresponding to the data table to be classified according to the correlation matching degree.

An optional implementation manner is that the determining unit 503 is specifically configured to:

In an optional embodiment, after outputting the associated matching degree of the at least one data table to be classified and the set data table type, the apparatus further includes a display unit 504;

the display unit 504 is specifically configured to:

if the correlation matching degree is one, displaying the data table to be classified and the data table type corresponding to the correlation matching degree in a display interface so that a user can determine whether the data table type corresponding to the data table to be classified is correct or not;

the determining unit 503 is further configured to:

the display unit 504 is specifically configured to:

the determining unit 503 is further configured to:

and if it is determined that the user has not determined the data table type corresponding to the data table to be classified within the set time period, taking the data table type corresponding to the correlation matching degree meeting the set condition as the data table type corresponding to the data table to be classified.

In an optional embodiment, the apparatus further comprises a training unit 505; the training unit 505 is specifically configured to:

according to the sample data set, performing loop iterative training on the classification model, and outputting the trained classification model when the training is finished; wherein, the iterative training process in one loop comprises the following operations:

In an optional embodiment, the apparatus further comprises a generating unit 506; the generating unit 506 is specifically configured to:

obtaining a metadata table in a database;

carrying out clustering analysis on the characteristic information to obtain a clustering result;

Based on the same inventive concept as the data table classification method, an embodiment of the application also provides an electronic device. Because the principle by which the electronic device solves the problem is similar to that of the data table classification method, the implementation of the electronic device may refer to the method embodiment, and repeated parts are not described again. As shown in fig. 6, for convenience of illustration, only the portions related to the embodiments of the present application are shown; for specific technical details that are not disclosed, refer to the method embodiments of the present application. The electronic device may be a terminal or a server.

In this embodiment, the electronic device may be configured as shown in fig. 6, and include a memory 131, a communication module 133, and one or more processors 132.

A memory 131 for storing computer programs executed by the processor 132. The memory 131 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.

The processor 132 may include one or more Central Processing Units (CPUs), or be a digital processing unit, etc. The processor 132 is used for implementing the above-mentioned data table classification method when calling the computer program stored in the memory 131.

The communication module 133 is configured to perform communication to obtain a data table to be classified and a sample data set.

The specific connection medium among the memory 131, the communication module 133 and the processor 132 is not limited in the embodiments of the present application. In fig. 6, the memory 131 and the processor 132 are connected by a bus 134, the bus 134 is represented by a thick line in fig. 6, and the connection manner between other components is merely illustrative and not limited thereto. The bus 134 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.

For the data table classification method, the embodiment of the application also provides a storage medium readable by a computing device, namely a non-volatile storage medium whose content is not lost after power-off. The storage medium stores a software program comprising program code which, when read and executed on a computing device by one or more processors, implements any of the above data table classification methods of the embodiments of the present application.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method for classifying a data table, the method comprising:

vectorizing the characteristic information to obtain a characteristic vector;

inputting the feature vectors into a trained classification model; the trained classification model is obtained by training the classification model according to the sample data set; the sample data set comprises characteristic information of a plurality of sample data tables and a data table type corresponding to each sample data table;

determining the data table type corresponding to the data table to be classified according to the correlation matching degree;

acquiring the sample data set by the following method:

obtaining a metadata table in a database;

matching the clustering result with the set data table type, and if the clustering result is successfully matched with the set data table type, taking the data table type corresponding to the clustering result as the data table type corresponding to the metadata table corresponding to the clustering result; if the clustering result is unsuccessfully matched with the set data table type, adjusting the characteristic information until the clustering result obtained based on the adjusted characteristic information is successfully matched with the set data table type;

2. The method according to claim 1, wherein the determining the type of the data table corresponding to the data table to be classified according to the correlation matching degree comprises:

3. The method according to claim 1, wherein after outputting the associated matching degree of at least one of the data tables to be classified and the set data table type, the method further comprises:

4. The method of claim 1, wherein after determining the associated degree of match between the data table to be classified and the set data table type, the method further comprises:

if there are multiple correlation matching degrees, displaying the data table to be classified, the set data table types and the correlation matching degrees between the data table to be classified and the set data table types in a display interface, so that a user can determine the data table type corresponding to the data table to be classified from the set data table types;

5. The method according to any one of claims 1 to 4, wherein the classification model is trained by:

6. The method of claim 1, wherein the vectorizing the feature information to obtain a feature vector comprises:

vectorizing the processed feature information to obtain a feature vector.

7. An apparatus for classifying a data table, the apparatus comprising:

the acquiring unit is used for acquiring the characteristic information in the data table to be classified; the characteristic information comprises field information, table name information and source information, wherein the table name information is used for representing the name of the data table to be classified, and the source information is used for representing a source system corresponding to the data table to be classified;

a determination unit for inputting the feature vectors to a trained classification model; the trained classification model is obtained by training the classification model according to a sample data set; the sample data set comprises characteristic information of a plurality of sample data tables and a data table type corresponding to each sample data table; classifying the feature vectors based on the trained classification model to obtain the correlation matching degree of at least one to-be-classified data table and a set data table type; determining the data table type corresponding to the data table to be classified according to the correlation matching degree;

obtaining the sample data set by the following method:

obtaining a metadata table in a database;

8. The apparatus according to claim 7, wherein the determining unit is specifically configured to:

9. The apparatus according to claim 7, wherein after outputting the associated matching degree of at least one of the data tables to be classified and the set data table type, the determining unit is further configured to:

10. The apparatus according to claim 7, wherein after determining the associated matching degree of the data table to be classified and the set data table type, the determining unit is further configured to:

11. The apparatus according to any of the claims 7 to 10, further comprising a training unit, in particular for training the classification model;

the classification model is trained in the following way:

inputting the characteristic information of the sample data table into the classification model, and obtaining a reference data table type of the sample data table based on the classification model;

12. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, the computer program, when executed by the processor, implementing the method of any of claims 1-6.

13. A computer-readable storage medium having a computer program stored therein, the computer program characterized by: the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.