CN113268477A

CN113268477A - Data table cleaning method and device and server

Info

Publication number: CN113268477A
Application number: CN202110633592.0A
Authority: CN
Inventors: 杨雄威; 韦星宁; 李奕锴
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2021-06-07
Filing date: 2021-06-07
Publication date: 2021-08-17
Anticipated expiration: 2041-06-07
Also published as: CN113268477B

Abstract

The invention provides a data table cleaning method, a device and a server, wherein the method comprises the following steps: acquiring an index set of a database in a preset time period, inputting the index set into a preset random forest model, acquiring evaluation values of all data tables in the database, and finally cleaning the data table with the minimum evaluation value in the database to acquire the cleaned database. Model training is carried out by adopting a random forest algorithm according to the storage codes of the database, the metadata information of all the data tables in the database and the evaluation value of each data table, so that a random forest model which implies the relation characteristics among the data tables in the database is obtained, the evaluation values of all the data tables in the database are predicted, and the accuracy of cleaning the data tables is improved.

Description

Data table cleaning method and device and server

Technical Field

The invention relates to the technical field of database processing, in particular to a data table cleaning method, a data table cleaning device and a server.

Background

Over time and with the increase of the business volume, the core business database system of the enterprise generates a plurality of logical conceptual historical tables, temporary tables, intermediate tables and the like, and the tables may lose use value and occupy the storage space of the database. Therefore, the data table needs to be cleaned in time, and the storage space is released to ensure the normal operation of the database program.

In the prior art, cleaning of data tables relies mainly on the experience of a database administrator. Specifically, the administrator evaluates the usage and health of the database by monitoring database indexes and table data in the user space, and cleans the data tables in the database according to the evaluation result.

However, when data tables are cleaned manually, the monitored database has too many indexes and their relative importance is difficult to judge. As a result, the monitored indexes may not reflect the actual operating condition of the database, the evaluation of the data tables is inaccurate, and the effect of the cleaning suffers.

Disclosure of Invention

The invention aims to provide a data table cleaning method, a data table cleaning device and a server, so as to improve the accuracy of cleaning a data table.

In a first aspect, the present invention provides a method for cleaning a data table, including:

acquiring an index set of a database in a preset time period, wherein the index set comprises metadata characteristic parameters of all data tables and relation characteristic parameters of the database;

inputting the index set into a preset random forest model to obtain evaluation values of all data tables in the database, wherein the preset random forest model is obtained by training according to storage codes of the database, metadata information of all data tables in the database and the evaluation value of each data table;

and cleaning the data table with the minimum evaluation value in the database to obtain the cleaned database.

In one possible design, the training step of the preset random forest model includes:

acquiring storage codes of a database, metadata information of all data tables in the database and an evaluation value of each data table, and analyzing the storage codes of the database to acquire relational characteristic information;

performing data processing on the relational characteristic information and the metadata information of all the data tables in the database to obtain metadata characteristic parameters and relational characteristic parameters, and taking the metadata characteristic parameters, the relational characteristic parameters and the evaluation value of each data table as a sample set;

dividing the sample set into a training set and a verification set according to a preset proportion, modeling the training set by adopting a random forest algorithm to obtain a random forest model, and verifying the random forest model according to the verification set.

In one possible design, after the obtaining the index set of the database within the preset time period, the method further includes:

and preprocessing the index set according to the one-hot coding, and changing discrete parameters in metadata characteristic parameters and relationship characteristic parameters in the index set into continuous characteristic parameters.

In one possible design, the parsing the stored codes of the database to obtain the relationship characteristic information includes:

parsing the storage codes by using a graph database to obtain a data table relation map of the database;

and evaluating the data table relation map according to the audience quantity, the updating magnitude and the updating frequency of all the data tables in the database to obtain the relational characteristic information of the database.

In one possible design, the relationship characteristic parameter includes at least one of an upstream table number, a downstream table number, an upstream table update date number, a downstream table attribution storage validity, a number of existing days, a date update date number, a table name category, a data line number, a data table size, a white list existence, an attribution storage validity, a date average increase, a week average increase, a month average increase, a day increase ring ratio, a week increase ring ratio, and a month increase ring ratio, and the metadata characteristic parameter includes at least one of a user name, a table name, a database name, a generation date, a table update time, a data table line number, and a data table size.

In a second aspect, an embodiment of the present invention provides a data table cleaning apparatus, including:

an acquisition module, configured to acquire an index set of a database in a preset time period, wherein the index set comprises metadata characteristic parameters of all data tables and relation characteristic parameters of the database;

the input module is used for inputting the index set into a preset random forest model to obtain the evaluation values of all data tables in the database, wherein the preset random forest model is obtained by training according to the storage codes of the database, the metadata information of all data tables in the database and the evaluation value of each data table;

and the cleaning module is used for cleaning the data table with the minimum evaluation value in the database to obtain the cleaned database.

In one possible design, the data table cleaner further includes:

and the preprocessing module is used for preprocessing the index set according to the one-hot code and changing discrete parameters in metadata characteristic parameters and relationship characteristic parameters in the index set into continuous characteristic parameters.

In a third aspect, an embodiment of the present invention provides a server, including a memory and at least one processor; the memory is used for storing computer-executable instructions; the at least one processor is configured to execute the computer-executable instructions stored in the memory, so that the at least one processor performs the data table cleaning method described in the first aspect or any possible design of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions; when a processor executes the computer-executable instructions, the data table cleaning method described in the first aspect or any possible design of the first aspect is implemented.

In a fifth aspect, an embodiment of the present invention provides a computer program product, which includes a computer program; when the computer program is executed by a processor, it implements the data table cleaning method described in the first aspect or any possible design of the first aspect.

According to the data table cleaning method, the data table cleaning device and the data table cleaning server, model training is carried out by adopting a random forest algorithm according to the storage codes of the database, the metadata information of all data tables in the database and the evaluation value of each data table, a random forest model which implies the relation characteristics among the data tables in the database is obtained, the evaluation values of all the data tables in the database are predicted, and the accuracy of cleaning the data tables in the database is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic diagram of a visualization structure of a relational database;

FIG. 2 is a first flowchart of a data table cleaning method according to an embodiment of the present invention;

FIG. 3 is a second flowchart of a data table cleaning method according to an embodiment of the present invention;

FIG. 4 is a third flowchart of a data table cleaning method according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a data table cleaning apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a hardware structure of a server according to an embodiment of the present invention.

Detailed Description

The above drawings illustrate specific embodiments of the invention, which are described in more detail below. The drawings and the description are not intended to limit the scope of the inventive concept in any way, but to explain the inventive concept to those skilled in the art with reference to specific embodiments.

Over time and with the increase of business volume, the core business database system of an enterprise generates a plurality of logical conceptual historical tables, temporary tables, intermediate tables and the like. These tables may lose their use value, occupy the storage space of the database in a server, and seriously affect the normal operation of the database. Therefore, cleaning up data tables and freeing storage space so that the database program operates properly is an essential skill for every database administrator.

At present, the operation and maintenance of the database storage space in the server depend mainly on the experience of a database administrator. The administrator evaluates the usage and health of the database by monitoring the database indexes and the table data in the user space; this evaluation mechanism is called an "expert model". The expert model relies on a database administrator with many years of experience to select data tables manually and to set rules and models according to personal experience, so as to judge the usage of the current database and which tables can be cleaned. Exemplarily, Fig. 1 is a schematic view of a visualization structure of a relational database, which includes information nodes, data flow lines, cleaning rule nodes, conversion rule nodes, and data archiving and destroying rule nodes. Cleaning a relational database with an expert model based on personal experience has the following disadvantages: the association relationships among relational database indexes, such as those among databases E1, T1, R1 and the like in Fig. 1, cannot be grasped, and it is difficult to determine which indexes should be used to analyze the operating condition of the database. Data tables are therefore easily cleaned by mistake, and the approach suffers from low reliability and low efficiency.

In order to solve the technical problem, the embodiment of the invention trains a random forest model for predicting the evaluation value of the data table based on a random forest algorithm by utilizing the characteristic of the mutual relation between the data tables in the relational database, objectively and accurately prejudges whether the data table needs to be cleaned, and can effectively improve the cleaning accuracy of the relational database.

Fig. 2 is a first flowchart of a data table cleaning method according to an embodiment of the present invention. The execution subject of the embodiment may be a server storing a database. As shown in fig. 2, the data table cleaning method includes the following steps:

S201: acquiring an index set of the database in a preset time period, wherein the index set comprises metadata characteristic parameters of all data tables and relational characteristic parameters of the database.

In the embodiment of the invention, index data of the database in a preset time period are collected, and an index set of the database is obtained. Specifically, the metadata characteristic parameters of all the data tables and the relationship characteristic parameters of the database are obtained by collecting update information of the database, data flow line information, data flow quantity, change information and the like of all the data tables. Illustratively, the relational characteristic parameters include at least one of an upstream table number, a downstream table number, an upstream table update-to-date number, whether a downstream table attribution storage is effective, a number of existing days, an update-to-date number, a table name category, a data line number, a data table size, whether a white list exists, whether an attribution storage is effective, a daily average growth, a weekly average growth, a monthly average growth, a daily increase ring ratio, a weekly increase ring ratio, and a monthly increase ring ratio, and the metadata characteristic parameters include at least one of a user name, a table name, a database name, a generation date, a table update time, a data table line number, and a data table size.
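Two of the growth-type parameters listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's own computation: it assumes that "daily average growth" is the mean day-over-day increase in row count, and that the "ring ratio" is the period-over-period change of that increase. The function name `growth_features` and the dictionary keys are hypothetical.

```python
from statistics import mean


def growth_features(daily_rows):
    """Derive growth parameters from one table's row counts on consecutive days.

    daily_rows is ordered oldest first. Both definitions below are assumptions
    made for illustration; the source does not give formulas for them.
    """
    # Day-over-day increases in row count.
    deltas = [later - earlier for earlier, later in zip(daily_rows, daily_rows[1:])]
    daily_avg_growth = mean(deltas) if deltas else 0.0
    # Period-over-period change of the latest increase relative to the previous one.
    # Undefined (None) with fewer than two increases or a zero previous increase.
    if len(deltas) >= 2 and deltas[-2] != 0:
        day_over_day_ratio = (deltas[-1] - deltas[-2]) / deltas[-2]
    else:
        day_over_day_ratio = None
    return {
        "daily_avg_growth": daily_avg_growth,
        "day_over_day_ratio": day_over_day_ratio,
    }


features = growth_features([100, 110, 130, 160])
print(features)
```

Weekly and monthly variants would follow the same pattern over weekly or monthly row counts.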

In one possible implementation, the index set may be preprocessed with one-hot encoding, which changes the discrete parameters among the metadata characteristic parameters and the relationship characteristic parameters in the index set into continuous characteristic parameters. Since the metadata characteristic parameters and the relationship characteristic parameters may not all be continuous values, the discrete features in the index set are encoded with one-hot encoding, after which each dimension of the encoded characteristic parameters can be regarded as a continuous value. One-hot encoding addresses the difficulty classifiers have in handling categorical attribute data, and to some extent also expands the set of characteristic parameters.
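The one-hot preprocessing described above can be sketched in plain Python. The helper `one_hot_encode` and the column names `table_category` and `row_count` are hypothetical, chosen only to illustrate how one discrete parameter becomes several 0/1 dimensions.

```python
def one_hot_encode(rows, column):
    """Replace one discrete column with a 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other parameter unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator dimension per category.
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


tables = [
    {"table_category": "tmp", "row_count": 120},
    {"table_category": "history", "row_count": 98000},
    {"table_category": "tmp", "row_count": 15},
]
encoded = one_hot_encode(tables, "table_category")
print(encoded[0])
```

Each category adds one dimension, which is the sense in which the encoding also expands the characteristic parameters.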

S202: inputting the index set into a preset random forest model to obtain the evaluation values of all data tables in the database, wherein the preset random forest model is obtained by training according to the storage codes of the database, the metadata information of all the data tables in the database and the evaluation value of each data table.

In the embodiment of the invention, the preset random forest model is obtained by training according to the storage codes of the database, the metadata information of all the data tables in the database and the evaluation value of each data table, and the preset random forest model not only implies the basic situation of all the data tables in the database, but also reflects the data relation characteristics among all the data tables in the database. The data relation characteristics comprise data flow line information and data flow quantity among the data tables and data relation characteristics among the data tables and the database. And inputting the index set into a preset random forest model to obtain the evaluation values of all data tables in the database, namely the preset random forest model can predict the evaluation values of all data tables in the database at the current or a future moment.

S203: cleaning the data table with the minimum evaluation value in the database to obtain the cleaned database.

In the embodiment of the invention, the evaluation values of all data tables in the database are obtained from the preset random forest model. The higher the evaluation value of a data table, the tighter its relationship with the other data tables and the database. The lower the evaluation value, the smaller the role of the data table in the database and the lower the frequency of its data interaction with other data tables; such a data table needs to be cleaned in time so that it does not occupy database space.
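Step S203 reduces to picking the table with the smallest predicted evaluation value. The sketch below assumes the predictions are available as a mapping from table name to evaluation value. The optional `whitelist` argument is an assumption added for illustration: the source lists whitelist membership only as a model feature and does not describe it as a filter.

```python
def select_table_to_clean(evaluations, whitelist=frozenset()):
    """Return the name of the table with the minimum evaluation value.

    evaluations maps table name -> predicted evaluation value.
    whitelist (hypothetical) names tables that must never be selected.
    Returns None when no candidate remains.
    """
    candidates = {
        table: value for table, value in evaluations.items() if table not in whitelist
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


evaluations = {"orders": 0.92, "tmp_export_2020": 0.05, "orders_hist": 0.40}
print(select_table_to_clean(evaluations))
```

In practice the actual drop or archive statement would be issued only after the selected name has been reviewed or logged.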

It can be known from the above embodiment that the model training is performed by using the random forest algorithm according to the storage codes of the database, the metadata information of all the data tables in the database, and the evaluation value of each data table, so as to obtain a random forest model which implies the relationship characteristics between the data tables in the database, predict the evaluation values of all the data tables in the database, and improve the accuracy of cleaning the data tables in the database.

Fig. 3 is a second flowchart of a data table cleaning method according to an embodiment of the present invention. As shown in Fig. 3, the training process of the preset random forest model in the embodiment of the present invention includes the following steps:

S301: acquiring storage codes of the database, metadata information of all data tables in the database and an evaluation value of each data table, and parsing the storage codes of the database to acquire relational characteristic information.

In this step, the storage code of the database and the metadata information of all the data tables in the database are collected, and each data table is evaluated according to the expert model of the prior art to obtain its evaluation value. The storage code of the database implies the relational characteristic information of the data, such as information on data generation, processing and fusion, circulation and application. By parsing the storage code of the database, the upstream sources and downstream flows of each data table can be traced, and the relational characteristic information can be obtained.

S302: performing data processing on the relation characteristic information and the metadata information of all data tables in the database to obtain metadata characteristic parameters and relation characteristic parameters, and taking the metadata characteristic parameters, the relation characteristic parameters and the evaluation value of each data table as a sample set.

In this step, the relational characteristic information and the metadata information of all the data tables in the database are subjected to data indexing and converted into metadata characteristic parameters and relational characteristic parameters. And preprocessing the index set according to the one-hot coding, changing discrete parameters in the metadata characteristic parameters and the relationship characteristic parameters in the index set into continuous characteristic parameters, and taking the metadata characteristic parameters, the relationship characteristic parameters and the evaluation value of each data table as a sample set.

S303: dividing the sample set into a training set and a verification set according to a preset proportion, modeling the training set by adopting a random forest algorithm to obtain a random forest model, and verifying the random forest model according to the verification set.

In this step, illustratively, the sample set is divided into a training set and a verification set at a ratio of 8:2, and the training set is modeled with a random forest algorithm to obtain a random forest model. Specifically, a random forest regression algorithm is adopted: following the minimum mean square error principle, an arbitrary division point is selected for each feature to divide the training set into a first data set D1 and a second data set D2, and the feature and division point for which the sum of the mean square errors of D1 and D2 is smallest are obtained according to formula (1). A random forest model is then established from the optimal division point of each feature obtained with formula (1), and the accuracy of the random forest model is verified on the verification set.
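Formula (1) did not survive in this text. Assuming it is the standard least-squares splitting criterion of CART regression trees, which matches the symbols defined next, it would read:

```latex
\min_{x \in A,\; s} \left[ \sum_{i \in D_1(x,\,s)} \left( y_i - c_1 \right)^2
  + \sum_{i \in D_2(x,\,s)} \left( y_i - c_2 \right)^2 \right] \tag{1}
```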

Wherein A is the feature set, x represents a feature, s is a division point of the feature, y_i is the evaluation value of the i-th data table, c1 is the mean of the evaluation values of all the data tables in the first data set D1, and c2 is the mean of the evaluation values of all the data tables in the second data set D2.

According to this embodiment, a random forest model that implicitly captures the relational characteristics of the data tables is trained on the basis of the random forest algorithm. The evaluation value of a data table can be accurately predicted with the trained random forest model, which effectively improves the cleaning accuracy of the relational database.
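The division-point search that formula (1) refers to can be sketched as an exhaustive scan. This is a minimal illustration that assumes the usual least-squares criterion: for each feature and each candidate division point s, the samples are split into D1 (value <= s) and D2 (value > s), and the split with the smallest summed squared deviation from the means c1 and c2 is kept. The names `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of values from their mean (0.0 if empty)."""
    if not values:
        return 0.0
    centre = sum(values) / len(values)
    return sum((y - centre) ** 2 for y in values)


def best_split(samples, targets):
    """Return (feature_index, division_point, loss) minimising SSE(D1) + SSE(D2).

    samples is a list of equal-length feature lists; targets holds the
    evaluation value of each sample.
    """
    best = None
    for j in range(len(samples[0])):
        values = sorted({row[j] for row in samples})
        # Candidate division points: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            s = (low + high) / 2
            d1 = [y for row, y in zip(samples, targets) if row[j] <= s]
            d2 = [y for row, y in zip(samples, targets) if row[j] > s]
            loss = sse(d1) + sse(d2)
            if best is None or loss < best[2]:
                best = (j, s, loss)
    return best


X = [[0, 5], [1, 5], [8, 6], [9, 6]]
y = [0.1, 0.2, 0.8, 0.9]
feature, threshold, loss = best_split(X, y)
print(feature, threshold)
```

A full random forest would repeat this search on bootstrap samples, typically over a random subset of the features at each node, and average the resulting trees.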
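The 8:2 division of the sample set in S303 and the check against the verification set can be sketched as below. The seeded shuffle and the choice of mean squared error as the verification metric are assumptions made for illustration; the source only states that the model is verified on the verification set.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of samples and split it into (training, verification)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def mean_squared_error(predicted, actual):
    """Average squared difference between predicted and actual evaluation values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Each sample pairs a feature dictionary with an evaluation value.
samples = [({"row_count": i}, i / 10) for i in range(10)]
train, validation = split_samples(samples)
print(len(train), len(validation))
```

The model would be fitted on `train` only, and `mean_squared_error` would compare its predictions for `validation` with the stored evaluation values.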

Fig. 4 is a third flowchart of a data table cleaning method according to an embodiment of the present invention. As shown in Fig. 4, in the training process of the preset random forest model shown in Fig. 3, parsing the storage code of the database in S301 to obtain the relational characteristic information specifically includes the following steps:

S401: parsing the storage codes by using a graph database to obtain a data table relation map of the database.

In the embodiment of the invention, illustratively, the storage code can be parsed by using Neo4j graph database technology, so that the upstream sources and downstream flows of each data table can be traced and a data table relation map of the database can be obtained for evaluating the value of the data in the data tables.

S402: evaluating the data table relation map according to the audience quantity, the updating magnitude and the updating frequency of all the data tables in the database to obtain the relational characteristic information of the database.
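The idea of deriving a data table relation map from stored code can be sketched without a graph database. The example below swaps Neo4j for a plain Python dictionary and assumes, purely for illustration, that the stored code consists of simple `INSERT INTO ... SELECT ... FROM ...` statements. The regular expressions are deliberately naive and the table names are hypothetical.

```python
import re
from collections import defaultdict

# Naive patterns: the target of an INSERT and every table after FROM or JOIN.
INSERT_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def build_lineage(statements):
    """Return {source_table: set of downstream tables} from INSERT ... SELECT code."""
    downstream = defaultdict(set)
    for sql in statements:
        match = INSERT_RE.search(sql)
        if match is None:
            continue  # Statement does not write to a table; no edge to record.
        target = match.group(1)
        for source in SOURCE_RE.findall(sql):
            if source != target:
                downstream[source].add(target)
    return dict(downstream)


stored_code = [
    "INSERT INTO daily_sales SELECT * FROM orders JOIN customers ON orders.cid = customers.id",
    "INSERT INTO sales_report SELECT region, SUM(amount) FROM daily_sales GROUP BY region",
]
lineage = build_lineage(stored_code)
print(lineage)
```

Real stored procedures would need a proper SQL parser; in a graph database each table would become a node and each source-to-target pair a directed edge.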

In the embodiment of the present invention, the value of the data in a data table can be evaluated in three aspects. First aspect: the audience quantity of the data table. On the visualization structure diagram of the relational database, the data outflow nodes on the right represent audiences, namely the demanders of the data table; the more demanders a data table has, the greater its value. If a data table has no branch and no data outflow node on the right, it has lost its use value, and its evaluation value is predicted to be low. Second aspect: the updating magnitude of the data table. In the visualization structure diagram of the relational database, the thicker a data flow line is, the larger the updating magnitude of the data table, and the higher the evaluation value that can be predicted for it. Third aspect: the updating frequency of the data table. The more frequently a data table is updated, the higher the application value of its data, and the higher the evaluation value that can be predicted for it. By evaluating the data table relation map according to the audience quantity, the updating magnitude and the updating frequency of the data tables, the relational characteristic information of the database can be obtained.
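The three aspects above can be collected into per-table relational features. The sketch assumes the data table relation map is available as a mapping from each table to its downstream tables, and that update magnitude and update frequency have been measured separately as rows written and number of updates in the observation window. All names and numbers are hypothetical.

```python
def relation_features(lineage, update_rows, update_counts):
    """Build {table: {"audience", "update_magnitude", "update_frequency"}}.

    lineage maps table -> set of downstream tables (its audience).
    update_rows maps table -> rows written in the observation window.
    update_counts maps table -> number of updates in the observation window.
    """
    tables = set(lineage) | set(update_rows) | set(update_counts)
    for targets in lineage.values():
        tables |= set(targets)
    return {
        table: {
            "audience": len(lineage.get(table, ())),
            "update_magnitude": update_rows.get(table, 0),
            "update_frequency": update_counts.get(table, 0),
        }
        for table in sorted(tables)
    }


lineage = {"orders": {"daily_sales"}, "daily_sales": {"sales_report", "dashboard"}}
update_rows = {"orders": 5000, "daily_sales": 300}
update_counts = {"orders": 30, "daily_sales": 30, "sales_report": 4}
features = relation_features(lineage, update_rows, update_counts)
print(features["daily_sales"])
```

A table such as `dashboard` here, with no audience and no recorded updates, is the kind of table the text expects to receive a low evaluation value.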

It can be seen from the foregoing embodiment that, by analyzing the stored codes by using the graph database, and evaluating the data table relationship maps obtained after the analysis according to the audience quantities, the update magnitudes, and the update frequencies of all the data tables in the database, the relationship characteristic information implied between all the data tables in the database and the relationship characteristic information between the database and the data tables can be obtained, and the relationship characteristic information is used as training data, so that the relationship characteristics of the data tables in the database are embodied in the trained random forest model, the accuracy of predicting the evaluation values of all the data tables in the database is improved, and the efficiency of cleaning the data tables in the database is improved.

Fig. 5 is a schematic structural diagram of a data table cleaning apparatus according to an embodiment of the present invention. As shown in fig. 5, the data table cleaning apparatus includes: an obtaining module 501, configured to obtain an index set of a database in a preset time period, where the index set includes metadata characteristic parameters of all data tables and relationship characteristic parameters of the database; an input module 502, configured to input the index set into a preset random forest model, and obtain evaluation values of all data tables in the database, where the preset random forest model is obtained by training according to a storage code of the database, metadata information of all data tables in the database, and the evaluation value of each data table; and a cleaning module 503, configured to clean the data table with the smallest evaluation value in the database, and obtain a cleaned database.

In this embodiment, the data table cleaning apparatus may adopt the method described in the above embodiments, and the technical solution and the technical effect thereof are similar to each other and are not described herein again.

In a possible implementation manner, the data table cleaning apparatus further includes: and the preprocessing module is used for preprocessing the index set according to the one-hot code and changing discrete parameters in metadata characteristic parameters and relationship characteristic parameters in the index set into continuous characteristic parameters. The preprocessing module may adopt the method described in the above embodiments, and the technical solutions and technical effects thereof are similar and will not be described herein again.

Fig. 6 is a schematic diagram of a hardware structure of a server according to an embodiment of the present invention. As shown in fig. 6, the server of the present embodiment includes: a processor 601 and a memory 602; wherein

A memory 602 for storing computer-executable instructions;

the processor 601 is configured to execute the computer execution instructions stored in the memory to implement the steps performed by the server in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.

Alternatively, the memory 602 may be separate or integrated with the processor 601.

When the memory 602 is provided separately, the server further comprises a bus 603 for connecting the memory 602 and the processor 601.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer execution instruction is stored in the computer-readable storage medium, and when a processor executes the computer execution instruction, the data table cleaning method is implemented as described above.

An embodiment of the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for cleaning a data table as described above is implemented.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to implement the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.

The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.

An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.

Those of ordinary skill in the art will understand that all or some of the steps of the above-described method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for data table cleaning, comprising:

2. The method of claim 1, wherein the step of training the pre-set random forest model comprises:

3. The method according to claim 1, further comprising, after the obtaining of the index set of the database within the preset time period:

4. The method of claim 2, wherein parsing the stored code of the database to obtain the relationship characteristic information comprises:

5. The method of any one of claims 1 to 4, wherein the relational characteristic parameters comprise at least one of: the number of upstream tables, the number of downstream tables, the number of days since an upstream table was last updated, whether the attribution storage of a downstream table is in effect, the number of days in existence, the number of days since the last update, a table name category, the number of data rows, a data table size, whether the table is on a white list, whether the attribution storage is in effect, a daily average growth, a weekly average growth, a monthly average growth, a day-over-day increment ratio, a week-over-week increment ratio, and a month-over-month increment ratio; and wherein the metadata characteristic parameters comprise at least one of a user name, a table name, a database name, a generation date, a table update time, a number of data table rows, and a data table size.

6. A data table cleaning apparatus, comprising:

7. The apparatus of claim 6, wherein the data table cleaning apparatus further comprises:

8. A server, comprising a memory and at least one processor;

the memory is configured to store computer-executable instructions;

the at least one processor is configured to execute the computer-executable instructions stored in the memory, causing the at least one processor to perform the data table cleaning method as claimed in any one of claims 1 to 5.

9. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the data table cleaning method according to any one of claims 1 to 5.

10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the data table cleaning method according to any one of claims 1 to 5.