CN113268477A - Data table cleaning method and device and server - Google Patents


Info

Publication number
CN113268477A
Authority
CN
China
Prior art keywords
database
data table
data
characteristic parameters
random forest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110633592.0A
Other languages
Chinese (zh)
Other versions
CN113268477B (en)
Inventor
杨雄威
韦星宁
李奕锴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd filed Critical China United Network Communications Group Co Ltd
Priority to CN202110633592.0A priority Critical patent/CN113268477B/en
Publication of CN113268477A publication Critical patent/CN113268477A/en
Application granted granted Critical
Publication of CN113268477B publication Critical patent/CN113268477B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a data table cleaning method, a device and a server, wherein the method comprises the following steps: acquiring an index set of a database in a preset time period, inputting the index set into a preset random forest model, acquiring evaluation values of all data tables in the database, and finally cleaning the data table with the minimum evaluation value in the database to acquire the cleaned database. Model training is carried out by adopting a random forest algorithm according to the storage codes of the database, the metadata information of all the data tables in the database and the evaluation value of each data table, so that a random forest model which implies the relation characteristics among the data tables in the database is obtained, the evaluation values of all the data tables in the database are predicted, and the accuracy of cleaning the data tables is improved.

Description

Data table cleaning method and device and server
Technical Field
The invention relates to the technical field of database processing, in particular to a data table cleaning method, a data table cleaning device and a server.
Background
Over time and with the increase of the business volume, the core business database system of the enterprise generates a plurality of logical conceptual historical tables, temporary tables, intermediate tables and the like, and the tables may lose use value and occupy the storage space of the database. Therefore, the data table needs to be cleaned in time, and the storage space is released to ensure the normal operation of the database program.
In the prior art, data tables are cleaned mainly on the basis of a database administrator's experience. Specifically, the administrator evaluates the usage and health of the database by monitoring database indexes and table data in the user space, and cleans the data tables in the database according to the evaluation result.
However, when data tables are cleaned manually in this way, the number of monitored database indexes is large and their relative importance is difficult to judge, so the monitored indexes may not reflect the actual operating condition of the database. As a result, the evaluation of the data tables in the database is inaccurate, which impairs the effect of cleaning the data tables.
Disclosure of Invention
The invention aims to provide a data table cleaning method, a data table cleaning device and a server, so as to improve the accuracy of cleaning a data table.
In a first aspect, the present invention provides a method for cleaning a data table, including:
acquiring an index set of a database in a preset time period, wherein the index set comprises metadata characteristic parameters of all data tables and relation characteristic parameters of the database;
inputting the index set into a preset random forest model to obtain evaluation values of all data tables in the database, wherein the preset random forest model is obtained by training according to storage codes of the database, metadata information of all data tables in the database and the evaluation value of each data table;
and cleaning the data table with the minimum evaluation value in the database to obtain the cleaned database.
In one possible design, the training step of the preset random forest model includes:
acquiring storage codes of a database, metadata information of all data tables in the database and an evaluation value of each data table, and analyzing the storage codes of the database to acquire relational characteristic information;
performing data processing on the relational characteristic information and the metadata information of all the data tables in the database to obtain metadata characteristic parameters and relational characteristic parameters, and taking the metadata characteristic parameters, the relational characteristic parameters and the evaluation value of each data table as a sample set;
dividing the sample set into a training set and a verification set according to a preset proportion, modeling the training set by adopting a random forest algorithm to obtain a random forest model, and validating the random forest model on the verification set.
In one possible design, after the obtaining the index set of the database within the preset time period, the method further includes:
and preprocessing the index set according to the one-hot coding, and changing discrete parameters in metadata characteristic parameters and relationship characteristic parameters in the index set into continuous characteristic parameters.
In one possible design, the parsing the stored codes of the database to obtain the relationship characteristic information includes:
parsing the storage codes by using a graph database to obtain a data table relationship graph of the database;
and evaluating the data table relationship graph according to the audience size, the update magnitude and the update frequency of all the data tables in the database to obtain the relationship characteristic information of the database.
In one possible design, the relationship characteristic parameters include at least one of: the number of upstream tables, the number of downstream tables, the number of days since the upstream table was last updated, whether the storage to which the downstream table belongs is valid, the number of days the table has existed, the number of days since the table was last updated, the table name category, the number of data rows, the data table size, whether the table is on a white list, whether the storage to which the table belongs is valid, the daily average growth, the weekly average growth, the monthly average growth, the day-over-day growth ratio, the week-over-week growth ratio, and the month-over-month growth ratio; and the metadata characteristic parameters include at least one of: a user name, a table name, a database name, a generation date, a table update time, the number of data table rows, and the data table size.
In a second aspect, an embodiment of the present invention provides a data table cleaning apparatus, including:
the acquisition module is used for acquiring an index set of a database in a preset time period, wherein the index set comprises metadata characteristic parameters of all data tables and relation characteristic parameters of the database;
the input module is used for inputting the index set into a preset random forest model to obtain the evaluation values of all data tables in the database, wherein the preset random forest model is obtained by training according to the storage codes of the database, the metadata information of all data tables in the database and the evaluation value of each data table;
and the cleaning module is used for cleaning the data table with the minimum evaluation value in the database to obtain the cleaned database.
In one possible design, the data table cleaner further includes:
and the preprocessing module is used for preprocessing the index set according to the one-hot code and changing discrete parameters in metadata characteristic parameters and relationship characteristic parameters in the index set into continuous characteristic parameters.
In a third aspect, an embodiment of the present invention provides a server, including a memory and at least one processor; the memory is used for storing computer-executable instructions; the at least one processor is configured to execute the computer-executable instructions stored in the memory, so that the at least one processor performs the data table cleaning method according to the first aspect or any possible design of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions; when a processor executes the computer-executable instructions, the data table cleaning method according to the first aspect or any possible design of the first aspect is implemented.
In a fifth aspect, an embodiment of the present invention provides a computer program product, which includes a computer program; when the computer program is executed by a processor, the data table cleaning method according to the first aspect or any possible design of the first aspect is implemented.
According to the data table cleaning method, the data table cleaning device and the data table cleaning server, model training is carried out by adopting a random forest algorithm according to the storage codes of the database, the metadata information of all data tables in the database and the evaluation value of each data table, a random forest model which implies the relation characteristics among the data tables in the database is obtained, the evaluation values of all the data tables in the database are predicted, and the accuracy of cleaning the data tables in the database is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of a visualization structure of a relational database;
FIG. 2 is a first flowchart of a data table cleaning method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a data table cleaning method according to an embodiment of the present invention;
FIG. 4 is a flow chart of a data table cleaning method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a data table cleaning apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a hardware structure of a server according to an embodiment of the present invention.
Detailed Description
Certain embodiments of the invention are shown in the above figures and are described in more detail below. The drawings and the description are not intended to limit the scope of the inventive concept in any way, but rather to explain the inventive concept to those skilled in the art by reference to specific embodiments.
Over time and with the increase of business volume, a core business database system of an enterprise generates a plurality of logical conceptual historical tables, temporary tables, intermediate tables and the like, and the tables may lose use value, occupy the storage space of the database in a server, and seriously affect the normal operation of the database. Therefore, it is the necessary work skill of each database administrator to clean up the data tables and free up storage space to ensure that the database program operates properly.
At present, the operation and maintenance of the database storage space in the server mainly depend on the experience of a database administrator. The administrator evaluates the usage and health of the database by monitoring the database indexes and the table data in the user space; this evaluation mechanism is called an "expert model". The expert model relies on a database administrator with many years of experience to select data tables manually and to set rules and models according to personal experience, so as to judge the current usage of the database and which tables can be cleaned. Exemplarily, FIG. 1 is a schematic view of a visualization structure of a relational database, which includes information nodes, data flow lines, cleaning rule nodes, conversion rule nodes, and data archiving and destroying rule nodes. Cleaning a relational database with an expert model based on personal experience has the following disadvantages: the association relationships among relational database indexes, such as the association relationships among databases E1, T1, R1 and the like in FIG. 1, cannot be grasped, and it is difficult to determine which indexes should be used to analyze the operating condition of the database. This easily causes data tables to be cleaned by mistake and results in low reliability and low efficiency.
In order to solve the technical problem, the embodiment of the invention trains a random forest model for predicting the evaluation value of the data table based on a random forest algorithm by utilizing the characteristic of the mutual relation between the data tables in the relational database, objectively and accurately prejudges whether the data table needs to be cleaned, and can effectively improve the cleaning accuracy of the relational database.
Fig. 2 is a first flowchart of a data table cleaning method according to an embodiment of the present invention. The execution subject of the embodiment may be a server storing a database. As shown in fig. 2, the data table cleaning method includes the following steps:
S201: Acquire an index set of the database in a preset time period, wherein the index set comprises metadata characteristic parameters of all data tables and relational characteristic parameters of the database.
In the embodiment of the invention, index data of the database in a preset time period are collected to obtain the index set of the database. Specifically, the metadata characteristic parameters of all the data tables and the relationship characteristic parameters of the database are obtained by collecting the update information of the database and the data flow line information, data flow quantity, change information and the like of all the data tables. Illustratively, the relational characteristic parameters include at least one of: the number of upstream tables, the number of downstream tables, the number of days since the upstream table was last updated, whether the storage to which the downstream table belongs is valid, the number of days the table has existed, the number of days since the table was last updated, the table name category, the number of data rows, the data table size, whether the table is on a white list, whether the storage to which the table belongs is valid, the daily average growth, the weekly average growth, the monthly average growth, the day-over-day growth ratio, the week-over-week growth ratio, and the month-over-month growth ratio. The metadata characteristic parameters include at least one of: a user name, a table name, a database name, a generation date, a table update time, the number of data table rows, and the data table size.
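As an illustration of how the daily average growth and the daily period-over-period growth ratio in the list above might be derived, the following minimal sketch works from a series of daily row counts. The text does not define these parameters precisely, so the definitions used here (mean of daily increments, and relative change between the last two daily increments) and the function name `growth_features` are assumptions.

```python
from typing import Dict, List, Optional


def growth_features(daily_rows: List[int]) -> Dict[str, Optional[float]]:
    """Derive growth parameters from row counts on consecutive days (oldest first).

    Assumed definitions, not taken from the patent text:
    - daily average growth: mean of the day-to-day row increments
    - day-over-day growth ratio: relative change of the latest increment
      compared with the previous increment
    """
    if len(daily_rows) < 3:
        raise ValueError("need at least three daily row counts")
    increments = [later - earlier for earlier, later in zip(daily_rows, daily_rows[1:])]
    daily_average_growth = sum(increments) / len(increments)
    previous, latest = increments[-2], increments[-1]
    # The ratio is undefined when the previous increment is zero
    ratio = (latest - previous) / previous if previous else None
    return {
        "daily_average_growth": daily_average_growth,
        "day_over_day_growth_ratio": ratio,
    }


features = growth_features([1000, 1100, 1250, 1550])
print(features)
```

The weekly and monthly variants could be computed the same way from weekly or monthly row counts.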
In one possible implementation, the index set may be preprocessed according to one-hot encoding, and discrete parameters in the metadata characteristic parameters and the relationship characteristic parameters in the index set are changed into continuous characteristic parameters. Since the metadata characteristic parameters and the relationship characteristic parameters may not be completely continuous values, after the discrete features of the characteristic parameters in the index set are encoded by using one-hot encoding, the features of each dimension in the encoded characteristic parameters can be regarded as continuous values. The problem that the classifier cannot process attribute data well is solved by adopting the one-hot coding, and the function of expanding the characteristic parameters is also played to a certain extent.
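The one-hot preprocessing described above can be sketched in plain Python as follows. The field names (`table_name_category`, `on_white_list`, `row_count`) and the helper `one_hot_encode` are hypothetical and chosen only for illustration; the patent does not prescribe a data layout or a library.

```python
from typing import Dict, List, Sequence, Union

Value = Union[str, int, float]


def one_hot_encode(
    rows: Sequence[Dict[str, Value]], discrete_keys: Sequence[str]
) -> List[Dict[str, float]]:
    """Replace each discrete field with one 0/1 column per observed category."""
    # Collect the categories observed for every discrete field
    categories = {
        key: sorted({str(row[key]) for row in rows}) for key in discrete_keys
    }
    encoded = []
    for row in rows:
        out: Dict[str, float] = {}
        for key, value in row.items():
            if key in categories:
                # One column per category; exactly one of them is set to 1.0
                for category in categories[key]:
                    out[f"{key}={category}"] = 1.0 if str(value) == category else 0.0
            else:
                # Fields that are already numeric are passed through as floats
                out[key] = float(value)
        encoded.append(out)
    return encoded


index_set = [
    {"table_name_category": "tmp", "on_white_list": "no", "row_count": 1200},
    {"table_name_category": "history", "on_white_list": "yes", "row_count": 98000},
    {"table_name_category": "tmp", "on_white_list": "no", "row_count": 15},
]
encoded_rows = one_hot_encode(index_set, ["table_name_category", "on_white_list"])
print(encoded_rows[0])
```

Each discrete parameter becomes several numeric columns, which is the feature-expanding effect mentioned above.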
S202: Input the index set into a preset random forest model to obtain the evaluation values of all data tables in the database, wherein the preset random forest model is obtained by training according to the storage codes of the database, the metadata information of all the data tables in the database and the evaluation value of each data table.
In the embodiment of the invention, the preset random forest model is obtained by training according to the storage codes of the database, the metadata information of all the data tables in the database and the evaluation value of each data table, and the preset random forest model not only implies the basic situation of all the data tables in the database, but also reflects the data relation characteristics among all the data tables in the database. The data relation characteristics comprise data flow line information and data flow quantity among the data tables and data relation characteristics among the data tables and the database. And inputting the index set into a preset random forest model to obtain the evaluation values of all data tables in the database, namely the preset random forest model can predict the evaluation values of all data tables in the database at the current or a future moment.
S203: Clean the data table with the minimum evaluation value in the database to obtain the cleaned database.
In the embodiment of the invention, the evaluation values of all data tables in the database are obtained from the preset random forest model. The higher the evaluation value of a data table, the tighter its relationship with the other data tables and with the database. The lower the evaluation value, the smaller the role the data table plays in the database and the less frequently it exchanges data with other data tables; such a data table needs to be cleaned in time so that it does not occupy database space.
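Steps S202 and S203 can be sketched as follows. The trained random forest is replaced here by a hypothetical stand-in function `toy_predict`, and the table names and feature names are invented for illustration; only the selection of the table with the minimum evaluation value follows the text.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]


def pick_table_to_clean(
    index_set: Dict[str, Features],
    predict: Callable[[Features], float],
) -> Tuple[str, Dict[str, float]]:
    """Score every table and return the lowest-scoring table plus all scores."""
    scores = {table: predict(features) for table, features in index_set.items()}
    lowest = min(scores, key=lambda table: scores[table])
    return lowest, scores


def toy_predict(features: Features) -> float:
    # Hypothetical stand-in for the trained random forest model
    return 2.0 * features["downstream_tables"] + features["updates_per_week"]


index_set = {
    "rpt.sales": {"downstream_tables": 3.0, "updates_per_week": 7.0},
    "tmp.orders_backup": {"downstream_tables": 0.0, "updates_per_week": 0.0},
    "dw.orders_daily": {"downstream_tables": 1.0, "updates_per_week": 7.0},
}
table_to_clean, scores = pick_table_to_clean(index_set, toy_predict)
print(table_to_clean, scores)
```

How the selected table is actually cleaned (dropped, archived, or truncated) is left open by the text and is not modeled in this sketch.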
It can be known from the above embodiment that the model training is performed by using the random forest algorithm according to the storage codes of the database, the metadata information of all the data tables in the database, and the evaluation value of each data table, so as to obtain a random forest model which implies the relationship characteristics between the data tables in the database, predict the evaluation values of all the data tables in the database, and improve the accuracy of cleaning the data tables in the database.
Fig. 3 is a flowchart of a data table cleaning method according to an embodiment of the present invention. As shown in fig. 3, the training process of the preset random forest model in the embodiment of the present invention includes the following steps:
S301: Acquire the storage codes of the database, the metadata information of all data tables in the database and the evaluation value of each data table, and parse the storage codes of the database to obtain relational characteristic information.
In this step, the storage codes of the database and the metadata information of all the data tables in the database are collected, and the evaluation value of each data table is predicted according to the expert model in the prior art. The storage codes of the database imply the relational characteristic information of the data, such as information on data generation, processing and fusion, circulation and application. By parsing the storage codes of the database, the source and the flow of each data table can be traced, and the relational characteristic information can be obtained.
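A minimal sketch of tracing the source and flow of data tables from storage codes is given below. It assumes, purely for illustration, that the storage codes are SQL scripts made of `INSERT INTO ... SELECT ... FROM ... JOIN ...` statements; the text does not specify the form of the storage codes, and the regular expressions here do not handle subqueries, comments, or other SQL constructs.

```python
import re
from typing import Set, Tuple

# Assumption: each statement writes one target table and reads from
# tables named directly after FROM or JOIN.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql_script: str) -> Set[Tuple[str, str]]:
    """Return (upstream table, downstream table) edges found in the script."""
    edges = set()
    for statement in sql_script.split(";"):
        target = TARGET_RE.search(statement)
        if target is None:
            continue  # not a statement that writes a table
        for source in SOURCE_RE.findall(statement):
            edges.add((source.lower(), target.group(1).lower()))
    return edges


storage_code = """
INSERT INTO dw.orders_daily
SELECT * FROM ods.orders o JOIN ods.users u ON o.uid = u.id;
INSERT INTO rpt.sales
SELECT day, SUM(amount) FROM dw.orders_daily GROUP BY day;
"""
lineage = extract_lineage(storage_code)
print(sorted(lineage))
```

Each edge points from a table that is read to the table that is written, which is the upstream and downstream relationship the relational characteristic information is built on.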
S302: Perform data processing on the relation characteristic information and the metadata information of all data tables in the database to obtain metadata characteristic parameters and relation characteristic parameters, and take the metadata characteristic parameters, the relation characteristic parameters and the evaluation value of each data table as a sample set.
In this step, the relational characteristic information and the metadata information of all the data tables in the database are converted into indexes, yielding the metadata characteristic parameters and the relational characteristic parameters. The resulting index set is then preprocessed according to one-hot coding, so that discrete parameters among the metadata characteristic parameters and the relationship characteristic parameters are changed into continuous characteristic parameters. The metadata characteristic parameters, the relationship characteristic parameters and the evaluation value of each data table are taken as the sample set.
S303: Divide the sample set into a training set and a verification set according to a preset proportion, model the training set by adopting a random forest algorithm to obtain a random forest model, and validate the random forest model on the verification set.
In this step, illustratively, the sample set is divided into a training set and a verification set at a ratio of 8:2, and the training set is modeled by a random forest algorithm to obtain a random forest model. Specifically, a random forest regression algorithm is adopted. Following the minimum mean square error principle, an arbitrary division point is selected for each feature and divides the training set into a first data set D1 and a second data set D2; the feature and the division point for which the sum of the squared errors within D1 and D2 is smallest are then obtained according to formula (1) as the optimal division. The random forest model is built from the optimal division points obtained from formula (1), and its accuracy is validated on the verification set.
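$$\min_{x \in A,\, s} \left[ \sum_{y_i \in D_1(x, s)} (y_i - c_1)^2 + \sum_{y_i \in D_2(x, s)} (y_i - c_2)^2 \right] \qquad (1)$$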
Wherein A is the feature set, x represents a feature, s is a division point of the feature, y_i is the evaluation value of the i-th data table, c_1 is the mean of the evaluation values of all the data tables in the first data set D1, and c_2 is the mean of the evaluation values of all the data tables in the second data set D2.
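The 8:2 division and the search for the optimal division point in formula (1) can be sketched in plain Python as follows. This is a single split of a single regression tree under the assumption that samples go to D1 when the feature value is at most s; a full random forest would repeat the search recursively on bootstrap samples and random feature subsets. The names `split_samples`, `best_split` and the toy data are illustrative only.

```python
import random
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[List[float], float]  # (feature vector, evaluation value)


def split_samples(
    samples: Sequence[Sample], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle the sample set and divide it into training and verification sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def squared_error(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean (c_1 or c_2 in formula (1))."""
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values)


def best_split(samples: Sequence[Sample]) -> Optional[Tuple[int, float, float]]:
    """Return (feature index x, division point s, loss) minimizing formula (1)."""
    best = None
    for x in range(len(samples[0][0])):
        for s in sorted({features[x] for features, _ in samples}):
            d1 = [y for features, y in samples if features[x] <= s]
            d2 = [y for features, y in samples if features[x] > s]
            if not d1 or not d2:
                continue  # a valid division needs two non-empty data sets
            loss = squared_error(d1) + squared_error(d2)
            if best is None or loss < best[2]:
                best = (x, s, loss)
    return best


# Toy samples: [downstream table count, days since update] -> evaluation value
samples = [
    ([0.0, 5.0], 1.0),
    ([1.0, 3.0], 1.2),
    ([4.0, 4.0], 8.0),
    ([5.0, 2.0], 8.4),
    ([6.0, 6.0], 9.0),
]
training_set, verification_set = split_samples(samples)
# Shown on the full toy set so that the result is easy to check by hand
split = best_split(samples)
print(len(training_set), len(verification_set), split)
```

On the toy data the first feature separates the low and high evaluation values cleanly, so the search selects that feature with the division point between its second and third smallest values.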
According to the embodiment, a random forest model that implicitly captures the relational characteristics of the data tables is trained on the basis of the random forest algorithm; the evaluation value of a data table can be accurately predicted with the trained model, and the cleaning accuracy of the relational database can be effectively improved.
Fig. 4 is a flowchart of a data table cleaning method according to an embodiment of the present invention. As shown in fig. 4, in the process of training the preset random forest model shown in fig. 3, parsing the storage codes of the database in S301 to obtain the relationship characteristic information specifically includes the following steps:
S401: Parse the storage codes by using a graph database to obtain a data table relationship graph of the database.
In the embodiment of the invention, illustratively, the storage codes can be parsed by using Neo4j graph database technology, so that the source and the flow of each data table can be traced and a data table relationship graph of the database can be obtained for evaluating the value of the data in the data tables.
S402: Evaluate the data table relationship graph according to the audience size, the update magnitude and the update frequency of all the data tables in the database to obtain the relational characteristic information of the database.
In the embodiment of the present invention, the value of the data in a data table can be evaluated in three aspects. In a first aspect: the audience size of the data table. On the visualization structural diagram of the relational database, the data outflow nodes on the right represent the audience, namely the demanders of the data table; the more demanders a data table has, the greater its value. If a data table has no branch and no data outflow node on the right, the data table has lost its use value, and its evaluation value is predicted to be low. In a second aspect: the update magnitude of the data table. In the visualization structural diagram of the relational database, the thicker the data flow line, the larger the update magnitude of the data table, and the higher the evaluation value that can be predicted for it. In a third aspect: the update frequency of the data table. The more frequently a data table is updated, the higher the application value of its data, and the higher the evaluation value that can be predicted for it. By evaluating the data table relationship graph according to the audience size, the update magnitude and the update frequency of the data tables, the relational characteristic information of the database can be obtained.
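The first aspect above, audience size, can be sketched as counting upstream and downstream neighbors in the data table relationship graph. The graph is represented here as a plain list of (upstream, downstream) edges rather than a graph database, and the table names are invented for illustration; update magnitude and update frequency would need additional inputs that this sketch does not model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (upstream table, downstream table)


def relation_features(
    edges: Iterable[Edge], tables: Iterable[str]
) -> Dict[str, Dict[str, int]]:
    """Count upstream and downstream tables for every table in the graph."""
    upstream: Dict[str, Set[str]] = defaultdict(set)
    downstream: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return {
        table: {
            "upstream_tables": len(upstream[table]),
            # Number of demanders of the table, i.e. its audience size
            "downstream_tables": len(downstream[table]),
        }
        for table in tables
    }


edges = [
    ("ods.orders", "dw.orders_daily"),
    ("ods.users", "dw.orders_daily"),
    ("dw.orders_daily", "rpt.sales"),
]
tables = ["ods.orders", "ods.users", "dw.orders_daily", "rpt.sales", "tmp.orders_backup"]
features = relation_features(edges, tables)
# Tables without any downstream table have no audience
no_audience = [t for t in tables if features[t]["downstream_tables"] == 0]
print(features["dw.orders_daily"], no_audience)
```

A table with no downstream table is only a candidate for a low evaluation value; in the method described here the final judgment is left to the trained model.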
It can be seen from the foregoing embodiment that the storage codes are parsed by using the graph database, and the resulting data table relationship graph is evaluated according to the audience size, the update magnitude and the update frequency of all the data tables in the database. In this way, the relationship characteristic information implied among all the data tables in the database, as well as between the database and the data tables, can be obtained. Using this relationship characteristic information as training data embodies the relationship characteristics of the data tables in the trained random forest model, which improves both the accuracy of predicting the evaluation values of all the data tables in the database and the efficiency of cleaning the data tables in the database.
Fig. 5 is a schematic structural diagram of a data table cleaning apparatus according to an embodiment of the present invention. As shown in fig. 5, the data table cleaning apparatus includes: an obtaining module 501, configured to obtain an index set of a database in a preset time period, where the index set includes metadata characteristic parameters of all data tables and relationship characteristic parameters of the database; an input module 502, configured to input the index set into a preset random forest model, and obtain evaluation values of all data tables in the database, where the preset random forest model is obtained by training according to a storage code of the database, metadata information of all data tables in the database, and the evaluation value of each data table; and a cleaning module 503, configured to clean the data table with the smallest evaluation value in the database, and obtain a cleaned database.
In this embodiment, the data table cleaning apparatus may adopt the method described in the above embodiments, and the technical solution and the technical effect thereof are similar to each other and are not described herein again.
In a possible implementation manner, the data table cleaning apparatus further includes: and the preprocessing module is used for preprocessing the index set according to the one-hot code and changing discrete parameters in metadata characteristic parameters and relationship characteristic parameters in the index set into continuous characteristic parameters. The preprocessing module may adopt the method described in the above embodiments, and the technical solutions and technical effects thereof are similar and will not be described herein again.
Fig. 6 is a schematic diagram of a hardware structure of a server according to an embodiment of the present invention. As shown in fig. 6, the server of the present embodiment includes a processor 601 and a memory 602, wherein:
the memory 602 is configured to store computer-executable instructions;
the processor 601 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the server in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 602 may be separate or integrated with the processor 601.
When the memory 602 is provided separately, the server further comprises a bus 603 for connecting the memory 602 and the processor 601.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions; when a processor executes the computer-executable instructions, the data table cleaning method described above is implemented.
An embodiment of the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for cleaning a data table as described above is implemented.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative: the division of the modules is only one logical division, and other divisions are possible in practice. For instance, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be in an electrical, mechanical, or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to implement the solution of the present embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The unit formed by the modules may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general-purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the present invention may be performed directly by a hardware processor, or by a combination of hardware and software modules within the processor.
The memory may comprise a high-speed RAM, and may further comprise a non-volatile memory (NVM), such as at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be performed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for data table cleaning, comprising:
acquiring an index set of a database in a preset time period, wherein the index set comprises metadata characteristic parameters of all data tables and relation characteristic parameters of the database;
inputting the index set into a preset random forest model to obtain evaluation values of all data tables in the database, wherein the preset random forest model is obtained by training according to storage codes of the database, metadata information of all data tables in the database and the evaluation value of each data table;
and cleaning the data table with the minimum evaluation value in the database to obtain the cleaned database.
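By way of a non-limiting sketch, the three steps of claim 1 can be pictured in Python as follows. The scoring function `toy_evaluate` is a hypothetical stand-in for the preset random forest model, and the table names and feature names are assumptions made for this example only.

```python
from typing import Callable, Dict, List, Tuple

FeatureRow = Dict[str, float]


def select_table_to_clean(
    index_set: Dict[str, FeatureRow],
    evaluate: Callable[[FeatureRow], float],
) -> Tuple[str, Dict[str, float]]:
    """Score every table and return the lowest-scoring one with all scores."""
    scores = {name: evaluate(features) for name, features in index_set.items()}
    target = min(scores, key=lambda name: scores[name])
    return target, scores


def clean(tables: List[str], target: str) -> List[str]:
    """Return the table list with the selected table removed."""
    return [name for name in tables if name != target]


def toy_evaluate(features: FeatureRow) -> float:
    # Hypothetical stand-in for the trained model's predicted evaluation value.
    return 10.0 * features["downstream_tables"] - features["days_since_update"]


# Hypothetical index set for three tables; all values are illustrative.
index_set = {
    "ods_orders": {"downstream_tables": 3.0, "days_since_update": 1.0},
    "tmp_export_2020": {"downstream_tables": 0.0, "days_since_update": 400.0},
    "dim_region": {"downstream_tables": 5.0, "days_since_update": 30.0},
}
target, scores = select_table_to_clean(index_set, toy_evaluate)
remaining = clean(list(index_set), target)
```

In this sketch the long-unused table with no downstream consumers receives the minimum evaluation value and is the one removed.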
2. The method of claim 1, wherein the step of training the pre-set random forest model comprises:
acquiring storage codes of a database, metadata information of all data tables in the database and an evaluation value of each data table, and analyzing the storage codes of the database to acquire relational characteristic information;
performing data processing on the relational characteristic information and the metadata information of all the data tables in the database to obtain metadata characteristic parameters and relational characteristic parameters, and taking the metadata characteristic parameters, the relational characteristic parameters and the evaluation value of each data table as a sample set;
dividing the sample set into a training set and a verification set according to a preset proportion, modeling the training set by adopting a random forest algorithm to obtain a random forest model, and verifying the random forest model against the verification set.
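As a non-limiting illustration of the split, modeling, and verification steps, the following Python sketch uses a deliberately tiny stand-in for a random forest: bagged depth-1 regression trees, each fitted on a bootstrap sample and one randomly chosen feature. The synthetic samples, the 0.8 split ratio, the number of trees, and the mean-absolute-error check are all assumptions for this example; a practical implementation would typically rely on an established library.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[List[float], float]  # (feature vector, evaluation value)
Model = Callable[[List[float]], float]


def split_samples(samples: Sequence[Sample], train_ratio: float, seed: int = 0):
    """Shuffle the sample set and divide it by the preset proportion."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def fit_stump(samples: Sequence[Sample], feature: int) -> Model:
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    fallback = mean(y for _, y in samples)
    best = (float("inf"), 0.0, fallback, fallback)  # error, threshold, left, right
    for threshold in sorted({x[feature] for x, _ in samples}):
        left = [y for x, y in samples if x[feature] <= threshold]
        right = [y for x, y in samples if x[feature] > threshold]
        if not right:
            continue  # no split possible at the largest observed value
        lv, rv = mean(left), mean(right)
        err = sum((y - lv) ** 2 for y in left) + sum((y - rv) ** 2 for y in right)
        if err < best[0]:
            best = (err, threshold, lv, rv)
    _, threshold, lv, rv = best
    return lambda x: lv if x[feature] <= threshold else rv


def fit_forest(samples: Sequence[Sample], n_trees: int = 25, seed: int = 0) -> Model:
    """Average many stumps, each trained on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    trees = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]  # sampling with replacement
        trees.append(fit_stump(bootstrap, rng.randrange(n_features)))
    return lambda x: mean(tree(x) for tree in trees)


def mean_absolute_error(model: Model, samples: Sequence[Sample]) -> float:
    return mean(abs(model(x) - y) for x, y in samples)


# Synthetic sample set: [downstream table count, days since update] -> evaluation value.
samples = [([float(d), float(s)], 10.0 * d - 0.1 * s)
           for d in range(5) for s in range(0, 100, 20)]
train, valid = split_samples(samples, train_ratio=0.8)
forest = fit_forest(train)
validation_error = mean_absolute_error(forest, valid)
```

The verification set is held out from training, so `validation_error` gives an estimate of how the model behaves on tables it has not seen.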
3. The method according to claim 1, further comprising, after the obtaining of the index set of the database within the preset time period:
preprocessing the index set using one-hot encoding, so as to convert the discrete parameters among the metadata characteristic parameters and relationship characteristic parameters in the index set into continuous characteristic parameters.
4. The method of claim 2, wherein analyzing the storage codes of the database to acquire the relational characteristic information comprises:
parsing the storage codes of the database to obtain a data table relationship graph of the database;
and evaluating the data table relationship graph according to the audience quantity, the update magnitude and the update frequency of all the data tables in the database to obtain the relational characteristic information of the database.
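As a non-limiting sketch of the first step of claim 4, the following Python example derives upstream and downstream table sets from SQL-like scripts. It assumes that the stored code consists of `INSERT ... SELECT` statements and uses simple regular expressions rather than a real SQL parser; the script contents and table names are hypothetical. The subsequent evaluation by audience quantity, update magnitude and update frequency is not covered here.

```python
import re
from collections import defaultdict
from typing import DefaultDict, Iterable, Set, Tuple

# Simplified patterns; a real SQL parser would be needed for general scripts.
INSERT_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

Graph = DefaultDict[str, Set[str]]


def build_relation_graph(scripts: Iterable[str]) -> Tuple[Graph, Graph]:
    """Return (upstream, downstream) table sets keyed by table name."""
    upstream: Graph = defaultdict(set)
    downstream: Graph = defaultdict(set)
    for script in scripts:
        for statement in script.split(";"):
            match = INSERT_RE.search(statement)
            if not match:
                continue  # only statements that write a table create edges
            target = match.group(1)
            for source in SOURCE_RE.findall(statement):
                if source != target:
                    upstream[target].add(source)
                    downstream[source].add(target)
    return upstream, downstream


# Hypothetical stored code for two jobs.
scripts = [
    "INSERT INTO dw.orders_daily SELECT * FROM ods.orders o "
    "JOIN ods.users u ON o.uid = u.id;",
    "INSERT OVERWRITE TABLE rpt.sales SELECT region, sum(amount) "
    "FROM dw.orders_daily GROUP BY region;",
]
upstream, downstream = build_relation_graph(scripts)
upstream_count = len(upstream["dw.orders_daily"])
downstream_count = len(downstream["dw.orders_daily"])
```

Counts such as `upstream_count` and `downstream_count` correspond to the kind of upstream and downstream table numbers listed among the relational characteristic parameters.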
5. The method of any one of claims 1 to 4, wherein the relational characteristic parameters comprise at least one of: a number of upstream tables, a number of downstream tables, a number of days since the upstream table was last updated, whether the attribution storage of a downstream table is effective, a number of days in existence, a number of days since the last update, a table name category, a number of data rows, a data table size, whether the table is on a white list, whether the attribution storage is effective, a daily average growth, a weekly average growth, a monthly average growth, a day-over-day increment ratio, a week-over-week increment ratio, and a month-over-month increment ratio; and wherein the metadata characteristic parameters comprise at least one of: a user name, a table name, a database name, a generation date, a table update time, a number of data table rows, and a data table size.
6. A data table cleaning apparatus, comprising:
an acquisition module, configured to acquire an index set of a database in a preset time period, wherein the index set comprises metadata characteristic parameters of all data tables and relation characteristic parameters of the database;
an input module, configured to input the index set into a preset random forest model to obtain the evaluation values of all data tables in the database, wherein the preset random forest model is obtained by training according to the storage codes of the database, the metadata information of all data tables in the database and the evaluation value of each data table;
and a cleaning module, configured to clean the data table with the minimum evaluation value in the database to obtain the cleaned database.
7. The apparatus of claim 6, wherein the data table cleaning apparatus further comprises:
a preprocessing module, configured to preprocess the index set using one-hot encoding and convert the discrete parameters among the metadata characteristic parameters and relationship characteristic parameters in the index set into continuous characteristic parameters.
8. A server, comprising a memory and at least one processor;
the memory is configured to store computer-executable instructions;
the at least one processor is configured to execute the computer-executable instructions stored in the memory, so that the at least one processor performs the data table cleaning method according to any one of claims 1 to 5.
9. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the data table cleaning method according to any one of claims 1 to 5.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the data table cleaning method according to any one of claims 1 to 5.
CN202110633592.0A 2021-06-07 2021-06-07 Data table cleaning method and device and server Active CN113268477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633592.0A CN113268477B (en) 2021-06-07 2021-06-07 Data table cleaning method and device and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110633592.0A CN113268477B (en) 2021-06-07 2021-06-07 Data table cleaning method and device and server

Publications (2)

Publication Number Publication Date
CN113268477A (en) 2021-08-17
CN113268477B (en) 2023-06-23

Family

ID=77234506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633592.0A Active CN113268477B (en) 2021-06-07 2021-06-07 Data table cleaning method and device and server

Country Status (1)

Country Link
CN (1) CN113268477B (en)

Also Published As

Publication number Publication date
CN113268477B (en) 2023-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant