CN115422175B

CN115422175B - Invalid data cleaning method based on database historical snapshot

Info

Publication number: CN115422175B
Application number: CN202211031439.1A
Authority: CN
Inventors: 林韶宾; 娄帅; 郑红云; 党中华; 张文凤; 司同; 龙禹; 王佳明; 林禹
Original assignee: Beijing Great Opensource Software Co ltd
Current assignee: Beijing Great Opensource Software Co ltd
Priority date: 2022-08-26
Filing date: 2022-08-26
Publication date: 2023-03-31
Anticipated expiration: 2042-08-26
Also published as: CN115422175A

Abstract

The invention provides an invalid data cleaning method based on database historical snapshots, comprising the following steps: collecting all database historical snapshots in a source database of a distributed system; performing data analysis on all collected historical snapshots of the distributed database to obtain a first data table set; obtaining unidentified data in the distributed database to be cleaned to obtain a second data table set; and selecting the second data tables in the second data table set in sequence, deleting the currently selected second data table if it does not exist in the first data table set, until all second data tables remaining in the second data table set exist in the first data table set.

Description

Invalid data cleaning method based on database historical snapshot

Technical Field

The invention relates to the technical field of database data processing, in particular to an invalid data cleaning method based on database historical snapshots.

Background

With the development of internet technology, many industries have entered the era of massive data, and most current big data technologies focus on data mining and utilization. Mining big data presupposes a large amount of data, but excessive data makes mining and utilization considerably harder. In today's information explosion, rapid data updates are accompanied by a dramatic increase in data volume; in other words, the latest data must be captured and outdated or stale data must be cleaned up in time. Otherwise, the excessive data volume greatly increases the difficulty of data mining and, more importantly, can directly cause errors in data analysis. At present, a common way to clear invalid data is to search the database for it directly according to an invalidation condition or a time condition and then clear it. The search involves a heavy workload, which reduces fault tolerance and thereby hampers the invalid data cleaning process.

Disclosure of Invention

To address the defects of the prior art, the invention provides an invalid data cleaning method based on database historical snapshots, which uses database historical snapshots to rapidly identify and clean invalid data in a database, thereby effectively reducing the workload of searching the database directly for invalid data according to invalidation conditions or time conditions.

An invalid data cleaning method based on a database historical snapshot comprises the following steps:

collecting all historical database snapshots in a source database; performing data analysis on all collected historical database snapshots to obtain a first data table set; obtaining unidentified data in the database to be cleaned to obtain a second data table set; and selecting the second data tables in the second data table set in sequence, deleting the currently selected second data table if it does not exist in the first data table set, until all second data tables remaining in the second data table set exist in the first data table set.

As an embodiment of the present invention, performing data analysis on all collected historical database snapshots to obtain a first data table set, including: analyzing data of all collected historical database snapshots to obtain file information corresponding to each historical database snapshot and path information corresponding to the file information; generating a data table corresponding to each database historical snapshot according to the file information and the path information corresponding to the file information; and integrating the data tables corresponding to all the historical snapshots of the database to obtain a first data table set.

As an embodiment of the present invention, obtaining unidentified data in a database to be cleaned to obtain a second data table set includes: acquiring all data tables marked as unidentified data in a source database, and establishing a database to be cleaned; and integrating the data tables marked as unidentified data in the database to be cleaned to obtain a second data table set.

As an embodiment of the present invention, acquiring all data tables marked as unidentified data in a source database includes: acquiring all data tables to be identified in the source database; for each data table to be identified, collecting reading time data and data table reading object data within a preset time; determining the activity of the corresponding data table to be identified according to its reading time data; determining the importance of the corresponding data table to be identified according to its data table reading object data; performing data effective value analysis according to the activity and the importance of each data table to be identified to obtain a data effective value of each data table to be identified; and if the data effective value of the current data table to be identified is smaller than a preset data effective value threshold, marking the current data table to be identified as unidentified data.

As an embodiment of the present invention, acquiring all data tables to be identified in a source database includes: acquiring an identification instruction statement input by a user, and parsing the identification instruction statement to obtain effective data identification information, wherein the identification instruction statement is an SQL statement generated by combining the effective data identification information preset by the user with a corresponding Structured Query Language (SQL) command; and querying the visible data of the source database for data tables that cannot be matched with the effective data identification information, to obtain all data tables to be identified in the source database.

As an embodiment of the present invention, deleting the currently selected second data table includes: generating a first Structured Query Language (SQL) command according to the currently selected second data table and the corresponding database to be cleaned; and executing a first Structured Query Language (SQL) command, and cleaning the currently selected second data table in the second data table set.

As an embodiment of the present invention, a method for cleaning invalid data based on a database history snapshot further includes: generating a second Structured Query Language (SQL) command according to the currently selected second data table and the source database; and executing a second Structured Query Language (SQL) command, and cleaning a table corresponding to the currently selected second data table in the source database.

As an embodiment of the present invention, the method for cleaning invalid data based on a database historical snapshot further includes: automatically generating an operation log record after the deleting operation is finished.

As an embodiment of the present invention, a method for cleaning invalid data based on a historical snapshot of a database further includes: after the cleaning is finished, generating a third Structured Query Language (SQL) command according to the database to be cleaned and the second data table set; and executing a third Structured Query Language (SQL) command, and cleaning all second data tables in the database to be cleaned.

As an embodiment of the present invention, the method for cleaning invalid data based on a database historical snapshot further includes: setting a fixed cleaning time, and cleaning the invalid data in the source database once per fixed cleaning time interval.

The beneficial effects of the invention are as follows:

The invention provides an invalid data cleaning method based on database historical snapshots, which uses database historical snapshots to rapidly identify and clean invalid data in a database, thereby effectively reducing the workload of searching the database directly for invalid data according to invalidation conditions or time conditions.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flowchart of a method for cleaning invalid data based on a historical snapshot of a database according to an embodiment of the present invention;

FIG. 2 is a flowchart of obtaining all data tables marked as unidentified data in a source database, in an invalid data cleaning method based on a historical database snapshot according to an embodiment of the present invention;

FIG. 3 is a flowchart of obtaining all data tables to be identified in a source database, in an invalid data cleaning method based on a historical database snapshot according to an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it should be understood that they are presented herein only to illustrate and explain the present invention and not to limit the present invention.

Referring to fig. 1, an embodiment of the present invention provides a method for cleaning invalid data based on a historical database snapshot, including:

S101, collecting all historical database snapshots in a source database;

S102, performing data analysis on all collected historical database snapshots to obtain a first data table set;

S103, obtaining unidentified data in a database to be cleaned to obtain a second data table set;

S104, selecting the second data tables in the second data table set in sequence, and deleting the currently selected second data table if it does not exist in the first data table set;

S105, ending when all second data tables remaining in the second data table set exist in the first data table set.

The working principle of the above technical scheme is as follows: when a database invalid data cleaning instruction is received, all database historical snapshots in the current source database are collected and data analysis is performed on them to obtain a first data table set; meanwhile, unidentified data in the database to be cleaned is obtained to yield a second data table set. The second data tables in the second data table set are then selected in sequence, and the currently selected second data table is deleted if it does not exist in the first data table set. When all second data tables remaining in the second data table set exist in the first data table set, the cleaning of invalid data is complete.

The beneficial effects of the above technical scheme are: invalid data in the database is rapidly identified and cleaned by means of database historical snapshots, which effectively reduces the workload of searching the database directly for invalid data according to invalidation conditions or time conditions.
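As an illustration only, steps S104 and S105 amount to a set-membership check over the second data table set. The sketch below assumes both sets are already available as table names; the function name `clean_invalid_tables` and the `drop_table` callback are hypothetical and do not come from the patent.

```python
from typing import Callable, Iterable, List, Set


def clean_invalid_tables(
    first_set: Set[str],
    second_set: Iterable[str],
    drop_table: Callable[[str], None],
) -> List[str]:
    """Delete every table in second_set that is absent from first_set.

    Returns the names of the deleted tables in the order they were visited.
    """
    deleted = []
    for table in second_set:        # S104: select the second data tables in sequence
        if table not in first_set:  # not present in any historical snapshot
            drop_table(table)
            deleted.append(table)
    # S105: every second data table that was not deleted exists in first_set
    return deleted


# Example with an in-memory stand-in for the database to be cleaned.
snapshot_tables = {"orders", "customers"}           # first data table set
candidates = ["orders", "tmp_import", "old_audit"]  # second data table set
remaining = set(candidates)
removed = clean_invalid_tables(snapshot_tables, candidates, remaining.discard)
print(removed)    # ['tmp_import', 'old_audit']
print(remaining)  # {'orders'}
```

Passing the deletion as a callback keeps the comparison logic separate from whatever SQL command actually removes the table.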

In one embodiment, performing data parsing on all collected historical database snapshots to obtain a first data table set, including: analyzing data of all collected historical database snapshots to obtain file information corresponding to each historical database snapshot and path information corresponding to the file information; generating a data table corresponding to each database historical snapshot according to the file information and the path information corresponding to the file information; integrating data tables corresponding to all database historical snapshots to obtain a first data table set;

The working principle and beneficial effects of the above technical scheme are as follows: the file information corresponding to each database historical snapshot, and the path information corresponding to that file information, are obtained through a read-only view of the database historical snapshots; the corresponding data tables are quickly determined from this information and the first data table set is constructed, which helps speed up the preliminary stage of invalid data cleaning.
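The patent does not say how file and path information map to data tables. The sketch below is therefore only one possible reading: it assumes a hypothetical storage layout of the form `<root>/<schema>/<table>/<data file>` and derives table names from the path components. The layout, the snapshot names, and both function names are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Set


def tables_in_snapshot(file_paths: Iterable[str]) -> Set[str]:
    """Derive table names from the file paths recorded in one snapshot.

    Assumed (hypothetical) layout: <root>/<schema>/<table>/<data file>.
    """
    tables = set()
    for raw in file_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) < 3:
            continue  # too short to carry schema and table components
        schema, table = parts[-3], parts[-2]
        tables.add(f"{schema}.{table}")
    return tables


def build_first_table_set(snapshots: Dict[str, List[str]]) -> Set[str]:
    """Union the per-snapshot table sets into the first data table set."""
    first_set: Set[str] = set()
    for paths in snapshots.values():
        first_set |= tables_in_snapshot(paths)
    return first_set


snapshots = {
    "snap_2022_08_01": ["/data/sales/orders/part-0001", "/data/sales/orders/part-0002"],
    "snap_2022_08_15": ["/data/sales/customers/part-0001"],
}
first_set = build_first_table_set(snapshots)
print(sorted(first_set))  # ['sales.customers', 'sales.orders']
```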

In one embodiment, obtaining unidentified data in a database to be cleaned to obtain a second data table set comprises: acquiring all data tables marked as unidentified data in a source database, and establishing a database to be cleaned; integrating the data tables marked as unidentified data in the database to be cleaned to obtain a second data table set;

The working principle and beneficial effects of the above technical scheme are as follows: all data tables marked as unidentified data in the source database are obtained in advance and a database to be cleaned is established; the second data table set is then obtained from the database to be cleaned, which helps speed up the preliminary stage of invalid data cleaning.

Referring to fig. 2, in one embodiment, obtaining all data tables marked as unidentified data in the source database includes:

S201, acquiring all data tables to be identified in a source database;

S202, for each data table to be identified, collecting reading time data and data table reading object data within a preset time;

S203, determining the activity of the corresponding data table to be identified according to the reading time data of the data table to be identified;

S204, determining the importance of the corresponding data table to be identified according to the data table reading object data of the data table to be identified;

S205, performing data effective value analysis according to the activity and the importance of each data table to be identified to obtain a data effective value of each data table to be identified;

S206, if the data effective value of the current data table to be identified is smaller than the preset data effective value threshold, marking the current data table to be identified as unidentified data.

The working principle of the above technical scheme is as follows: invalid data generally includes physically invalid data and logically invalid data. Before invalid data is judged through historical snapshots, the data is preferably given a preliminary screening by data effective value, and the remaining data that cannot be judged in this way is then judged through historical snapshots. First, all data tables to be identified in the source database are acquired, the data tables to be identified being all visible data tables as determined by visibility. Then, for each data table to be identified, the reading time within a preset time and the reading object data for each read operation are collected; the data table reading object data refers to the application end or program end that calls the data table. After the reading time data and the data table reading object data are obtained, the activity of the corresponding data table to be identified is determined according to its reading time data. The activity is preferably judged according to the frequency of reads recorded in the reading time data, and is preferably calculated as follows:

[formula image not reproduced]

wherein H is the activity, p is the frequency of reads within the preset time, h is the preset time, and $L_{p,h}$ is a preset weight value corresponding to the preset time h and the frequency p of reads within the preset time; the shorter the preset time h and the higher the frequency p, the larger $L_{p,h}$. Then, the importance Z of the corresponding data table to be identified is determined according to its data table reading object data. The importance Z is preferably judged according to the importance of the reading objects recorded in the data table reading object data within the preset time, and is preferably calculated as follows:

[formula image not reproduced]

wherein $m_i$ is the importance of the reading object in the i-th read within the preset time h; the importance of a reading object is determined by its daily use frequency, and the higher the daily use frequency, the higher the importance of the reading object. Then, data effective value analysis is performed according to the activity and the importance of each data table to be identified to obtain the data effective value of each data table to be identified, preferably as follows:

[formula image not reproduced]

wherein Y is the data effective value, and α and β are preset weight values corresponding to the activity and the importance, respectively. Finally, the data effective value of the current data table to be identified is compared with a preset data effective value threshold, and if it is smaller than the threshold, the current data table to be identified is marked as unidentified data.

The beneficial effects of the above technical scheme are: before invalid data is judged through historical snapshots, data that is obviously valid is screened out in advance using the data effective value obtained from the activity and importance analysis, which effectively reduces the workload of subsequently judging invalid data through historical snapshots and speeds up invalid data cleaning.
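The formulas for H, Z and Y are not reproduced in this text, so the sketch below substitutes assumed stand-ins that are merely consistent with the symbol definitions: H as the weight $L_{p,h}$ times the read frequency p, Z as the sum of the per-read importances $m_i$, and Y as the weighted sum αH + βZ. These forms, the function names, and all numeric values are assumptions and should not be read as the patented formulas. Only the final threshold comparison is taken directly from step S206.

```python
from typing import Sequence


def activity(read_frequency: float, weight: float) -> float:
    """Assumed form H = L_{p,h} * p (the source formula is not reproduced)."""
    return weight * read_frequency


def importance(reader_importances: Sequence[float]) -> float:
    """Assumed form Z = sum of m_i over the reads within the preset time."""
    return sum(reader_importances)


def data_effective_value(h: float, z: float, alpha: float, beta: float) -> float:
    """Assumed form Y = alpha * H + beta * Z."""
    return alpha * h + beta * z


def is_unidentified(effective_value: float, threshold: float) -> bool:
    """S206: a table is marked as unidentified data when Y is below the threshold."""
    return effective_value < threshold


# Hypothetical numbers: 4 reads in the window, weight 0.5, per-read importances m_i.
H = activity(read_frequency=4, weight=0.5)         # 2.0
Z = importance([1.0, 2.0, 1.0, 2.0])               # 6.0
Y = data_effective_value(H, Z, alpha=0.5, beta=0.25)
print(Y)                                  # 2.5
print(is_unidentified(Y, threshold=3.0))  # True
```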

Referring to fig. 3, in an embodiment, acquiring all the to-be-identified data tables in the source database includes:

S301, obtaining an identification instruction statement input by a user, and parsing the identification instruction statement to obtain effective data identification information; the identification instruction statement is an SQL statement generated by combining the effective data identification information preset by the user with a corresponding Structured Query Language (SQL) command;

S302, querying the visible data of the source database for data tables that cannot be matched with the effective data identification information, to obtain all data tables to be identified in the source database.

The working principle and beneficial effects of the above technical scheme are as follows: the aim is to prevent data that the user needs but that is rarely used or logically invalid from being deleted as invalid data, for example reference material such as standard data tables and construction logic. An identification instruction statement input by the user is obtained in advance and parsed to obtain the effective data identification information; the identification instruction statement is an SQL statement generated by combining the effective data identification information preset by the user with a corresponding Structured Query Language (SQL) command. Using the effective data identification information, data tables that the user needs but rarely uses are kept out of the tables to be identified: the visible data of the source database is queried for data tables that cannot be matched with the effective data identification information, yielding all data tables to be identified in the source database. This improves the reliability of invalid data cleaning and avoids cleaning data the user needs.
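Step S302 can be pictured as a catalog query that excludes tables matching the user's effective data identification information. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database and treats the identification information as a list of `GLOB` name patterns. The pattern format, the table names, and the function `tables_to_identify` are assumptions, since the patent does not specify the matching rule.

```python
import sqlite3
from typing import List, Sequence


def tables_to_identify(conn: sqlite3.Connection, keep_patterns: Sequence[str]) -> List[str]:
    """List tables whose names match none of the user's keep patterns."""
    sql = "SELECT name FROM sqlite_master WHERE type = 'table'"
    for _ in keep_patterns:
        sql += " AND name NOT GLOB ?"  # one bound parameter per pattern
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, list(keep_patterns))]


conn = sqlite3.connect(":memory:")
for name in ("ref_standards", "orders", "tmp_import"):
    conn.execute(f'CREATE TABLE "{name}" (id INTEGER)')

# "ref_*" stands for tables the user keeps for reference even if rarely read.
print(tables_to_identify(conn, ["ref_*"]))  # ['orders', 'tmp_import']
```

Binding the patterns as parameters, rather than formatting them into the SQL text, keeps user input out of the statement itself.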

In one embodiment, deleting the currently selected second data table includes: generating a first Structured Query Language (SQL) command according to the currently selected second data table and the corresponding database to be cleaned; executing a first Structured Query Language (SQL) command, and cleaning a currently selected second data table in the second data table set;

The working principle and beneficial effects of the above technical scheme are as follows: when the currently selected second data table is to be deleted, a first Structured Query Language (SQL) command is generated according to the currently selected second data table and the corresponding database to be cleaned. Generating the SQL statement for the database to be cleaned directly helps improve cleaning efficiency. Once obtained, the first SQL command is executed to clean the currently selected second data table from the second data table set.

In one embodiment, a method for cleaning invalid data based on a historical snapshot of a database further comprises: generating a second Structured Query Language (SQL) command according to the currently selected second data table and the source database; executing a second Structured Query Language (SQL) command, and cleaning a table corresponding to the currently selected second data table in the source database;

The working principle and beneficial effects of the above technical scheme are as follows: to improve efficiency, while invalid data is being judged and cleaned, the currently selected second data table is deleted from the database to be cleaned and, at the same time, a second Structured Query Language (SQL) command is generated according to the currently selected second data table and the source database; the second SQL command is then executed to clean the table corresponding to the currently selected second data table in the source database, which helps improve the efficiency of cleaning invalid data.
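The first and second SQL commands are not spelled out in the text. A plausible minimal form is a `DROP TABLE` statement issued once against the database to be cleaned and once against the source database. The sketch below uses two in-memory `sqlite3` databases as stand-ins; the helper names and the choice of `DROP TABLE IF EXISTS` are assumptions.

```python
import sqlite3
from typing import List


def quote_identifier(name: str) -> str:
    """Quote a table name for SQLite, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def drop_table_sql(table: str) -> str:
    """Generate the SQL command that removes one table."""
    return f"DROP TABLE IF EXISTS {quote_identifier(table)}"


def delete_everywhere(staging: sqlite3.Connection, source: sqlite3.Connection, table: str) -> None:
    staging.execute(drop_table_sql(table))  # first SQL command: database to be cleaned
    source.execute(drop_table_sql(table))   # second SQL command: source database


def table_names(conn: sqlite3.Connection) -> List[str]:
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [row[0] for row in rows]


staging_db = sqlite3.connect(":memory:")
source_db = sqlite3.connect(":memory:")
for db in (staging_db, source_db):
    db.execute("CREATE TABLE tmp_import (id INTEGER)")
    db.execute("CREATE TABLE orders (id INTEGER)")

delete_everywhere(staging_db, source_db, "tmp_import")
print(table_names(staging_db), table_names(source_db))  # ['orders'] ['orders']
```

Table names cannot be bound as SQL parameters, which is why the sketch quotes the identifier explicitly before building the statement.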

In one embodiment, a method for cleaning invalid data based on a database historical snapshot further comprises: automatically generating an operation log record after the deleting operation is finished;

The beneficial effects of the above technical scheme are: the user can conveniently review the deleted data.
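The content and format of the operation log record are not specified. As a minimal sketch, the record could be one log line per deleted table, written with Python's standard `logging` module; the logger name, the message format, and the in-memory stream used here are all hypothetical choices.

```python
import io
import logging
from typing import TextIO


def make_cleanup_logger(stream: TextIO) -> logging.Logger:
    """Create a logger that writes one line per cleaning operation to stream."""
    logger = logging.getLogger("invalid_data_cleanup")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines if called more than once
    logger.propagate = False  # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_deletion(logger: logging.Logger, database: str, table: str) -> None:
    """Record that a table was deleted from the given database."""
    logger.info("deleted table %s from %s", table, database)


buffer = io.StringIO()
log = make_cleanup_logger(buffer)
log_deletion(log, "source_db", "tmp_import")
print(buffer.getvalue(), end="")  # INFO deleted table tmp_import from source_db
```

In practice the stream would more likely be a file, and a timestamp field could be added to the formatter so the user can see when each deletion happened.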

In one embodiment, a method for cleaning invalid data based on a historical snapshot of a database further comprises: after cleaning, generating a third Structured Query Language (SQL) command according to the database to be cleaned and the second data table set; executing a third Structured Query Language (SQL) command, and cleaning all second data tables in the database to be cleaned;

The working principle and beneficial effects of the above technical scheme are as follows: after cleaning is finished, a third Structured Query Language (SQL) command is generated according to the database to be cleaned and the second data table set, and the third SQL command is executed to clean all second data tables in the database to be cleaned. The second data tables cleaned in this step are those determined to be valid data. Processing the database to be cleaned promptly after cleaning is complete helps reduce the ongoing overhead of database operation.

In one embodiment, a method for cleaning invalid data based on a historical snapshot of a database further comprises: setting fixed cleaning time, and cleaning invalid data in the source database once every other fixed cleaning time;

The beneficial effects of the above technical scheme are: once the fixed cleaning time has been set, the user no longer needs to enter a cleaning instruction manually each time, so cleaning is automated and more intelligent.
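Cleaning at a fixed interval can be sketched as a loop that runs the cleaning task and then waits before the next run. The function `run_every`, its `max_runs` bound, and the injectable `sleep` parameter are illustrative assumptions; a deployment would more likely rely on the database's own job scheduler or an operating system scheduler.

```python
import time
from typing import Callable


def run_every(
    interval_seconds: float,
    task: Callable[[], None],
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run task max_runs times, waiting interval_seconds between runs.

    Returns the number of completed runs.
    """
    completed = 0
    while completed < max_runs:
        task()
        completed += 1
        if completed < max_runs:  # no wait is needed after the final run
            sleep(interval_seconds)
    return completed


# Example: a stand-in cleaning task that only counts its invocations.
runs = []
count = run_every(0.01, lambda: runs.append("cleaned"), max_runs=3)
print(count, len(runs))  # 3 3
```

Accepting the sleep function as a parameter lets the interval logic be exercised without real waiting.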

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for cleaning invalid data based on a database historical snapshot is characterized by comprising the following steps: collecting all historical database snapshots in a source database; analyzing data of all collected historical database snapshots to obtain a first data table set; obtaining unidentified data in a database to be cleaned to obtain a second data table set, selecting second data tables in the second data table set in sequence, and deleting the currently selected second data table if the currently selected second data table does not exist in the first data table set until all the second data tables in the second data table set exist in the first data table set;

obtaining unidentified data in a database to be cleaned to obtain a second data table set, wherein the second data table set comprises the following steps: acquiring all data tables marked as unidentified data in a source database, and establishing a database to be cleaned; integrating the data tables marked as unidentified data in the database to be cleaned to obtain a second data table set;

acquiring all data tables marked as unidentified data in a source database comprises: acquiring all data tables to be identified in the source database; for each data table to be identified, collecting reading time data and data table reading object data within a preset time; determining the activity of the corresponding data table to be identified according to the reading time data of the data table to be identified, wherein the activity is calculated as follows:

[formula image not reproduced]

wherein H is the activity, p is the frequency of reads within the preset time, h is the preset time, and $L_{p,h}$ is a preset weight value corresponding to the preset time h and the frequency p of reads within the preset time, wherein the shorter the preset time h and the higher the frequency p of reads within the preset time, the larger $L_{p,h}$; determining the importance of the corresponding data table to be identified according to the data table reading object data of the data table to be identified, wherein the importance Z is calculated as follows:

[formula image not reproduced]

wherein $m_i$ is the importance of the reading object in the i-th read within the preset time h, the importance of the reading object depends on the daily use frequency of the reading object, and the higher the daily use frequency, the higher the importance of the reading object; performing data effective value analysis according to the activity and the importance of each data table to be identified to obtain the data effective value of each data table to be identified, wherein the data effective value is calculated as follows:

[formula image not reproduced]

wherein Y is the data effective value, and α and β are preset weight values corresponding to the activity and the importance, respectively; and if the data effective value of the current data table to be identified is smaller than a preset data effective value threshold, marking the current data table to be identified as unidentified data.

2. The invalid data cleaning method based on the historical database snapshot according to claim 1, wherein performing data analysis on all collected historical database snapshots to obtain a first data table set comprises: analyzing data of all collected database historical snapshots to obtain file information corresponding to each database historical snapshot and path information corresponding to the file information; generating a data table corresponding to each database historical snapshot according to the file information and the path information corresponding to the file information; and integrating the data tables corresponding to all the historical database snapshots to obtain a first data table set.

3. The method for cleaning invalid data based on the historical snapshot of the database as claimed in claim 1, wherein obtaining all the data tables to be identified in the source database comprises: acquiring an identification instruction statement input by a user, and parsing the identification instruction statement to obtain effective data identification information; the identification instruction statement is an SQL statement generated by combining the effective data identification information preset by the user with a corresponding Structured Query Language (SQL) command; and querying the visible data of the source database for data tables that cannot be matched with the effective data identification information, to obtain all data tables to be identified in the source database.

4. The invalid data cleaning method based on the historical snapshot of the database as claimed in claim 1, wherein deleting the currently selected second data table comprises: generating a first Structured Query Language (SQL) command according to the currently selected second data table and the corresponding database to be cleaned; and executing a first Structured Query Language (SQL) command, and cleaning the currently selected second data table in the second data table set.

5. The method for cleaning invalid data based on the historical snapshot of the database as claimed in claim 4, further comprising: generating a second Structured Query Language (SQL) command according to the currently selected second data table and the source database; and executing a second Structured Query Language (SQL) command, and cleaning a table corresponding to the currently selected second data table in the source database.

6. The method for cleaning invalid data based on the historical snapshot of the database as claimed in claim 1, further comprising: automatically generating an operation log record after the deleting operation is finished.

7. The method for cleaning invalid data based on the historical snapshot of the database as claimed in claim 1, further comprising: after the cleaning is finished, generating a third Structured Query Language (SQL) command according to the database to be cleaned and the second data table set; and executing a third Structured Query Language (SQL) command, and cleaning all second data tables in the database to be cleaned.

8. The method for cleaning invalid data based on the historical snapshot of the database as claimed in claim 1, further comprising: setting a fixed cleaning time, and cleaning the invalid data in the source database once per fixed cleaning time interval.