CN118484449A - Data cleaning method, device, storage medium and computer program product - Google Patents
- Publication number
- CN118484449A CN118484449A CN202410745972.7A CN202410745972A CN118484449A CN 118484449 A CN118484449 A CN 118484449A CN 202410745972 A CN202410745972 A CN 202410745972A CN 118484449 A CN118484449 A CN 118484449A
- Authority
- CN
- China
- Prior art keywords
- data
- data table
- original
- new
- original data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application discloses a data cleaning method, device, storage medium and computer program product, relating to the technical field of data processing. The method comprises the following steps: acquiring an original data table and the data to be cleaned in that table; establishing a new data table and a temporary data table based on the original data table; obtaining an index field of the original data table based on the data to be cleaned; screening the data in the original data table according to the index field and migrating the screened data to the new data table, which improves the accuracy and efficiency of data migration; and updating the new data table using the index field and the temporary data table to obtain a final result table, which ensures the integrity and accuracy of the data.
Description
Technical Field
The present application relates to the field of data processing technology, and in particular, to a data cleaning method, apparatus, storage medium, and computer program product.
Background
After a data management system is refactored and rebuilt, tables holding tens of millions of rows accumulate a large amount of useless data that occupies a large share of the total storage space. To maintain the stability and performance of the data management system and to save storage space, the data in such tables must be cleaned. However, conventional cleaning requires stopping the data management system, which interrupts ongoing business transactions. Moreover, when data is cleaned directly in the original data table, the remaining data becomes discontinuous in physical storage: although the amount of data in the table shrinks after cleaning, the data management system still treats the table space as fully used. In addition, if data is updated after the cleaning operation completes, the cleaned data may be lost.
Therefore, how to accurately clean the data in a data table without stopping the system has become a technical problem to be solved.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The application mainly aims to provide a data cleaning method, device, storage medium and computer program product, which aim to solve the technical problem of accurately cleaning the data in a data table without stopping the system.
In order to achieve the above object, the present application provides a data cleaning method, which includes:
acquiring an original data table and data to be cleaned in the original data table;
Establishing a new data table and a temporary data table based on the original data table;
obtaining an index field of the original data table based on the data to be cleaned;
Screening the data in the original data table according to the index field, and transferring the screened data to the new data table;
And updating the new data table by utilizing the index field and the temporary data table to obtain a final result table.
In one embodiment, the step of obtaining the index field of the original data table based on the data to be cleaned includes:
establishing a combined index of the original data table according to the data to be cleaned;
An index field is determined based on the combined index.
In an embodiment, the index field includes a mark field, a first timestamp, and a second timestamp, and the step of screening the data in the original data table according to the index field and migrating the screened data to the new data table includes:
screening the data in the original data table according to the mark field and the first timestamp to obtain screened first data;
migrating the first data to the new data table based on the original data table;
screening the data in the original data table according to the mark field, the first timestamp and the second timestamp to obtain screened second data;
and based on the original data table, migrating the second data to the new data table.
In one embodiment, the step of updating the new data table according to the index field and the temporary data table to obtain a final result table includes:
Performing data updating operation on the new data table based on a preset data updating instruction;
inserting the data before the data updating operation of the new data table into the temporary data table;
performing data screening on the original data table based on the mark field and the second timestamp;
Determining final data by using the original data table, the temporary data table and the new data table according to the data screening result of the original data table;
and merging the final data into the new data table to obtain a final result table.
In one embodiment, the step of determining final data using the original data table, the temporary data table, and the new data table according to the result of the data filtering on the original data table includes:
if the data of the original data table is not screened, determining final data according to the new data table;
and if the data of the original data table is screened, querying the screened data of the original data table in the temporary data table and the new data table.
In an embodiment, the step of querying the temporary data table and the new data table for the selected data of the original data table if the data of the original data table is selected includes:
Performing data query on the temporary data table based on the screened original data table data;
If the screened original data table data is not queried, taking the screened original data table data as final data;
And if the screened original data table data is queried, comparing the original data table, the temporary data table and the new data table one by one to obtain final data.
In an embodiment, the step of merging the final data into the new data table to obtain a final result table further comprises:
And deleting the original data table and the temporary data table.
In addition, to achieve the above object, the present application also proposes a data cleaning apparatus, the apparatus comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the data cleaning method as described above.
In addition, to achieve the above object, the present application also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the data cleaning method as described above.
Furthermore, to achieve the above object, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the data cleaning method as described above.
One or more technical schemes provided by the application have at least the following technical effects:
Obtaining the original data table and the data to be cleaned in it makes the cleaning target explicit, so that the data to be cleaned can be accurately identified and processed in subsequent steps and unnecessary misoperations are avoided. A new data table and a temporary data table are established from the original data table; the data in the original data table is screened using an index field derived from the data to be cleaned, and the screened data is migrated to the new data table; finally, the new data table is updated using the index field and the temporary data table to obtain a final result table. Screening through the index field improves the accuracy and efficiency of data migration, and updating the new data table with the temporary data table and the index field ensures the integrity and accuracy of the data. By establishing the new table and the temporary table, the data to be cleaned in the original data table is accurately and effectively replaced or corrected without stopping the system, data loss is avoided, and the technical problem of accurately cleaning the data in a data table without system downtime is solved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a first embodiment of a data cleaning method according to the present application;
FIG. 2 is a flow chart of a third embodiment of the data cleaning method according to the present application;
FIG. 3 is a flowchart of a fourth embodiment of a data cleaning method according to the present application;
FIG. 4 is a flowchart of a data cleaning method according to a seventh embodiment of the present application;
FIG. 5 is a schematic flow chart of data cleaning of an original data table;
- FIG. 6 is a diagram showing the field comparison method used when the temporary data table is used to update the new data table;
FIG. 7 is a schematic block diagram of a data cleaning device according to an embodiment of the present application;
fig. 8 is a schematic device structure diagram of a hardware operating environment related to a data cleaning method in an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.
For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.
The main solutions of the embodiments of the present application are: and acquiring the original data table and the data to be cleaned in the original data table, establishing a new data table and a temporary data table according to the original data table, screening the data in the original data table by utilizing an index field obtained according to the data to be cleaned, and transferring the screened data to the new data table to avoid cleaning in the original data table directly. And finally, updating the data of the new data table by utilizing the index field and the temporary data table, and avoiding the loss of the data updating by means of the temporary table so as to realize the accurate cleaning of the data.
The present embodiment considers the following: the prior art cleans the data to be cleaned directly in the original data table, so the system downtime during cleaning affects real-time business transactions. In addition, cleaning directly in the original data table leaves the remaining data discontinuous in physical storage; although the cleaning operation reduces the amount of data in the table, the system still treats the table space as fully used, producing a high fragmentation rate and a high water level. Finally, cleaning directly on the original data table cannot cope with data that continues to be updated after the cleaning operation completes, so such updates may be lost.
Therefore, the application provides a solution, the data to be reserved is migrated from the original data table to the new data table by utilizing the index field in a mode of creating the data table, the system performs data cleaning on the basis of no shutdown, real-time business transaction is not affected, and the problems of high fragment rate and high water level caused by directly performing data cleaning operation on the original data table are avoided. In order to solve the data updating condition in the data cleaning process, the temporary table is used for avoiding data loss, and the new data table is combined for data cleaning, so that the aim of accurately cleaning the data is fulfilled.
It should be noted that, the execution body of the embodiment may be a computing service device having functions of data processing, network communication and program running, such as a tablet computer, a personal computer, a mobile phone, or an electronic device, a data management system, or the like, which can implement the above functions. The present embodiment and the following embodiments will be described below by taking a data management system as an example.
Based on this, an embodiment of the present application provides a data cleaning method, referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the data cleaning method of the present application.
In this embodiment, the data cleaning method includes steps S10 to S50:
step S10, acquiring an original data table and data to be cleaned in the original data table;
The original data table is the starting point of data cleaning and is usually stored in a database. The data management system determines the storage location of the original data table and the corresponding access rights. Once the storage location and access rights are known, a query in a specific query language is executed to obtain the original data table.
The data to be cleaned refers to the data which exists in the original data table and needs to be processed. Such data may contain erroneous, duplicate, incomplete, or inconsistent information for a variety of reasons. The purpose of cleaning up the data is to improve the quality and accuracy of the data so that subsequent data analysis and application can be performed based on the reliable data. The data to be cleaned is the data obtained by inquiring the original data table.
Specifically, as one possible implementation, step S10 of obtaining the original data table and the data to be cleaned in it may include: first, connect to the database storing the original data table using pre-configured database connection information, including the host name, port, user name, and password. Then, once connected to the specific database, execute SQL (Structured Query Language) query instructions to obtain the original data table; SQL is a programming language designed for communicating with databases and is the standard language used by relational data management systems to manage and manipulate databases. Finally, mark the data to be cleaned in the original data table in small batches: rows to be retained keep the default value in a mark field, while rows to be cleaned receive a mark value, thereby identifying the data to be cleaned in the original data table.
In this embodiment, the overall logical structure and characteristics of the original data table are known, and a complete and accurate original data table is obtained through SQL query instructions. Marking the data to be cleaned in small batches on the original data table makes the cleaning target explicit, so the data to be cleaned can be accurately identified and processed in subsequent steps, unnecessary misoperations are avoided, and accurate data cleaning is achieved.
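The marking step above can be sketched as follows. This is a minimal illustration using Python's sqlite3 module as a stand-in for the production database; the table name "orders", the "status" column, and the "flag" mark field are hypothetical, and a real deployment would connect with the pre-configured host, port, user name, and password instead of an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the production database
cur = conn.cursor()

# Hypothetical original table; "flag" is the mark field, default 0 = retain.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, flag INTEGER DEFAULT 0)")
cur.executemany("INSERT INTO orders (status) VALUES (?)",
                [("active",), ("obsolete",), ("active",), ("obsolete",)])

# Mark rows to be cleaned in small batches so the table is never locked for long.
BATCH = 2
while True:
    cur.execute(
        "UPDATE orders SET flag = 1 WHERE id IN "
        "(SELECT id FROM orders WHERE status = 'obsolete' AND flag = 0 LIMIT ?)",
        (BATCH,))
    conn.commit()
    if cur.rowcount == 0:
        break

marked = cur.execute("SELECT COUNT(*) FROM orders WHERE flag = 1").fetchone()[0]
print(marked)  # 2
```

Batching the UPDATE keeps each transaction short, which is what allows marking to proceed without stopping the system.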
The above is only one possible implementation of step S10 provided in this embodiment, and the specific implementation of step S10 in this embodiment is not specifically limited.
Step S20, a new data table and a temporary data table are established based on the original data table;
A new data table and a temporary data table whose logical structure matches that of the original data table are created. Keeping the logical structure consistent ensures data consistency and integrity: in the new data table, the meaning, constraints, and relationships of the data are the same as in the original data table, which avoids data errors or inconsistencies caused by structural mismatches and improves the accuracy of data cleaning. Because the new data table has the same structure as the original one, the data is easier to maintain and manage during migration, backup, or recovery, improving the accuracy of data migration.
Temporary data tables are used to store intermediate results when complex query operations are performed. Because the temporary data table is consistent with the original data table structure, the query process is smoother, and the problem that the data structure is frequently converted or the data types are not matched in the query process is avoided. By storing the intermediate results in the temporary data table, the overhead of repeated computation and data conversion can be reduced, thereby improving query performance and accuracy of data processing.
Specifically, as a possible implementation manner, the step S20 of creating a new data table and a temporary data table based on the original data table may include:
First, the logical structure of the original data table is determined by querying the database's metadata or information schema. The query statement depends on the database type; for example, if the database type is MySQL, the SHOW COLUMNS FROM command queries the metadata or information schema to determine the logical structure of the original data table.
Further, the SQL statement is written according to the logical structure of the original data table. After determining the data structure logic of the original data TABLE, a new data TABLE and a temporary data TABLE are created by writing a CREATE TABLE statement. The same column name and data type as the original data table should be included in this statement, but no index definition is included, no index is created in the newly created new data table and temporary data table.
It should be noted that the CREATE TABLE statement is a statement for creating a new data TABLE in the structured query language, and may select a data type and a constraint according to actual needs, and may specify a storage engine and a character set of the data TABLE. The use of the CREATE TABLE statement needs to ensure that the correct database has been selected and that there is sufficient authority to CREATE a new data TABLE in the database.
Further, SQL statements are executed, creating new and temporary data tables. The CREATE TABLE statement is executed in the database management tool or command line to CREATE a new data TABLE and temporary data TABLE.
Finally, the logical structures of the new data table and the temporary data table are verified. In order to ensure that the logic structures of the new data table and the temporary data table are consistent with the original data table, determining the logic structures of the new data table and the temporary data table according to the method for determining the logic structures of the original data table, comparing the logic structures of the new data table and the temporary data table obtained by inquiry, and if the logic structures of the three data tables are inconsistent, returning to execute the steps of determining the logic structures of the original data table, writing SQL sentences and executing the SQL sentences until the logic structures of the three data tables are consistent.
It will be appreciated that different data management systems may have different syntax or commands to view the data table structure and create new data tables, and that adjustments to query statements and commands may be required during specific creation depending on the particular data management system being used.
In this embodiment, first, by determining the logical structure of the original data table, the SQL statement is written and executed according to the logical structure of the original data table, and a new data table and a temporary data table which are consistent with the logical structure of the original data table are obtained. And finally, verifying the logic structures of the new data table and the temporary data table, and ensuring the consistency and the integrity of the data. Therefore, the problem of data errors or inconsistency caused by inconsistent structures is avoided, and the accuracy of the follow-up data cleaning operation is improved.
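Step S20 can be sketched as follows. This illustration again uses sqlite3, where PRAGMA table_info plays the role that SHOW COLUMNS FROM plays in MySQL; the table names are hypothetical. Per the description, the clones carry no extra index definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, flag INTEGER DEFAULT 0)")

def columns(table):
    # (name, type) pairs describe the table's logical structure.
    return [(row[1], row[2]) for row in cur.execute(f"PRAGMA table_info({table})")]

# Create the new data table and the temporary data table with the same
# column names and data types as the original, but without extra indexes.
for clone in ("orders_new", "orders_tmp"):
    cur.execute(f"CREATE TABLE {clone} (id INTEGER PRIMARY KEY, status TEXT, flag INTEGER DEFAULT 0)")

# Verify that the three logical structures agree before proceeding.
match = columns("orders") == columns("orders_new") == columns("orders_tmp")
print(match)  # True
```

If the verification fails, the description's procedure is repeated: re-derive the original structure, rewrite the CREATE TABLE statement, and execute it again until all three structures agree.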
The above is only one possible implementation of step S20 provided in this embodiment, and the specific implementation of step S20 in this embodiment is not specifically limited.
Step S30, obtaining an index field of the original data table based on the data to be cleaned;
an index field is a special data structure in a database for locating data quickly, a field that can uniquely identify the data or group the data in some specific logic. For example, the "region" and "age" are index fields that hold customer data within a particular region or age range in a data table containing customer information. The index field allows the data management system to find the desired data row without having to scan the entire data table, thereby significantly improving query performance. For large data tables, without the appropriate index field, the query operation may become very slow.
The index field is determined according to the data to be cleaned, and the query frequency, the data memory size and the data update frequency need to be considered. For example: large fields with large data memory, such as text and binary data, are often unsuitable as index fields, taking up a lot of disk space and reducing index efficiency. As another example, frequently updated fields are not suitable as index fields, and each update may trigger a maintenance operation of the index, thereby degrading operation performance. After determining the index field of the original data table, the performance of the index needs to be maintained and monitored periodically to ensure that the index field remains efficient as the data changes.
It should be noted that the data to be cleaned itself does not directly determine which fields should be indexed, and more so, provides a guarantee on data quality for the subsequent indexes. Database management tools or programming languages may be used to automatically analyze query patterns and data distributions to recommend appropriate index fields. However, the final determination of the index field still needs to be done based on business requirements, and the database management tool or programming language may not fully understand the business meaning of the data and the complexity of the query.
Step S40, screening the data in the original data table according to the index field, and transferring the screened data to the new data table;
A database query statement is written according to the selected index field; its form depends on the database type. For example, if the database type is MySQL, the query statement should include a Select clause specifying the fields to be migrated, a From clause specifying the original data table, and a Where clause setting the screening conditions. Executing the query screens the original data table, selecting all data that satisfies the screening conditions, and the selected data is inserted into the new data table.
The index field is used for screening the data in the original data table, so that the data migration efficiency can be improved, the data which should be migrated to the new data table can be accurately controlled, unnecessary data migration can be avoided, the use of storage space is reduced, the data in the new data table is ensured to be accurate and complete, and the problems of high fragments and high water level after data cleaning are avoided. The migration requirements of different data can be adapted by modifying the index field, and the flexibility and the expandability are high.
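The screening-and-migration step can be sketched as a single INSERT ... SELECT, shown here with sqlite3; the table and column names are hypothetical, and the mark field "flag" serves as the index field (flag = 0 meaning "to be retained").

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, flag INTEGER DEFAULT 0);
CREATE TABLE orders_new (id INTEGER PRIMARY KEY, status TEXT, flag INTEGER DEFAULT 0);
INSERT INTO orders (status, flag) VALUES ('active', 0), ('obsolete', 1), ('active', 0);
CREATE INDEX idx_orders_flag ON orders (flag);  -- index field used for screening
""")

# Screen the original table on the index field and migrate the retained rows;
# the index lets the WHERE clause locate rows without a full table scan.
cur.execute("INSERT INTO orders_new SELECT * FROM orders WHERE flag = 0")
conn.commit()

kept = cur.execute("SELECT COUNT(*) FROM orders_new").fetchone()[0]
print(kept)  # 2
```

Because only rows passing the screening condition are copied, the new table is written contiguously, which is what avoids the fragmentation and high-water-level problems of in-place deletion.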
And S50, carrying out data updating on the new data table by utilizing the index field and the temporary data table to obtain a final result table.
After the data to be retained has been migrated to the new data table, updates made to the original data table could leave the new data table stale or cause data loss. Similarly, when data is cleaned directly in the original data table, updates applied after the cleaning operation completes can also be lost. Therefore, to avoid losing data after an update and to maintain the integrity of the data in the table, the new data table is updated with the help of the index field and the temporary data table.
The temporary data table is consistent with the logical structure of the original data table and the new data table, and the index field is not only used for screening data, but also used for matching records in the updating process. This means that the temporary data table in the new data table should contain an index field in order to be able to correctly relate the information in the temporary data table to the corresponding record in the new data table. Therefore, an index field consistent with the original data table is required to be established on the new data table, and the index of the new data table is established according to the index of the original data table, so that the data of the new data table is updated, and the data is not lost.
And updating the data in the temporary data table into the corresponding record of the new data table to obtain a final result table. The final result table is a data table after the data cleaning operation is completed, and contains data that is screened, migrated, and possibly further processed or updated from the original data table. The final result table is the final output of the data cleaning or updating process, reflecting the final state of the required data.
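The update step, in which temp-table rows are matched to the corresponding records of the new table via the index field, can be sketched as follows. The sqlite3 tables and the choice of "id" as the matching field are hypothetical stand-ins for the document's index field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE orders_new (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE orders_tmp (id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO orders_new VALUES (1, 'stale'), (2, 'active');
-- a row updated concurrently during migration, captured in the temp table
INSERT INTO orders_tmp VALUES (1, 'shipped');
""")

# Apply temp-table rows to the matching records of the new table
# (matched on the index field), producing the final result table.
cur.execute("""
UPDATE orders_new
SET status = (SELECT status FROM orders_tmp WHERE orders_tmp.id = orders_new.id)
WHERE id IN (SELECT id FROM orders_tmp)
""")
conn.commit()

final = cur.execute("SELECT status FROM orders_new ORDER BY id").fetchall()
print(final)  # [('shipped',), ('active',)]
```

The WHERE id IN (...) guard ensures records without a temp-table counterpart keep their migrated values, so no data is lost in the merge.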
This embodiment provides a data cleaning method. Obtaining the original data table and the data to be cleaned in it makes the cleaning target explicit, so the data to be cleaned can be accurately identified and processed in subsequent steps and unnecessary misoperations are avoided. A new data table and a temporary data table are established from the original data table; the data in the original data table is screened using an index field derived from the data to be cleaned, and the screened data is migrated to the new data table; finally, the new data table is updated using the index field and the temporary data table to obtain the final result table. Screening through the index field improves the accuracy and efficiency of data migration, and updating the new data table with the temporary data table and the index field ensures the integrity and accuracy of the data. By establishing the new table and the temporary table, the data to be cleaned in the original data table is accurately and effectively replaced or corrected without stopping the system, data loss is avoided, and the data in the data table is accurately cleaned without system downtime.
Based on the first embodiment of the present application, a second embodiment of the present application is proposed. In the second embodiment of the present application, the same or similar content as in the first embodiment may be referred to the description above, and will not be repeated.
In this embodiment, the step S30 of obtaining the index field of the original data table based on the data to be cleaned may include steps S301 to S302;
step S301, establishing a combined index of the original data table according to the data to be cleaned;
the combined index is an index containing multiple columns. In a data management system, a combined index allows the database query optimizer to use multiple columns of the index to improve query performance. A combined index differs from a single-column index, which contains index information for only one column.
The creation of the combined index is to meet the needs of queries that are often based on multiple columns. When the query conditions contain columns in the combined index, the data management system can more efficiently locate the desired data, thereby significantly increasing query speed.
The data can be more effectively positioned by establishing the combined index of the original data table according to the data to be cleaned, and the data query speed is improved, so that the data migration speed is improved. The creation of a combined index from the data to be cleaned first requires the selection of columns, which often occur together in the query condition as part of the combined index. Secondly, the order of columns needs to be considered, which affects the efficiency and usability of the index. Typically, those columns for equivalence comparison are placed in front and those for range queries are placed in back. Finally, maintenance of the combined index needs to be considered. As with single column indexes, combined indexes also require maintenance. When the data in the data table changes, the index also needs to be updated accordingly.
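The equality-first, range-last column ordering described above can be checked against the optimizer's plan. The sketch below is an illustration with assumed names: a table t1 whose tag column is queried by equality and whose ts column is queried by range:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, tag TEXT, ts INTEGER)")

# Equality column (tag) first, range column (ts) last, per the ordering guidance.
cur.execute("CREATE INDEX idx_tag_ts ON t1 (tag, ts)")

# A query filtering on tag with '=' and on ts with a range can use the index fully;
# the plan should mention a SEARCH ... USING INDEX idx_tag_ts.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t1 WHERE tag = ? AND ts < ?",
    ("batch-1", 1000),
).fetchall()
print(plan)
```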
Step S302, determining an index field based on the combined index.
The process of determining the index fields from the built combined index is essentially based on factors such as the requirements of the database query, the structure of the data table, and the data distribution, analyzing and determining which fields should be combined together to form a composite index.
Based on the combined index, the data distribution and selectivity of the fields are analyzed. Fields with high selectivity (i.e., fields whose values differ widely) are more valuable for indexing, as they help the data management system locate data more accurately. The length of the index and its maintenance cost must also be weighed: an excessively long index increases storage consumption and the overhead of index maintenance. Thus, when selecting the index field, the length of the index needs to be weighed against the expected performance improvement.
After weighing the index-field analysis and maintenance costs, the index field of the original data table is determined to be (tag field, timestamp), where the tag field is a distinguishing identifier and the timestamp records the modification time of the data. The timestamp covers different time nodes; by marking different time nodes, the data to be cleaned is marked and cleaned in small batches, minimizing the impact on real-time transactions during marking.
In the embodiment, by establishing the combined index, the query performance related to the marked field and the timestamp field is obviously improved, the query time and the full table scanning of the database on the data in the query process are reduced, the I/O operation and the CPU utilization rate are reduced, and the use of system resources is optimized. And determining the index field according to the combined index to accelerate data positioning, optimizing data cleaning operation involving a plurality of fields, ensuring accuracy and consistency of data cleaning, and improving quality of data cleaning.
A third embodiment of the present application is presented based on the first and/or second embodiment of the present application. In the third embodiment of the present application, the same or similar contents as those of the first and second embodiments may be referred to the above description, and will not be repeated.
On this basis, please refer to fig. 2, fig. 2 is a flow chart of a third embodiment of the data cleaning method according to the present application.
In this embodiment, the index field includes a tag field, a first timestamp, and a second timestamp, and the step S40 of screening the data in the original data table according to the index field and migrating the screened data to the new data table includes steps S401 to S404:
step S401, screening the data in the original data table according to the mark field and the first timestamp to obtain screened first data;
the screening condition is determined from the specific value of the tag field and the specific value of the first timestamp: the tag field equals its specific value and the data timestamp is earlier than the first timestamp. Specifying the screening conditions explicitly ensures that the data selected from the original data table is accurate and meets the service requirements, which helps avoid data errors or inconsistencies.
Further, a query statement is written according to the screening conditions, and data meeting them is selected from the original data table. The general structure of a query statement includes: a Select clause listing the selected fields, a From clause specifying the source data table, and a Where clause specifying the screening conditions. The first screening condition, tag field = 'specific value of the tag field', ensures that the value of the tag field in the selected data rows matches the given specific value. The second screening condition, data timestamp < 'specific value of the first timestamp', ensures that the selected data rows have timestamps earlier than the given first timestamp. By explicitly specifying the screening conditions, the query statement can be precisely targeted at the data rows that meet the particular conditions. This avoids returning unnecessary data, thereby ensuring the accuracy and relevance of the data cleaning.
Finally, the query statement is executed. The written query statement is submitted for execution; the data management system traverses the original data table, matches rows against the specified screening conditions, and returns all data rows that satisfy them. By limiting the range of returned data through the screening conditions, unnecessary data transmission and processing are reduced, saving computing resources and network bandwidth.
It should be noted that, in the query statement, it is necessary to ensure that the format of the timestamp matches the format stored in the original data table. If the timestamp format does not match, the query may return an erroneous result or may not be executed at all.
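A sketch of the step S401 screening query, using parameter placeholders (which also sidestep timestamp-format mismatches for numeric epoch values); the table name t1, the column names, and the screening values are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, tag TEXT, ts INTEGER, payload TEXT)")
cur.executemany("INSERT INTO t1 (tag, ts, payload) VALUES (?, ?, ?)",
                [("keep", 100, "a"), ("keep", 250, "b"), ("drop", 50, "c")])

tag_value, first_ts = "keep", 200   # hypothetical screening values

# SELECT ... FROM ... WHERE with both screening conditions:
# tag equals its specific value AND timestamp is earlier than the first timestamp.
first_data = cur.execute(
    "SELECT id, tag, ts, payload FROM t1 WHERE tag = ? AND ts < ?",
    (tag_value, first_ts),
).fetchall()
print(first_data)  # only the row with tag='keep' and ts earlier than first_ts
```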
Step S402, based on the original data table, migrating the first data to the new data table;
Because the original data table and the new data table have the same logical structure, the first data can be migrated to the new data table accurately, preventing data conversion errors or data loss during migration. After the data is migrated, the application program can obtain the data from the new data table, ensuring that it continues to work normally.
The first data is migrated to the new data table concurrently. Concurrent migration means using multiple threads or processes to handle several data migration tasks at the same time, which greatly improves migration efficiency. Concurrent migration of the first data first requires writing a function or migration procedure that migrates a single piece of data from the original data table to the new data table.
Further, the written functions or migration procedures are executed concurrently to achieve concurrent migration of the data in the original data table. For a data management system with a built-in concurrent-processing mechanism, executing the statements of that mechanism is sufficient. For systems without one, the migration tasks can be executed concurrently using a multi-thread or multi-process library, with each thread or process handling a portion of the tasks.
Finally, error handling and logging mechanisms are established so that problems can be tracked and repaired when they occur. Based on the logged migration progress and performance, parameters such as the degree of concurrency and the batch size can be adjusted to achieve the best migration speed and resource utilization.
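The batch-and-concurrency pattern can be sketched as follows. This is a simplified illustration under stated assumptions: the table names are hypothetical, a single shared SQLite connection is serialized with a lock, and a production system would typically give each worker its own connection and tune the batch size:

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

conn = sqlite3.connect(":memory:", check_same_thread=False)
lock = threading.Lock()  # serialize access to the single shared connection
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, tag TEXT, ts INTEGER)")
conn.execute("CREATE TABLE t1_new (id INTEGER PRIMARY KEY, tag TEXT, ts INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, 'keep', ?)",
                 [(i, i * 10) for i in range(1, 101)])

def migrate_batch(lo, hi):
    """Migrate one id range; the except clause is the error-handling/log hook."""
    try:
        with lock:
            conn.execute(
                "INSERT INTO t1_new SELECT * FROM t1 "
                "WHERE tag = 'keep' AND id >= ? AND id < ?",
                (lo, hi))
            conn.commit()
    except sqlite3.Error as exc:
        print(f"batch [{lo},{hi}) failed: {exc}")  # would be logged for retry/repair

batch = 25  # batch size: a tunable parameter per the text above
with ThreadPoolExecutor(max_workers=4) as pool:
    for lo in range(1, 101, batch):
        pool.submit(migrate_batch, lo, lo + batch)

count = conn.execute("SELECT COUNT(*) FROM t1_new").fetchone()[0]
print(count)  # 100
```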
Step S403, screening the data in the original data table according to the tag field, the first timestamp and the second timestamp to obtain screened second data;
the screening condition is determined from the specific values of the tag field, the first timestamp, and the second timestamp: the tag field equals its specific value and the data timestamp is earlier than the second timestamp and greater than or equal to the first timestamp. Specifying the screening conditions explicitly ensures that the data selected from the original data table is accurate and meets the service requirements, which helps avoid data errors or inconsistencies.
Further, a query statement is written according to the screening conditions, and data meeting them is selected from the original data table. The general structure of a query statement includes: a Select clause listing the selected fields, a From clause specifying the source data table, and a Where clause specifying the screening conditions. The first screening condition, tag field = 'specific value of the tag field', ensures that the value of the tag field in the selected data rows matches the given specific value. The second screening condition, 'specific value of the first timestamp' <= data timestamp < 'specific value of the second timestamp', ensures that the selected data rows have timestamps earlier than the given second timestamp and greater than or equal to the first timestamp. By explicitly specifying the screening conditions, the query statement can be precisely targeted at the data rows that meet the particular conditions. This avoids returning unnecessary data, thereby ensuring the accuracy and relevance of the data cleaning.
Finally, the query statement is executed. The written query statement is submitted for execution; the data management system traverses the original data table, matches rows against the specified screening conditions, and returns all data rows that satisfy them. By limiting the range of returned data through the screening conditions, unnecessary data transmission and processing are reduced, saving computing resources and network bandwidth.
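The second screening differs from the first only in its half-open timestamp range [first timestamp, second timestamp), which guarantees that the two batches never overlap; a sketch with assumed table name and values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, tag TEXT, ts INTEGER)")
cur.executemany("INSERT INTO t1 (tag, ts) VALUES (?, ?)",
                [("keep", 150), ("keep", 250), ("keep", 350), ("drop", 260)])

tag_value, first_ts, second_ts = "keep", 200, 300   # hypothetical values

# Step S403: first_ts <= ts < second_ts, so a row migrated in the first batch
# (ts < first_ts) can never be selected again here.
second_data = cur.execute(
    "SELECT tag, ts FROM t1 WHERE tag = ? AND ts >= ? AND ts < ?",
    (tag_value, first_ts, second_ts),
).fetchall()
print(second_data)  # [('keep', 250)]
```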
Step S404, based on the original data table, migrating the second data to the new data table.
Because the original data table and the new data table have the same logical structure, the second data can be migrated to the new data table accurately, preventing data conversion errors or data loss during migration. After the data is migrated, the application program can obtain the data from the new data table, ensuring that it continues to work normally.
The second data is migrated to the new data table concurrently. Concurrent migration means using multiple threads or processes to handle several data migration tasks at the same time, which greatly improves migration efficiency. Concurrent migration of the second data first requires writing a function or migration procedure that migrates a single piece of data from the original data table to the new data table.
Further, the written functions or migration procedures are executed concurrently to achieve concurrent migration of the data in the original data table. For a data management system with a built-in concurrent-processing mechanism, executing the statements of that mechanism is sufficient. For systems without one, the migration tasks can be executed concurrently using a multi-thread or multi-process library, with each thread or process handling a portion of the tasks.
Finally, error handling and logging mechanisms are established so that problems can be tracked and repaired when they occur. Based on the logged migration progress and performance, parameters such as the degree of concurrency and the batch size can be adjusted to achieve the best migration speed and resource utilization.
In this embodiment, the data to be retained is filtered according to the tag field and the first timestamp, the obtained first data is migrated to the new data table, and then the data to be retained is continuously filtered according to the tag field, the first timestamp and the second timestamp, and the second data is concurrently migrated to the new data table. The mark field and different time stamps are utilized to carry out batch and small-batch migration of data, so that the memory, CPU and network resources required by single migration operation are obviously reduced, and migration failure or performance degradation caused by insufficient resources are avoided. Because the migration is performed in batches, the locking time of the original data table can be reduced to the greatest extent, and therefore the influence on business operation is reduced. In addition, if a problem occurs in the migration process, the state can be rolled back to the previous state more easily, and the interference on service continuity is reduced. Meanwhile, batch migration allows finer control of the migration process, and migration of each batch of data according to preset conditions and rules can be ensured, so that accuracy of data cleaning is improved.
The fourth embodiment of the present application is proposed based on the above-described embodiment of the present application. In the fourth embodiment of the present application, the same or similar contents as those of the above-described embodiments can be referred to the above description, and the description thereof will not be repeated.
On this basis, please refer to fig. 3, fig. 3 is a flowchart illustrating a fourth embodiment of the data cleaning method according to the present application.
In this embodiment, the step S50 of updating the new data table according to the index field and the temporary data table to obtain the final result table may include steps S501 to S505:
step S501, performing a data update operation on the new data table based on a preset data update instruction;
In a distributed system or data warehouse, data update operations are key steps in achieving data synchronization and integration. Updating the new data table based on the preset instruction ensures the consistency of data among different data sources, eliminates data redundancy and inconsistency, and improves the quality and usability of the data. The data update operation is performed on the new data table according to the preset data update instruction, ensuring consistency between the new data table and the original data table, which may itself have undergone data updates.
Step S502, inserting the data before the data updating operation of the new data table into the temporary data table;
In order to prevent data in the new data table from being lost during the update process, the temporary data table is used to avoid data loss, further ensuring accurate cleaning of the data. The new data table has the same logical structure as the temporary data table, so the temporary data table can store the data of the new data table. Before the data update operation is performed, the current data in the new data table is inserted into the temporary data table through a data-copy statement. After the data of the new data table is safely stored in the temporary data table, the preset data update instruction is executed on the new data table to perform the data update operation.
After the update operation is completed, it is first verified whether the updated data in the new data table meets expectations, and the temporary data table is retained as a backup for future reference or data restoration. Secondly, the data in the temporary data table is verified to be consistent with the data in the new data table before the update operation, so that if an error occurs during the update or the result does not meet expectations, the state of the new data table can be restored from the temporary data table, avoiding data loss or damage caused by an erroneous update operation.
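A minimal sketch of the copy-then-update sequence of steps S502 and S501, with assumed table names and a hypothetical update instruction, showing that the temporary table preserves the pre-update state for verification or restoration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1_new (id INTEGER PRIMARY KEY, tag TEXT, val TEXT)")
cur.execute("CREATE TABLE t1_temp (id INTEGER PRIMARY KEY, tag TEXT, val TEXT)")
cur.execute("INSERT INTO t1_new VALUES (1, 'keep', 'old')")

# Step S502: copy the pre-update state of the new table into the temporary table.
cur.execute("INSERT INTO t1_temp SELECT * FROM t1_new")

# Step S501: apply the (hypothetical) preset data update instruction.
cur.execute("UPDATE t1_new SET val = 'new' WHERE id = 1")

# Verification: the temporary table still holds the pre-update value, so the
# new table's state could be restored if the update turned out to be wrong.
before = cur.execute("SELECT val FROM t1_temp WHERE id = 1").fetchone()[0]
after = cur.execute("SELECT val FROM t1_new WHERE id = 1").fetchone()[0]
print(before, after)  # old new
```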
Step S503, performing data screening on the original data table based on the tag field and the second timestamp;
The data in the new data table is migrated from the original data table, and update loss can occur after the new data table is updated. The reason is that data to be retained may be updated in the original data table after it has been migrated to the new data table, while the corresponding records in the new data table are not updated, so the update is lost. Therefore, the data updated in the original data table must be screened according to the tag field and the timestamp, and the update-loss problem resolved according to the screening result.
First, the screening condition is determined from the tag field and the second timestamp: tag field = 'specific value of the tag field' and data timestamp >= 'specific value of the second timestamp'. This condition screens out the data that was updated in the original data table after the data to be retained had been migrated to the new data table.
Further, a query statement is written according to the screening conditions, and the updated data meeting them is queried from the original data table for subsequent processing. The general structure of a query statement includes: a Select clause listing the selected fields, a From clause specifying the source data table, and a Where clause specifying the screening conditions. The first screening condition, tag field = 'specific value of the tag field', ensures that the value of the tag field in the selected data rows matches the given specific value. The second screening condition, data timestamp >= 'specific value of the second timestamp', screens out the data updated in the original data table after the data to be retained was migrated to the new data table.
Step S504, determining final data by using the original data table, the temporary data table and the new data table according to the result of data screening of the original data table;
After the retained data is migrated to the new data table, the original data table may be updated. There are two cases. In the first case, the updated data in the original data table does not need to be retained; since the data to be retained has already been migrated to the new data table, the update does not cause any loss of that data. In the second case, the updated data is data to be retained; because the migration is already complete, the update performed on the original data table is lost from the new data table.
It should be noted that, after the data to be retained is migrated to the new data table, the update performed on the original data table may include deleting, modifying, or inserting data. Different update operations lead to different final data.
In order to distinguish the two cases and the specific update operations, ensure accuracy during the data cleaning process, and obtain a precise cleaning result, the final data must be determined using the original data table, the temporary data table, and the new data table according to the result of screening the original data table.
Specifically, as a possible implementation manner, the step S504 of determining final data using the original data table, the temporary data table, and the new data table according to the result of the data screening on the original data table may include: first, the original data table is compared with the temporary data table by the data processing tool, and the difference between the two data tables is identified. Further, after the data difference between the temporary data table and the original data table is identified, the new data table is combined for comprehensive judgment. The new data table contains data migrated from the original data table, and the logic of the final data is determined by analyzing the relationship between the new data table and the temporary data table. And finally, executing corresponding data updating operation according to the determined final data logic.
In this embodiment, the final data is determined by combining the comparison and analysis of multiple data sources, rather than simply relying on a single data screening operation. The final data is determined more flexibly and comprehensively, and the complex data processing requirements are better adapted.
The above is only one possible implementation of step S504 provided in this embodiment, and the specific implementation of step S504 in this embodiment is not specifically limited.
Step S505, merging the final data into the new data table to obtain a final result table.
After the final data is determined, it is updated into the new data table according to the primary key; the merged new data table is the final result table. The data it contains is the accurately cleaned data. This eliminates the high water level and high fragmentation rate caused by cleaning data directly on the original data table, solves the data loss caused by data updates, and yields accurately cleaned data.
It should be noted that the primary key (PRIMARY KEY) is a field or combination of fields that uniquely identifies each row in a table. Because each primary key value is unique and duplicates are not allowed, data redundancy and repetition are avoided and the correctness and consistency of the data are ensured.
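The primary-key merge of step S505 can be sketched with SQLite's upsert syntax (INSERT ... ON CONFLICT ... DO UPDATE): existing keys are overwritten, new keys are inserted, and the key's uniqueness prevents duplicate records. The table name and rows are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1_new (id INTEGER PRIMARY KEY, tag TEXT, val TEXT)")
cur.executemany("INSERT INTO t1_new VALUES (?, ?, ?)",
                [(1, "keep", "stale"), (2, "keep", "ok")])

# Hypothetical final data determined in step S504.
final_data = [(1, "keep", "fresh"), (3, "keep", "late-insert")]

# Merge by primary key: id 1 conflicts and is overwritten, id 3 is inserted.
cur.executemany(
    "INSERT INTO t1_new VALUES (?, ?, ?) "
    "ON CONFLICT(id) DO UPDATE SET tag = excluded.tag, val = excluded.val",
    final_data)

result = cur.execute("SELECT id, val FROM t1_new ORDER BY id").fetchall()
print(result)  # [(1, 'fresh'), (2, 'ok'), (3, 'late-insert')]
```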
In this embodiment, the data lines that need to be reserved or deleted are precisely identified by the tag field. In combination with the second timestamp, the screening conditions are further refined to ensure that only data that meets a particular time range or state is screened. Setting screening conditions through the mark field and the second timestamp helps to avoid deleting important data by mistake or keeping outdated data, so that the accuracy of the data is improved. According to the result of data screening of the original data table, the distinction of different conditions after the data to be reserved are migrated to the new data table is realized, and the accurate cleaning of the data in the original data table is ensured.
A fifth embodiment of the present application is presented based on the above-described embodiment of the present application. In the fifth embodiment of the present application, the same or similar contents as those of the above-described embodiments can be referred to the above description, and the description thereof will not be repeated.
In this embodiment, according to the result of data screening on the original data table, the step S504 of determining the final data by using the original data table, the temporary data table and the new data table may include steps S1 to S2:
step S1, if the data of the original data table are not screened, determining final data according to the new data table;
As described above, in the first case the update of the original data table does not cause loss of the data to be retained, because that data has already been migrated to the new data table. In the second case the updated data is data to be retained, and because migration is already complete, the update performed on the original data table is lost from the new data table. The update performed on the original data table after the retained data is migrated differs from case to case, and so does the final data; further judgment is therefore required according to the screening result.
If no data of the original data table is screened out, it indicates that the original data table contains no data with timestamps at or after the second timestamp; this can happen when the specific update operation was, for example, a deletion. In this case no data in the new data table has been lost, the new data table itself is the final result table, and no additional processing of it is required.
Step S2, if data of the original data table is screened out, querying the screened original data table data in the temporary data table and the new data table.
If data with timestamps at or after the second timestamp is screened out, update operations such as insertion or modification were performed on the original data table after the migration of the second data was completed, which is why such data appears. If the updated data does not need to be retained, the first case applies: the update of the original data table causes no loss of the data to be retained. If the updated data is data to be retained, the second case applies: because migration is already complete, the update performed on the original data table is lost from the new data table. The two cases are further distinguished according to the screened data of the original data table, with the help of the temporary data table, to achieve accurate data cleaning.
Specifically, as a possible implementation, if data of the original data table is screened out, the step S2 of querying the screened data in the temporary data table and the new data table may include the following. First, according to the range and specific values of the data screened from the original data table, the temporary data table is queried through a query statement, and the new data table is queried likewise. Further, after the screened data is queried in the temporary data table and the new data table, the consistency of the results needs to be compared and verified, ensuring that the query results match the screened data in the original data table with no omissions or errors. If inconsistent or erroneous data is found during the query, corresponding processing is required, including data cleansing, data correction, and further data verification. Finally, the timestamps and primary keys of the data found in the temporary data table and the new data table are compared to determine the final data.
The above is only one possible implementation of step S2 provided in this embodiment, and the specific implementation of step S2 in this embodiment is not specifically limited.
In this embodiment, specific data updating conditions are distinguished according to the result of data screening on the original data table, and accurate cleaning of data in the original data table is ensured by means of the temporary data table, where the temporary data table can be used for storing intermediate results, temporarily storing data to be updated, or recording a state in the cleaning process. Frequent read-write operation is avoided directly on the original data table, the I/O burden of the database is reduced, and the data processing efficiency is improved.
The sixth embodiment of the present application is presented based on the above-described embodiment of the present application. In the sixth embodiment of the present application, the same or similar contents as those of the above-described embodiments can be referred to the above description, and the description thereof will not be repeated.
In this embodiment, if the data of the original data table is screened, step S2 of querying the screened data of the original data table in the temporary data table and the new data table may include steps A1 to A3:
Step A1, performing a data query on the temporary data table based on the screened original data table data;
The screened data of the original data table is queried in the temporary data table through query statements. Unlike the query statements that screen a data table by screening conditions, these statements must specify the correct table and column names and look up, in the temporary data table, the data consistent with the screened original data table data. To ensure that the returned rows are consistent with the screened data, the amount of returned data, its specific values, and whether the required columns are included can be checked.
Step A2, if the screened original data table data is not queried, taking the screened original data table data as final data;
Since the data in the temporary data table is the data inserted before the update operation on the new data table, the screened original data table data is looked up in the temporary data table. If a screened row is not found in the temporary data table, that data has not been updated and is retained. The new data table is consistent with the records in the original data table, because its data was migrated concurrently from the original data table; therefore the screened original data table data is taken as the final data, guarding against errors in the concurrent migration and ensuring the accuracy of the final data.
And step A3, if the screened original data table data is queried, comparing the original data table, the temporary data table and the new data table field by field to obtain final data.
The screened data is traversed in the temporary data table; if a screened original data table row is queried in the temporary data table, that row has been updated, and the corresponding data in the new data table needs to be updated.
It can be understood that more than one row of original data table data may be obtained after screening, in which case the temporary data table is searched once for each screened row. After a matching row is found in the temporary data table, it must be confirmed whether that row can serve as final data before it is merged into the new data table.
A field is the smallest unit of data and corresponds to one column in a data table. Each field describes one feature of a record, namely a data item, and carries a unique field identifier for computer identification. In the original data table, each field represents a particular attribute or feature.
Thus, by comparing the fields of the data in the temporary data table, the new data table, and the original data table one by one, it can be determined whether the data items are identical in the different data tables, thereby determining the final data.
Specifically, as one possible implementation, if the screened original data table data is queried, step A3 of comparing the original data table, the temporary data table and the new data table field by field to obtain final data may include the following. First, names are assigned to the field values of corresponding rows in the three tables: the field value in the original data table is denoted T1_TEMP_F, the field value in the temporary data table is denoted T3_F, and the field value in the new data table is denoted T1_F. The field values in the three data tables are then compared: if T1_F and T1_TEMP_F are the same, the field value is T1_F; if T1_F and T1_TEMP_F differ, T1_F is compared with T3_F; if T1_F and T3_F are the same, the field value is T1_TEMP_F; otherwise, the field value is T1_F. Finally, the final data is assembled from the field values obtained by this field-by-field comparison.
The above is only one possible implementation of step A3; the specific implementation of step A3 is not limited in this embodiment.
In this embodiment, after the screened original data table data is queried, a field-level comparison is performed among the temporary data table, the new data table, and the corresponding data in the original data table to determine the final data. Naming the fields of each table makes the comparison easier to express. This fine-grained comparison ensures the accuracy and completeness of the final data, avoids data conflicts and inconsistencies, improves the efficiency and accuracy of data processing, and resolves the potential loss of updates to the new data table after a data update.
A seventh embodiment of the present application is presented based on the above-described embodiments. In the seventh embodiment, for content that is the same as or similar to the above embodiments, reference may be made to the description above, and it is not repeated here.
On this basis, referring to fig. 4, fig. 4 is a flow chart of a seventh embodiment of the data cleaning method according to the present application.
In this embodiment, after the step 505 of merging the final data into the new data table to obtain the final result table, the method further includes:
And step 506, deleting the original data table and the temporary data table.
Once the data in the original data table has been migrated to other data tables or systems, a final result table after accurate data cleaning is obtained. When the data migration is complete and the new data table or system operates stably, the original data table no longer serves any purpose, and deleting it releases system resources.
Temporary data tables are typically used to store temporary data or intermediate results, which are no longer needed after the data processing is completed. Deleting the temporary data table can keep the data management system tidy and avoid data confusion or misuse.
The original data table is deleted using a drop-table command provided by the data management system. Before this is performed, it should be ensured that the data has been backed up and that no other system or application is still using the table. The temporary data table is deleted using the same command or method; since the temporary data table is by nature temporary, its deletion is more direct and simple.
Further, after deleting the data tables, it should be verified that they have been successfully deleted, either by querying the list of tables in the database or through the interface of a database management tool. In addition to the data tables themselves, the logs, indexes, and metadata associated with them should be cleaned up to ensure the integrity and consistency of the database.
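As a minimal sketch (table names are hypothetical, with SQLite standing in for the data management system), dropping the tables and verifying the deletion against the database's table catalog might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1_temp (id INTEGER PRIMARY KEY)")  # renamed original table
cur.execute("CREATE TABLE t3 (id INTEGER PRIMARY KEY)")       # temporary table

# Delete the original and temporary tables once migration is complete.
cur.execute("DROP TABLE IF EXISTS t1_temp")
cur.execute("DROP TABLE IF EXISTS t3")

def table_exists(name):
    """Verify deletion by querying the database's table catalog."""
    cur.execute("SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,))
    return cur.fetchone() is not None

print(table_exists("t1_temp"))  # False -> successfully deleted
print(table_exists("t3"))       # False -> successfully deleted
```

In other database systems the catalog query would differ (for example, an information-schema view instead of `sqlite_master`), but the verify-after-drop pattern is the same.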
In this embodiment, the original data table and the temporary data table are deleted by a drop-table command provided by the data management system, so as to optimize data management, consolidate the data, keep the database clean, and release system resources for the next round of data cleaning.
Referring to fig. 5, fig. 5 is a schematic diagram of the flow for cleaning data of an original data table. The detailed flow of cleaning data by creating a new data table and using a temporary data table according to the present application is described below with reference to fig. 5.
In step A10, a new data table (hereinafter T2) is created by copying the logical structure of the original data table (hereinafter T1). The logical structures of the new data table and the original data table are kept identical, and no index is created on T2.
In step A20, the data to be cleaned on T1 is marked concurrently in small batches. After the concurrent marking operation, the mark field of the data to be retained holds the default value, and the mark field of the data to be cleaned holds the mark value.
In step A30, a combined index is created on T1, the index fields of the combined index being (mark field, timestamp). Because the data to be cleaned is marked concurrently in small batches, different timestamps appear in the index field.
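As an illustrative sketch (the column names `flag`, `ts`, and `val` are hypothetical, and SQLite stands in for the data management system), such a combined index on the mark field and timestamp might be created like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical original table T1 with a mark field and a timestamp column.
cur.execute("""
    CREATE TABLE t1 (
        id   INTEGER PRIMARY KEY,
        flag INTEGER DEFAULT 0,   -- 0 = keep (default value), 1 = to be cleaned
        ts   INTEGER,             -- timestamp set when the row was marked
        val  TEXT
    )
""")

# Combined index with index fields (mark field, timestamp), as in step A30.
cur.execute("CREATE INDEX idx_t1_flag_ts ON t1 (flag, ts)")

# Confirm the index now exists in the catalog.
cur.execute("SELECT name FROM sqlite_master WHERE type='index' AND name='idx_t1_flag_ts'")
print(cur.fetchone())  # ('idx_t1_flag_ts',)
```

Placing the mark field first in the composite index lets equality filtering on the mark combine with a range scan on the timestamp, which is what the screening queries in steps A40 and A50 rely on.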
In step A40, a timestamp time1 is recorded from the index field; the data to be retained is screened by the mark field and a data timestamp smaller than time1, and migrated concurrently to T2. This step involves a larger data volume and takes longer, and business traffic still reaches T1.
In step A50, a timestamp time2 is recorded from the index field; the data to be retained whose data timestamp falls in [time1, time2) is screened by the mark field and migrated to T2. This step involves less data and takes less time, and business traffic still reaches T1.
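A minimal sketch of this two-phase migration (all table and column names are hypothetical, and the timestamps are plain integers for illustration) could look as follows, where the bulk phase copies retained rows marked before time1 and the catch-up phase copies the [time1, time2) window:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, flag INTEGER, ts INTEGER, val TEXT)")
cur.execute("CREATE TABLE t2 (id INTEGER PRIMARY KEY, flag INTEGER, ts INTEGER, val TEXT)")

# flag 0 = keep, flag 1 = to be cleaned; ts is the marking timestamp.
cur.executemany("INSERT INTO t1 VALUES (?,?,?,?)",
                [(1, 0, 10, 'a'), (2, 1, 11, 'b'), (3, 0, 25, 'c'), (4, 0, 40, 'd')])

time1, time2 = 20, 30

# Step A40: bulk-migrate retained rows marked before time1.
cur.execute("INSERT INTO t2 SELECT * FROM t1 WHERE flag = 0 AND ts < ?", (time1,))

# Step A50: catch-up migration of retained rows marked in [time1, time2).
cur.execute("INSERT INTO t2 SELECT * FROM t1 WHERE flag = 0 AND ts >= ? AND ts < ?",
            (time1, time2))

cur.execute("SELECT id FROM t2 ORDER BY id")
print(cur.fetchall())  # rows 1 and 3 migrated; row 2 is cleaned, row 4 has ts >= time2
```

Row 4 is deliberately left behind here: rows with timestamps at or after time2 are the ones handled later by the screening in step A80.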
In step A60, a temporary data table T3 is created according to the logical structure of T1. For convenience of subsequent operation and to distinguish the tables, table T1 is renamed T1_TEMP and T2 is renamed T1, and an index is created on T1 following the combined index of T1_TEMP, with index fields (mark field, timestamp).
In step A70, an update operation is performed on all data on T1 through a preset data update instruction or a data update switch, and the data as it was before the update operation is inserted into table T3.
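One way to capture the pre-update rows into T3 — sketched here under the assumption that the database supports triggers, with hypothetical names and SQLite as the example engine — is a BEFORE UPDATE trigger that copies the old row image before any change lands:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE t3 (id INTEGER PRIMARY KEY, val TEXT)")  # temporary table
cur.execute("INSERT INTO t1 VALUES (1, 'old')")

# Before any row of T1 is updated, copy its pre-update image into T3.
# OR IGNORE keeps only the first (earliest) snapshot per primary key.
cur.execute("""
    CREATE TRIGGER snapshot_before_update
    BEFORE UPDATE ON t1
    BEGIN
        INSERT OR IGNORE INTO t3 VALUES (OLD.id, OLD.val);
    END
""")

cur.execute("UPDATE t1 SET val = 'new' WHERE id = 1")

cur.execute("SELECT val FROM t3 WHERE id = 1")
print(cur.fetchone())  # ('old',) -- the pre-update value is preserved in T3
```

Dropping the trigger would then correspond to turning off the data update switch in step A90, which stops further insertion into T3.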
In step A80, data on T1_TEMP is screened by the mark field and a data timestamp >= time2. If no data is screened out, step A90 is executed. If data is screened out, the screened rows are traversed, and T1 and T3 are each queried by the primary key of the data table. If a row cannot be queried on T3, the row was not updated after migration to the new table, and the traversed row is merged into T1. If the row is found on T3, the final data is determined from the three rows on T1, T3, and T1_TEMP.
Specifically, each such row is traversed and compared field by field. Assuming that the field values on T1, T3, and T1_TEMP are T1_F, T3_F, and T1_TEMP_F respectively: if T1_F and T1_TEMP_F are the same, the field value is T1_F; if T1_F and T1_TEMP_F differ, T1_F is compared with T3_F; if T1_F and T3_F are the same, the field value is T1_TEMP_F; otherwise, the field value is T1_F. After processing field by field, the final result for this row is obtained and updated into table T1 according to the primary key.
For ease of understanding, referring to fig. 6, fig. 6 is a schematic diagram of the field comparison when the temporary data table is used to update the new data table. For example, the data field values in T3 are a, b, c, d; the data field values in T1 are a, b, f, d; and the data field values in T1_TEMP are a, e, c, d. Following the field-by-field comparison, it is judged whether the field values in T1 and T1_TEMP are identical and, where they differ, whether the field values in T1 and T3 are identical; the resulting field values of T1 are a, e, f, d.
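The comparison rule above can be sketched as a small three-way merge function (the rule is taken from the description; the function name and list representation are illustrative only):

```python
def resolve_field(t1_f, t3_f, t1_temp_f):
    """Three-way merge of one field across the new table (T1), the
    pre-update snapshot (T3), and the original table (T1_TEMP).

    If T1 agrees with T1_TEMP, nothing diverged: keep T1.
    If they differ but T1 still equals the snapshot T3, the new table
    was not changed for this field, so the late-migrated original
    value T1_TEMP wins; otherwise the update on T1 wins.
    """
    if t1_f == t1_temp_f:
        return t1_f
    return t1_temp_f if t1_f == t3_f else t1_f

# Worked example from fig. 6: T3 = a,b,c,d; T1 = a,b,f,d; T1_TEMP = a,e,c,d.
t3 = ["a", "b", "c", "d"]
t1 = ["a", "b", "f", "d"]
t1_temp = ["a", "e", "c", "d"]

merged = [resolve_field(*fields) for fields in zip(t1, t3, t1_temp)]
print(merged)  # ['a', 'e', 'f', 'd']
```

The second field shows the interesting case: T1 and T1_TEMP disagree, but T1 still matches the snapshot T3, so the original table's value e is taken; the third field was updated on T1 after the snapshot, so the updated value f is kept.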
In step A90, if a data update switch is preset, the switch is turned off and insertion of data into T3 stops. If a data update instruction is preset, the system automatically stops updating data once the instruction has completed, so that insertion of data into T3 stops.
In step A100, table T1_TEMP and table T3 are deleted, and table T1 is the final result table. Deleting the temporary and original data tables optimizes data management, consolidates the data, keeps the database clean, and releases system resources for the next round of data cleaning.
It should be noted that the foregoing examples are only for understanding the present application, and are not to be construed as limiting the data cleaning method of the present application, and that many simple modifications based on the technical idea are within the scope of the present application.
The present application also provides a data cleaning device, referring to fig. 7, the data cleaning device includes:
the data table acquisition module 10 is used for acquiring a raw data table and data to be cleaned in the raw data table;
The original data table is the starting point of data cleaning and is usually stored in a database. The data management system must determine the storage location of the original data table and the corresponding access permissions. Once the storage location and access rights are specified, operations are performed using a specific query language to obtain the original data table, and the data to be cleaned is obtained from the result of query processing on the original data table.
A data table establishing module 20, configured to establish a new data table and a temporary data table based on the original data table;
According to the logic structure of the original data table, a new data table and a temporary data table which are consistent with the logic structure of the original data table are established, so that the problem of data errors or inconsistency caused by inconsistent structures is avoided, and the accuracy of data cleaning is improved. Temporary data tables are used to store intermediate results when complex query operations are performed. By storing the intermediate results in the temporary data table, the overhead of repeated computation and data conversion can be reduced, thereby improving query performance and accuracy of data processing.
An index field establishing module 30, configured to obtain an index field of the original data table based on the data to be cleaned;
An index is a special data structure in a database for locating data quickly; an index field is a field that can uniquely identify data or group it by some specific logic. The index field is determined from the data to be cleaned, taking into account query frequency, data size, and data update frequency.
A data screening module 40, configured to screen the data in the original data table according to the index field, and migrate the screened data to the new data table;
A database query statement is written according to the selected index field, with the statement adapted to the database type. Screening the data in the original data table by the index field improves the efficiency of data migration, precisely controls which data is migrated to the new data table, avoids unnecessary migration, reduces storage usage, ensures that the data in the new data table is accurate and complete, and avoids the high fragmentation rate and high water level that would otherwise remain after data cleaning.
The data updating module 50 is used for performing data updating on the new data table by using the index field and the temporary data table to obtain a final result table.
After the data to be retained is migrated to the new data table, if the original data table is updated, data in the new data table may be lost. Likewise, if data is cleaned directly in the original data table and the original data table is updated after the cleaning operation, data loss can also occur. Therefore, to avoid data loss after an update and to maintain the integrity of the data in the data table, the new data table is updated with the help of the index field and the temporary data table.
The data cleaning device provided by the application can solve the technical problem of data cleaning by adopting the data cleaning method in the embodiment. Compared with the prior art, the beneficial effects of the data cleaning device provided by the application are the same as those of the data cleaning method provided by the embodiment, and other technical features of the data cleaning device are the same as those disclosed by the method of the embodiment, so that the description is omitted.
The application provides a data cleaning device, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the data cleaning method in the first embodiment.
Referring now to FIG. 8, a schematic diagram of a data cleaning device for implementing an embodiment of the present application is shown. The data cleaning device in the embodiments of the present application may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (Personal Digital Assistant), a tablet computer (PAD), a PMP (Portable Media Player), and an in-vehicle terminal (e.g., an in-vehicle navigation terminal), as well as fixed terminals such as a digital TV and a desktop computer. The data cleaning device shown in fig. 8 is only an example and should not impose any limitation on the functionality and scope of use of embodiments of the present application.
As shown in fig. 8, the data cleaning device may include a processing device 1001 (e.g., a central processing unit, a graphics processor, etc.), which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1003 into a random access memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the data cleaning device. The processing device 1001, the ROM 1002, and the RAM 1004 are connected to each other by a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus. In general, the following may be connected to the I/O interface 1006: input devices 1007 including, for example, a touch screen, touchpad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, and the like; output devices 1008 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; the storage device 1003, including, for example, a magnetic tape, a hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the data cleaning device to communicate wirelessly or by wire with other devices to exchange data. While the figure illustrates a data cleaning device with various components, it should be understood that not all of the illustrated components are required to be implemented or provided; more or fewer components may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication device, or installed from the storage device 1003, or installed from the ROM 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.
The data cleaning equipment provided by the application can solve the technical problem of data cleaning by adopting the data cleaning method in the embodiment. Compared with the prior art, the beneficial effects of the data cleaning device provided by the application are the same as those of the data cleaning method provided by the embodiment, and other technical features of the data cleaning device are the same as those disclosed by the method of the previous embodiment, and are not repeated here.
It is to be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The present application provides a computer readable storage medium having computer readable program instructions (i.e., a computer program) stored thereon for performing the data cleansing method of the above-described embodiments.
The computer readable storage medium provided by the present application may be, for example, a USB flash drive, but is not limited to an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wire, optical fiber cable, RF (radio frequency), or any suitable combination of the foregoing.
The computer readable storage medium may be embodied in a data cleaning device; or may exist alone without being assembled into the data cleaning device.
The computer-readable storage medium carries one or more programs that, when executed by the data cleaning device, cause the data cleaning device to: acquiring an original data table and data to be cleaned in the original data table; establishing a new data table and a temporary data table based on the original data table; obtaining an index field of the original data table based on the data to be cleaned; screening the data in the original data table according to the index field, and transferring the screened data to the new data table; and updating the new data table by utilizing the index field and the temporary data table to obtain a final result table.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware. Wherein the name of the module does not constitute a limitation of the unit itself in some cases.
The readable storage medium provided by the application is a computer readable storage medium, and the computer readable storage medium stores computer readable program instructions (i.e. a computer program) for executing the data cleaning method, so that the technical problem of data cleaning can be solved. Compared with the prior art, the beneficial effects of the computer readable storage medium provided by the application are the same as those of the data cleaning method provided by the above embodiment, and are not described herein.
The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of a data cleaning method as described above.
The computer program product provided by the application can solve the technical problem of data cleaning. Compared with the prior art, the beneficial effects of the computer program product provided by the application are the same as those of the data cleaning method provided by the above embodiment, and are not described herein.
The foregoing description is only a partial embodiment of the present application, and is not intended to limit the scope of the present application, and all the equivalent structural changes made by the description and the accompanying drawings under the technical concept of the present application, or the direct/indirect application in other related technical fields are included in the scope of the present application.
Claims (10)
1. A method of data cleansing, the method comprising:
acquiring an original data table and data to be cleaned in the original data table;
Establishing a new data table and a temporary data table based on the original data table;
obtaining an index field of the original data table based on the data to be cleaned;
Screening the data in the original data table according to the index field, and transferring the screened data to the new data table;
And updating the new data table by utilizing the index field and the temporary data table to obtain a final result table.
2. The method of claim 1, wherein the step of deriving an index field of the raw data table based on the data to be cleaned comprises:
establishing a combined index of the original data table according to the data to be cleaned;
An index field is determined based on the combined index.
3. The method of claim 2, wherein the index field includes a mark field, a first timestamp, and a second timestamp, and wherein the step of screening the data in the original data table according to the index field and transferring the screened data to the new data table comprises:
screening the data in the original data table according to the mark field and the first timestamp to obtain screened first data;
migrating the first data to the new data table based on the original data table;
screening the data in the original data table according to the mark field, the first timestamp and the second timestamp to obtain screened second data;
and based on the original data table, migrating the second data to the new data table.
4. The method of claim 3, wherein the step of updating the new data table based on the index field and the temporary data table to obtain a final result table comprises:
Performing data updating operation on the new data table based on a preset data updating instruction;
inserting the data before the data updating operation of the new data table into the temporary data table;
performing data screening on the original data table based on the mark field and the second timestamp;
Determining final data by using the original data table, the temporary data table and the new data table according to the data screening result of the original data table;
and merging the final data into the new data table to obtain a final result table.
5. The method of claim 4, wherein the step of determining final data using the original data table, the temporary data table, and the new data table based on the result of the data screening of the original data table comprises:
if the data of the original data table is not screened, determining final data according to the new data table;
And if the data of the original data table is screened, inquiring the data of the screened original data table in the temporary data table and the new data table.
6. The method of claim 5, wherein if the data of the original data table is screened, the step of querying the temporary data table and the new data table for the screened data of the original data table comprises:
performing data query on the temporary data table based on the data of the screened original data table;
if the data of the screened original data table is not queried, taking the data of the screened original data table as final data;
And if the data of the screened original data table is queried, comparing the original data table, the temporary data table and the new data table field by field to obtain final data.
7. The method of claim 6, wherein the step of merging the final data into the new data table to obtain a final result table further comprises:
And deleting the original data table and the temporary data table.
8. A data cleaning apparatus, the apparatus comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the data cleaning method of any one of claims 1 to 7.
9. A storage medium, characterized in that the storage medium is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the data cleaning method according to any one of claims 1 to 7.
10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the steps of the data cleaning method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410745972.7A CN118484449A (en) | 2024-06-11 | 2024-06-11 | Data cleaning method, device, storage medium and computer program product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118484449A true CN118484449A (en) | 2024-08-13 |
Family
ID=92197232
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410745972.7A Pending CN118484449A (en) | 2024-06-11 | 2024-06-11 | Data cleaning method, device, storage medium and computer program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118484449A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11429641B2 (en) | Copying data changes to a target database | |
US9218377B2 (en) | Failure recovery and error correction techniques for data loading in information warehouses | |
US8832159B2 (en) | Systems and methods for asynchronous schema changes | |
CN108932282B (en) | Database migration method and device and storage medium | |
CN109906448B (en) | Method, apparatus, and medium for facilitating operations on pluggable databases | |
US20180046643A1 (en) | Consistent execution of partial queries in hybrid dbms | |
US9171027B2 (en) | Managing a multi-version database | |
US20160078085A1 (en) | Method and system for adaptively building and updating a column store database from a row store database based on query demands | |
US9390111B2 (en) | Database insert with deferred materialization | |
US10592400B2 (en) | System and method for creating variants in a test database during various test stages | |
CN110399378B (en) | Database system lock operation analysis method and device | |
CN118484449A (en) | Data cleaning method, device, storage medium and computer program product | |
CN114116907A (en) | Database synchronization method and device, electronic equipment and storage medium | |
CN114461454A (en) | Data recovery method and device, storage medium and electronic equipment | |
CN112749156A (en) | Data processing method, database management system and data processing equipment | |
JP7481283B2 (en) | Metadata management device, data management system, and data reproduction method | |
CN117608536B (en) | Gap data online template customization and supplementary recording system and method thereof | |
US20230281009A1 (en) | Managing artifact information including finding a searched artifact information item | |
CN117931819A (en) | Hadoop offline data increment updating method, system and storable medium | |
CN118796953A (en) | Real-time data computing processing method, system, computer equipment and medium | |
CN115687282A (en) | File synchronization method and device, electronic equipment and storage medium | |
CN115840786A (en) | Data lake data synchronization method and device | |
CN117390036A (en) | Incremental refreshing method and device for materialized view crossing database | |
CN102890679A (en) | Method and system for processing data version | |
CN118689910A (en) | Vector indexing method, vector indexing device, electronic equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||