CN115481114A - Data cleaning method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN115481114A
Authority
CN
China
Prior art keywords
data
target
environment
data table
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211123879.XA
Other languages
Chinese (zh)
Inventor
吕泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data cleaning method, which is applied to the technical field of data processing. The method provided by the application comprises the following steps: acquiring a target data cleaning task comprising a target database of a target system and a target data table of the target database; acquiring the number of deployment environments of the target system; splitting the target data cleaning task according to the number of the deployment environments and the hardware resource performance of the deployment environments to obtain a data acquisition task list; executing the data acquisition task according to a preset data acquisition rule and adding the data acquisition task into a data table set to be processed; generating a data processing task list for the target data table according to the target data cleaning task; executing the data processing tasks in the data processing task list, and taking the processed target data table as a data table to be updated; and acquiring the data table to be updated according to a preset data updating rule, and updating the data table to be updated to the deployment environment of the target system.

Description

Data cleaning method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data cleaning method and apparatus, a computer device, and a storage medium.
Background
In the field of data processing, there is a widespread need to clean the data in a target database of a target system, either to process dirty data left over from the past or to process historical data in other ways to meet business requirements.
However, existing data cleaning techniques have two problems during execution. On the one hand, reading the source data puts heavy access pressure on the database and can block it, causing production accidents. On the other hand, concurrency problems arise during the cleaning process, leading to repeated reads and repeated processing, which wastes resources.
Disclosure of Invention
The embodiment of the application provides a data cleaning method and device, computer equipment and a storage medium, and aims to solve the problems of database blockage and repeated processing caused by the existing data cleaning technology.
In a first aspect of the present application, a data cleansing method is provided, including:
acquiring a target data cleaning task, wherein the data cleaning task comprises a target database of a target system and a target data table of the target database;
acquiring all deployment environments of the target system, wherein each deployment environment comprises the target database and the target data table;
generating a data acquisition task list, wherein the data acquisition task list is obtained by splitting the target data cleaning task according to the acquired hardware resource performance of the deployment environment;
executing the data acquisition tasks in the data acquisition task list according to a preset data acquisition rule to acquire the target data table, and adding the acquired target data table into a data table set to be processed;
generating a data processing task list, wherein the data processing task list is generated according to different processing modes of the target data table in the target data cleaning task;
executing the data processing tasks in the data processing task list, and adding the processed target data table serving as a data table to be updated into a data table set to be updated;
and acquiring the data table to be updated in the data table set to be updated according to a preset data updating rule, and updating the data table to be updated to all deployment environments of the target system.
In a second aspect of the present application, there is provided a data cleaning apparatus comprising:
the data cleaning task acquisition module is used for acquiring a target data cleaning task, and the data cleaning task comprises a target database of a target system and a target data table of the target database;
a deployment environment obtaining module, configured to obtain all deployment environments of the target system, where each deployment environment includes the target database and the target data table;
the data acquisition task module is used for generating a data acquisition task list, and the data acquisition task list is obtained by splitting the target data cleaning task according to the acquired hardware resource performance of the deployment environment;
the data acquisition execution module is used for executing the data acquisition tasks in the data acquisition task list according to a preset data acquisition rule to acquire the target data table and adding the acquired target data table into a data table set to be processed;
the data processing task module is used for generating a data processing task list, and the data processing task list is generated according to different processing modes of the target data table in the target data cleaning task;
the data processing execution module is used for executing the data processing tasks in the data processing task list and adding the processed target data table serving as a data table to be updated into a data table set to be updated;
and the data updating module is used for acquiring the data table to be updated in the data table set to be updated according to a preset data updating rule and updating the data table to be updated to all the deployment environments of the target system.
In a third aspect of the present application, a computer device is provided, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the data cleansing method when executing the computer program.
In a fourth aspect of the present application, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned data cleansing method.
According to the data cleaning method, the data cleaning device, the computer equipment and the storage medium, a target data cleaning task comprising a target database of a target system and a target data table of the target database is obtained; acquiring the number of deployment environments of the target system; splitting the target data cleaning task according to the number of the deployment environments and the hardware resource performance of the deployment environments to obtain a data acquisition task list; executing the data acquisition task according to a preset data acquisition rule and adding the data acquisition task into a data table set to be processed; generating a data processing task list for the target data table according to the target data cleaning task; executing the data processing tasks in the data processing task list, and taking the processed target data table as a data table to be updated; and acquiring the data table to be updated according to a preset data updating rule, and updating the data table to be updated to the deployment environment of the target system. The method not only reduces the access pressure to the database in the data cleaning process, but also avoids the problems of database blockage and repeated processing of the data cleaning process, and further improves the data cleaning efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a diagram illustrating an application environment of a data cleansing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a data cleansing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data cleansing apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
The data cleaning method provided by the application can be applied to the application environment shown in fig. 1. The computer device can be, but is not limited to, a personal computer or a notebook computer. The computer device can also be a server, and the server can be an independent server or a cloud server providing basic cloud computing services such as cloud service, cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), and big data and artificial intelligence platforms. It will be appreciated that the number of computer devices in fig. 1 is merely illustrative and can be extended arbitrarily according to actual requirements.
In an embodiment, as shown in fig. 2, a data cleansing method is provided, which is described by taking the computer apparatus in fig. 1 as an example, and includes the following steps S101 to S107:
S101, acquiring a target data cleaning task, wherein the data cleaning task comprises a target database of a target system and a target data table of the target database;
Specifically, the data cleaning task is to clean the data in the target data table in the target database of the target system. Cleaning refers to the process of identifying incomplete, incorrect or irrelevant parts of the data and then replacing, modifying or deleting the dirty data. Data cleaning addresses various problems with data, including but not limited to: data integrity, data legitimacy, data consistency, data uniqueness, and data authority. For example, a user record in the user information table lacks information such as age; a user in the user information table has an age greater than 150 years; or the same user has at least two records in the user information table.
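The three examples above (a missing age, an age greater than 150, and two records for the same user) can be sketched as simple row checks. This is a minimal illustration only; the field names `user_id` and `age`, the function `find_dirty_rows` and the sample rows are assumptions and do not come from the application.

```python
from collections import Counter

MAX_PLAUSIBLE_AGE = 150  # threshold taken from the example in the text


def find_dirty_rows(rows):
    """Return a dict mapping an issue name to the user ids showing that issue."""
    id_counts = Counter(row["user_id"] for row in rows)
    issues = {"missing_age": [], "implausible_age": [], "duplicate": []}
    for row in rows:
        age = row.get("age")
        if age is None:
            # Integrity problem: a required attribute is absent.
            issues["missing_age"].append(row["user_id"])
        elif age > MAX_PLAUSIBLE_AGE:
            # Legitimacy problem: the value is outside a plausible range.
            issues["implausible_age"].append(row["user_id"])
    # Uniqueness problem: the same user appears in at least two records.
    issues["duplicate"] = sorted(uid for uid, n in id_counts.items() if n >= 2)
    return issues


# Hypothetical sample data.
users = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": None},
    {"user_id": 3, "age": 172},
    {"user_id": 4, "age": 51},
    {"user_id": 4, "age": 51},
]
report = find_dirty_rows(users)
```

Detection is kept separate from repair here because the text leaves the choice between replacing, modifying and deleting dirty data to the specific cleaning task.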
Further, before the target data cleaning task is acquired, the method further includes: acquiring a data set contained in the target system, wherein the data set contains the target database and the target data table of the target database; then partitioning the data set according to data update time; and finally generating a data operation time index for the partitioned data set. Partitioning the data set contained in the target system means dividing all the data contained in the target system into partitions that can be managed and accessed independently. Partitioning can improve scalability, reduce contention, and optimize performance. Partitioning the data set contained in the target system includes, but is not limited to: horizontal partitioning, vertical partitioning, and functional partitioning. Horizontal partitions are generally referred to as shards; each shard is an independent data store that holds a specific subset of the data. For example, after sorting by time, user order data older than a preset time is stored in one horizontal partition. In vertical partitioning, each partition stores a subset of the fields of the entries in the data store. For example, the frequently accessed fields of the user information are stored in one vertical partition, and the less frequently accessed fields are stored in another vertical partition. Functional partitioning stores the data in the target system in separate partitions according to function. For example, in an insurance system, the records of insurance purchased by users are stored in one functional partition, and the records of insurance taken out by users are stored in another functional partition.
Further, it should be noted that horizontal, vertical and functional partitioning can be combined in practice. For example, the data is first partitioned horizontally, and each resulting subset is then partitioned vertically. After all the data in the target system has been partitioned, the time needed to query and acquire data while executing the target data cleaning task is effectively shortened, which further improves the execution efficiency of the data cleaning method in this embodiment. The data operation time index is a data structure that pre-sorts the values of one or more columns. With the data operation time index, the database system can directly locate the records that meet a condition without scanning the whole table, which speeds up queries and improves the execution efficiency of the data cleaning method in this embodiment.
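As a rough in-memory illustration of the two ideas above, the sketch below groups rows into horizontal partitions by update year and builds a sorted index over operation time so that a time range can be located by binary search instead of a full scan. The field names `id` and `updated_at`, the class `OperationTimeIndex` and the function `partition_by_year` are assumptions made for the example; a real database would provide its own partitioning and indexing facilities.

```python
import bisect
from datetime import datetime


def partition_by_year(rows):
    """Horizontal partitioning: group row ids by the year of their update time."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row["updated_at"].year, []).append(row["id"])
    return partitions


class OperationTimeIndex:
    """Pre-sorted (operation_time, row_id) pairs supporting range lookups."""

    def __init__(self, rows):
        self._entries = sorted((row["updated_at"], row["id"]) for row in rows)
        self._times = [t for t, _ in self._entries]

    def earliest(self):
        return self._times[0]

    def latest(self):
        return self._times[-1]

    def ids_in_range(self, start, end):
        """Row ids with start <= operation time < end, found by binary search."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        return [row_id for _, row_id in self._entries[lo:hi]]


# Hypothetical sample data.
rows = [
    {"id": 1, "updated_at": datetime(2020, 5, 1)},
    {"id": 2, "updated_at": datetime(2021, 1, 15)},
    {"id": 3, "updated_at": datetime(2019, 11, 30)},
    {"id": 4, "updated_at": datetime(2021, 7, 4)},
]
partitions = partition_by_year(rows)
index = OperationTimeIndex(rows)
hits = index.ids_in_range(datetime(2020, 1, 1), datetime(2021, 6, 1))
```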
S102, acquiring all deployment environments of the target system, wherein each deployment environment comprises the target database and the target data table.
Specifically, acquiring all deployment environments of the target system includes the following. First, a first environment quantity of the formal environments of the target system and a second environment quantity of the grayscale environments of the target system are obtained; the deployment environments comprise the formal environments and the grayscale environments. Grayscale release is a common practice in software engineering, and its technical details are not described here. Then, a first access frequency is set according to the hardware resource performance of the formal environment, and a second access frequency is set according to the hardware resource performance of the grayscale environment. Because the computer devices or computer clusters hosting the formal environment and the grayscale environment differ in hardware resource performance, different data access frequencies need to be set for the different deployment environments. Generally, the hardware resource performance of the formal environment is higher than that of the grayscale environment, so the first access frequency is generally set higher than the second access frequency. Further, peak historical data of the data access peaks of the formal environment is obtained, and based on this peak historical data the first access frequency is split, by a preset access frequency splitting method, into first segment access frequencies for at least one frequency segment.
The first access frequency is split into first segment access frequencies according to the peak historical data so that the data acquisition steps of the data cleaning method of this embodiment do not affect the experience of real users in the formal environment. For example, if data acquisition for the data cleaning method coincides with a peak of real user access to the formal environment, the hardware resources of the formal environment will inevitably be insufficient, and real users will then experience stutter, delay and similar problems when accessing the formal environment. Allocating the first access frequency of the formal environment according to its peak historical data effectively avoids such problems. Finally, access switches are configured for the formal environment and the grayscale environment. An access switch determines whether the data in the target data table can be acquired from the formal environment or grayscale environment that the switch corresponds to.
The formal environment is the deployment environment that directly faces the user group, and the grayscale environment also serves part of the user group, so when the data cleaning task in this embodiment performs data acquisition steps it places a certain resource load on the formal environment and/or the grayscale environment. An access switch is therefore needed to decide whether the formal environment and/or the grayscale environment may be accessed. Before the data cleaning task performs data acquisition steps, the access switch is set according to the current access pressure of the formal environment and/or the grayscale environment, so that no excessive pressure is placed on them.
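The following sketch shows one way the segment access frequencies and the access switch could be modelled. The application does not disclose the preset access frequency splitting method, so the hourly segmentation, the `peak_factor` throttle, the units (reads per minute) and all names below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    base_frequency: int          # allowed cleaning reads per minute (assumed unit)
    access_enabled: bool = False


def split_frequency_by_peak(base_frequency, hourly_load, peak_threshold, peak_factor=0.25):
    """Return one access frequency per hour, throttled in historically busy hours."""
    throttled = max(1, int(base_frequency * peak_factor))
    return [
        throttled if load >= peak_threshold else base_frequency
        for load in hourly_load
    ]


def set_access_switch(env, current_load, max_load):
    """Open the switch only while the environment's current load is below its limit."""
    env.access_enabled = current_load < max_load
    return env.access_enabled


# Hypothetical environments and load figures.
formal = Environment("formal", base_frequency=600)
grayscale = Environment("grayscale", base_frequency=200)

# 24 hourly historical load values with a peak between 09:00 and 12:00.
hourly_load = [100] * 9 + [900] * 3 + [100] * 12
segments = split_frequency_by_peak(formal.base_frequency, hourly_load, peak_threshold=800)

formal_open = set_access_switch(formal, current_load=950, max_load=800)
grayscale_open = set_access_switch(grayscale, current_load=120, max_load=300)
```

In this example the formal environment is currently above its load limit, so its switch stays closed and acquisition would go through the grayscale environment.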
S103, generating a data acquisition task list, wherein the data acquisition task list is obtained by splitting the target data cleaning task according to the acquired hardware resource performance of the deployment environment.
Specifically, generating the data acquisition task list includes the following. First, the hardware performance of the first environment quantity of formal environments and the hardware performance of the second environment quantity of grayscale environments are obtained. At the same time, a formal environment and/or grayscale environment whose hardware performance is sufficient is selected as the data acquisition channel environment, according to historical data acquisition task execution records, the data amount of the target data table contained in the target data cleaning task, and a preset data acquisition channel setting rule. For example, if the historical data acquisition task execution records contain historical time-consuming records for acquiring data from the formal environment and from the grayscale environment, the environment whose historical time-consuming record shows less time is selected. It should be noted that the data in the historical time-consuming records and the data of the target data table contained in the target data cleaning task must come from the same data source. Besides the historical time-consuming records, the growth of the data in the target data table also needs to be considered, that is, the data amount in the target data table at the current time relative to its data amount at the time of the historical time-consuming record. For example, when the current data amount has increased greatly relative to the data amount at the time of the historical time-consuming record, the data acquisition channel for the target data table is switched from the formal environment to the grayscale environment, so as to avoid putting excessive access pressure on the formal environment.
Then, the access switch of the data acquisition channel environment is turned on, and the access switches of the formal environments and/or grayscale environments that are not the data acquisition channel environment are turned off; the function and benefits of the access switch are not repeated here. The target data cleaning task is then split into data acquisition subtasks according to the preset data acquisition channel setting rule. The number of data acquisition subtasks can be set according to the remaining hardware resources of the computer device in fig. 1: a larger number of subtasks is set when the hardware resources of the computer device are plentiful, and a smaller number when they are scarce. When the hardware resources of the computer device are severely insufficient, the splitting step enters a polling state and waits for the hardware resources of the computer device to recover. Finally, each data acquisition subtask is associated with the data acquisition channel environment and added to the data acquisition task list.
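The channel selection and task splitting described above might look like the sketch below. The preset data acquisition channel setting rule is not disclosed, so the `growth_limit` ratio, the use of a free-worker count as a stand-in for remaining hardware resources, and the record layout are all assumptions.

```python
def choose_channel(history, current_rows, growth_limit=2.0):
    """Pick the environment to read from.

    history maps an environment name to its last recorded acquisition,
    e.g. {"formal": {"seconds": 40.0, "rows": 1000}, ...}.
    """
    fastest = min(history, key=lambda name: history[name]["seconds"])
    # If the table has grown a lot since the record was taken, move the load
    # off the formal environment (growth_limit is an assumed threshold).
    if fastest == "formal" and current_rows > growth_limit * history["formal"]["rows"]:
        return "grayscale"
    return fastest


def split_into_subtasks(table, channel, free_workers):
    """Create one acquisition subtask per free worker, each bound to the channel.

    An empty list means resources are exhausted; the caller is expected to
    poll and retry once resources recover.
    """
    if free_workers < 1:
        return []
    return [
        {"table": table, "channel": channel, "shard": i, "shards": free_workers}
        for i in range(free_workers)
    ]


# Hypothetical history: the formal environment was faster last time.
history = {
    "formal": {"seconds": 40.0, "rows": 1000},
    "grayscale": {"seconds": 65.0, "rows": 1000},
}
channel_small = choose_channel(history, current_rows=1200)
channel_grown = choose_channel(history, current_rows=5000)
acquisition_task_list = split_into_subtasks("user_info", channel_grown, free_workers=3)
```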
S104, executing the data acquisition task in the data acquisition task list according to a preset data acquisition rule to acquire the target data table, and adding the acquired target data table into a data table set to be processed.
Further, executing the data acquisition tasks in the data acquisition task list according to a preset data acquisition rule to acquire the target data table includes the following. First, the earliest data operation time in the data operation time index is taken as the initial time. Using the earliest data operation time as the initial time acquires data starting from the records farthest from the current time. Alternatively, the latest data operation time in the data operation time index can be used as the initial time, so that data is acquired starting from the records closest to the current time. The choice of initial time determines the order in which the data in the target data table is acquired, which matters because some practical scenarios require data to be acquired in time order; for example, only the purchase records of the past three years in a user purchase data table are cleaned, while purchase records older than three years are left unprocessed or handled differently. The initial time is used as the start time of a first time range, and the sum of the start time of the first time range and a preset time span is used as the end time of the first time range. Then, the data whose data operation time falls within the first time range is acquired from the target data table.
Finally, each time data is successfully acquired, the preset time span is added to the start time and the end time of the first time range to update the first time range, and the updated first time range is then used to acquire data from the target data table, until all the data in the target data table has been acquired. In other words, after each successful acquisition the first time range is shifted forward or backward by the preset time span, so that the data acquired with the updated first time range does not overlap the data acquired with the first time range before the update.
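The sliding first time range can be sketched as a generator of consecutive half-open windows. The list comprehension below merely stands in for a database query filtered on operation time; the field names `id` and `op_time` and both function names are assumptions.

```python
from datetime import datetime, timedelta


def iter_time_windows(earliest, latest, span):
    """Yield consecutive half-open windows [start, end) that cover earliest..latest."""
    if span <= timedelta(0):
        raise ValueError("span must be positive")
    start = earliest
    while start <= latest:
        end = start + span
        yield start, end
        # Shift the window by exactly one span so that windows never overlap.
        start = end


def fetch_all(rows, span):
    """Acquire every row exactly once, one time window per batch."""
    if not rows:
        return []
    times = [row["op_time"] for row in rows]
    batches = []
    for start, end in iter_time_windows(min(times), max(times), span):
        # Stand-in for: SELECT ... WHERE op_time >= start AND op_time < end
        batches.append([row["id"] for row in rows if start <= row["op_time"] < end])
    return batches


# Hypothetical sample data.
rows = [
    {"id": 1, "op_time": datetime(2022, 1, 1)},
    {"id": 2, "op_time": datetime(2022, 1, 2)},
    {"id": 3, "op_time": datetime(2022, 1, 5)},
    {"id": 4, "op_time": datetime(2022, 1, 6)},
]
batches = fetch_all(rows, span=timedelta(days=2))
```

Half-open windows are used so that a row whose operation time falls exactly on a boundary belongs to one window only, which matches the requirement that successive acquisitions share no data.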
Further, acquiring the data whose data operation time falls within the first time range from the target data table further includes the following. First, a first elapsed time for acquiring the data whose data operation time falls within the first time range from the target data table is recorded. Then, it is judged whether the first elapsed time is within a preset time adjustment range, and if so, the preset time span is adjusted according to a preset time adjustment rule. Although the foregoing steps have already improved the efficiency of acquiring data from the target data table by means of partitioning, indexing, associating a data acquisition channel environment, and setting access switches, other problems can still occur during actual data acquisition. For example, when the data acquisition channel environment is the formal environment and data is being acquired from the target data table in the formal environment, user access to the formal environment may surge. This reduces the remaining hardware resources of the formal environment and increases the first elapsed time of subsequent acquisitions from the target data table, which indicates that the target database holding the target data table is under a certain access pressure. A data cleaning task generally has a lower priority than actual business tasks. Therefore, both to avoid putting excessive access pressure on the database system and to yield data access to other business tasks, the preset time span needs to be further adjusted in this situation.
Further, in addition to increasing the preset time when the first elapsed time increases, the preset time can be decreased when the first elapsed time decreases. The access pressure on the target database is not always at its peak; when that pressure falls, the first elapsed time falls as well, and the preset time can be decreased while the access pressure on the target database stays within a reasonable range, which improves the efficiency of acquiring data from the target data table in the target database.
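A minimal sketch of the adjustment check follows. The application discloses neither the preset time adjustment range nor the preset time adjustment rule, so the rule is passed in as a function; `scaling_rule`, its factor and the range bounds are purely illustrative assumptions and do not indicate which direction of adjustment the application intends.

```python
from datetime import timedelta


def adjust_span(span, elapsed_seconds, adjust_range, rule):
    """Apply the adjustment rule only when the elapsed time is in the adjustment range."""
    low, high = adjust_range
    if low <= elapsed_seconds <= high:
        return rule(span, elapsed_seconds)
    return span


def scaling_rule(factor):
    """Build a hypothetical rule that multiplies the span by a fixed factor."""
    def rule(span, elapsed_seconds):
        return span * factor
    return rule


# Hypothetical values: queries taking 10 to 60 seconds trigger an adjustment.
span = timedelta(days=2)
adjusted = adjust_span(span, elapsed_seconds=12.0, adjust_range=(10.0, 60.0), rule=scaling_rule(0.5))
unchanged = adjust_span(span, elapsed_seconds=3.0, adjust_range=(10.0, 60.0), rule=scaling_rule(0.5))
```

Keeping the rule pluggable lets the same check serve both cases described above, with a separate adjustment range and rule for rising and for falling elapsed times.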
And S105, generating a data processing task list, wherein the data processing task list is generated according to different processing modes of the target data table in the target data cleaning task.
Unlike conventional data cleaning methods, which process each data table individually, the data cleaning method of the present embodiment generates a separate data processing task list for each processing mode, and each data processing task list includes at least one target data table to which that specific processing mode is applied. Further, each target data table is contained in at least one of the data processing task lists. For example, the data processing task of one data processing task list is to round 3-place decimals in the target data table to 2 places, and the data processing task of another data processing task list is to convert the format of all date-type fields. A user data table that has both a decimal-type field and a date-type field therefore appears in both data processing task lists.
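Grouping tables by processing mode rather than by table can be sketched as follows. The mapping from field type to processing mode, the mode names and the table layouts are assumptions made for the example.

```python
from collections import defaultdict


def build_processing_task_lists(table_fields, mode_for_type):
    """Return {processing_mode: [table, ...]} with one task list per mode.

    table_fields maps a table name to {field_name: field_type};
    mode_for_type maps a field type to the processing mode it requires.
    """
    task_lists = defaultdict(list)
    for table, field_types in table_fields.items():
        modes = {mode_for_type[t] for t in field_types.values() if t in mode_for_type}
        for mode in sorted(modes):
            task_lists[mode].append(table)
    return dict(task_lists)


# Hypothetical schema and processing modes.
table_fields = {
    "user": {"balance": "decimal", "birthday": "date"},
    "order": {"amount": "decimal"},
    "log": {"message": "text"},
}
mode_for_type = {
    "decimal": "round_to_2_places",
    "date": "normalize_date_format",
}
task_lists = build_processing_task_lists(table_fields, mode_for_type)
```

Here the `user` table lands in both task lists because it has a decimal field and a date field, while `log` needs no processing and appears in neither.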
S106, executing the data processing tasks in the data processing task list, and adding the processed target data table serving as a data table to be updated into a data table set to be updated.
Further, after the data processing tasks in the data processing task list are all executed, result verification is performed on the execution result of the data processing tasks. That is, it is checked whether all the data tables in the set of data tables to be updated have completed the data processing task according to the data processing requirement, and if not completed or partially completed, the data processing task needs to be re-executed.
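The verify-then-re-execute loop can be sketched as below, using the rounding task mentioned earlier as the processing step. The bounded `max_attempts`, the table layout and the function names are assumptions; the text only states that incomplete tasks are re-executed.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_amounts(table):
    """Processing step: round every amount to two decimal places."""
    rounded = [v.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) for v in table["amounts"]]
    return {"name": table["name"], "amounts": rounded}


def is_fully_rounded(table):
    """Verification step: no amount may keep more than two decimal places."""
    return all(-v.as_tuple().exponent <= 2 for v in table["amounts"])


def run_with_verification(tables, process, verify, max_attempts=3):
    """Process tables, re-executing any whose result fails verification."""
    pending = list(tables)
    updated = []
    for _ in range(max_attempts):
        still_pending = []
        for table in pending:
            result = process(table)
            if verify(result):
                updated.append(result)       # goes to the set of tables to be updated
            else:
                still_pending.append(table)  # will be re-executed
        pending = still_pending
        if not pending:
            break
    return updated, pending


# Hypothetical sample table.
tables = [{"name": "orders", "amounts": [Decimal("1.005"), Decimal("2.344")]}]
to_update, failed = run_with_verification(tables, round_amounts, is_fully_rounded)
```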
S107, the data table to be updated in the data table set to be updated is obtained according to a preset data updating rule, and the data table to be updated is updated to all deployment environments of the target system.
The process of updating the data table to be updated to the deployment environments of the target system is similar to the process of acquiring the data in the target data table of the target database from those deployment environments: the former writes updated data to the database and the latter queries the database, and both put access pressure on the target database. The preset data updating rule is therefore designed with reference to the design of the data acquisition rule, so that the access pressure on the target database stays within a reasonable range while the data table to be updated is written to the deployment environments of the target system. Specific technical details are not repeated here.
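Since the updating rule is said to mirror the acquisition rule, one plausible reading is a batched write to every deployment environment. The sketch below uses plain dictionaries in place of real database tables; the batch size, the function name and the data layout are assumptions, not details from the application.

```python
def update_all_environments(rows, environments, batch_size):
    """Write rows to every environment in fixed-size batches.

    environments maps an environment name to a dict standing in for that
    environment's copy of the table, keyed by row id.
    Returns the number of batches written per environment.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches_written = {name: 0 for name in environments}
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for name, table in environments.items():
            for row in batch:
                table[row["id"]] = row   # stand-in for an UPDATE statement
            batches_written[name] += 1
    return batches_written


# Hypothetical cleaned rows and two deployment environments.
cleaned_rows = [{"id": i, "amount": "1.00"} for i in range(5)]
environments = {"formal": {}, "grayscale": {}}
written = update_all_environments(cleaned_rows, environments, batch_size=2)
```

Small batches limit how much write load reaches each database at once, in the same way the time windows limit read load during acquisition.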
According to the data cleaning method provided by the embodiment, a target data cleaning task comprising a target database of a target system and a target data table of the target database is obtained; acquiring the number of deployment environments of the target system; splitting the target data cleaning task according to the number of the deployment environments and the hardware resource performance of the deployment environments to obtain a data acquisition task list; executing the data acquisition task according to a preset data acquisition rule and adding the data acquisition task into a data table set to be processed; generating a data processing task list for the target data table according to the target data cleaning task; executing the data processing tasks in the data processing task list, and taking the processed target data table as a data table to be updated; and acquiring the data table to be updated according to a preset data updating rule, and updating the data table to be updated to the deployment environment of the target system. The method not only reduces the access pressure to the database in the data cleaning process, but also avoids the problems of database blockage and repeated processing of the data cleaning process, and further improves the data cleaning efficiency.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In an embodiment, a data cleansing apparatus 100 is provided, and the data cleansing apparatus 100 corresponds to the data cleansing method in the above embodiments one to one. As shown in fig. 3, the data cleaning apparatus 100 includes a data cleaning task obtaining module 11, a deployment environment obtaining module 12, a data obtaining task module 13, a data obtaining execution module 14, a data processing task module 15, a data processing execution module 16, and a data updating module 17. The functional modules are explained in detail as follows:
the data cleaning task acquisition module 11 is configured to acquire a target data cleaning task, where the data cleaning task includes a target database of a target system and a target data table of the target database;
a deployment environment obtaining module 12, configured to obtain all deployment environments of the target system, where each deployment environment includes the target database and the target data table;
a data acquisition task module 13, configured to generate a data acquisition task list, where the data acquisition task list is obtained by splitting the target data cleaning task according to the acquired hardware resource performance of the deployment environment;
a data acquisition execution module 14, configured to execute the data acquisition task in the data acquisition task list according to a preset data acquisition rule to acquire the target data table, and add the acquired target data table to a set of data tables to be processed;
a data processing task module 15, configured to generate a data processing task list, where the data processing task list is generated according to different processing manners of the target data table in the target data cleaning task;
a data processing execution module 16, configured to execute a data processing task in the data processing task list, and add the processed target data table as a data table to be updated into a data table set to be updated;
and the data updating module 17 is configured to obtain the data table to be updated in the data table set to be updated according to a preset data updating rule, and update the data table to be updated to all deployment environments of the target system.
Further, the data cleaning task obtaining module 11 further includes:
a data set obtaining sub-module, configured to obtain a data set included in the target system, where the data set includes the target database and a target data table of the target database;
the data partition processing submodule is used for carrying out partition processing on the data set according to the data updating time;
and the data index generation submodule is used for generating a data operation time index of the data set after the data set is subjected to partition processing.
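As a non-limiting illustration, partitioning by data update time and generating a data operation time index may be sketched as follows. The sketch assumes in-memory rows with hypothetical fields `id` and `updated_at`, one partition per calendar day, and a sorted list as the index; none of these choices is specified by the embodiment.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime


def partition_by_update_time(rows):
    """Group rows into partitions keyed by the calendar day of `updated_at`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["updated_at"].date()].append(row)
    return dict(partitions)


def build_operation_time_index(partitions):
    """Return a sorted list of (operation time, row id) pairs over all partitions."""
    return sorted(
        (row["updated_at"], row["id"])
        for part in partitions.values()
        for row in part
    )


def ids_in_range(index, start, end):
    """Return ids of rows whose operation time lies in [start, end), by binary search."""
    lo = bisect_left(index, (start,))
    hi = bisect_left(index, (end,))
    return [row_id for _, row_id in index[lo:hi]]
```

With such an index, the earliest data operation time is simply the first entry, and a time-range query does not need to scan the whole table.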
Further, the deployment environment acquisition module 12 further includes:
the environment quantity obtaining sub-module is used for obtaining a first environment quantity of a formal environment of the target system and a second environment quantity of a gray level environment of the target system, and the deployment environment comprises the formal environment and the gray level environment;
the access frequency setting submodule is used for setting a first access frequency according to the hardware resource performance of the formal environment and setting a second access frequency according to the hardware resource performance of the gray environment;
and the first access switch setting submodule is used for configuring an access switch of the formal environment and the gray environment, and the access switch is used for judging whether the data in the target data table can be acquired from the formal environment or the gray environment corresponding to the access switch.
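As a non-limiting illustration, the environment quantities, access frequencies and access switches handled by these sub-modules may be modelled as follows. The class `DeploymentEnvironment`, its fields, and the rule of ten queries per minute per CPU core are assumptions introduced for illustration; the embodiment does not state how the access frequency is derived from the hardware resource performance.

```python
from dataclasses import dataclass


@dataclass
class DeploymentEnvironment:
    name: str
    kind: str                    # "formal" (production) or "gray" (gray-release)
    cpu_cores: int
    memory_gb: int
    access_per_minute: int = 0   # access frequency, set from hardware performance
    access_enabled: bool = False # access switch, off until configured


def count_environments(envs):
    """Return (first environment quantity, second environment quantity)."""
    formal = sum(1 for e in envs if e.kind == "formal")
    gray = sum(1 for e in envs if e.kind == "gray")
    return formal, gray


def set_access_frequency(env, per_core=10, cap=600):
    """Hypothetical rule: `per_core` queries per minute per CPU core, capped at `cap`."""
    env.access_per_minute = min(env.cpu_cores * per_core, cap)
    return env.access_per_minute


def can_acquire_from(env):
    """The access switch decides whether data may be acquired from this environment."""
    return env.access_enabled
```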
Further, the data obtaining task module 13 further includes:
a hardware performance obtaining sub-module, configured to obtain hardware performance of the formal environment in the first environment quantity and hardware performance of the grayscale environment in the second environment quantity, respectively;
a data acquisition channel environment submodule for acquiring the formal environment and/or the gray level environment meeting the hardware performance as a data acquisition channel environment according to a historical data acquisition task execution record, the data amount of the target data table contained in the target data cleaning task and a preset data acquisition channel setting rule;
a second access switch setting submodule for turning on the access switch of the data acquisition channel environment and turning off the access switches of the formal environment and/or the gray scale environment which are not the data acquisition channel environment;
the data cleaning task splitting sub-module is used for correspondingly splitting the target data cleaning task into data acquisition sub-tasks according to the preset data acquisition channel setting rule;
and the task and channel association submodule is used for associating the data acquisition subtask with the data acquisition channel environment and adding the data acquisition subtask to the data processing task list.
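As a non-limiting illustration, selecting data acquisition channel environments, setting their access switches, and splitting the target data cleaning task may be sketched as follows. The dictionary keys, the CPU-core threshold, the `rows_per_channel` parameter and the split in proportion to CPU cores are all assumptions introduced for illustration; the use of historical data acquisition task execution records is omitted for brevity.

```python
def select_channels(envs, min_cores, row_count, rows_per_channel):
    """Pick environments that meet the hardware threshold, strongest first,
    using only as many as the data volume calls for (hypothetical rule)."""
    eligible = sorted(
        (e for e in envs if e["cpu_cores"] >= min_cores),
        key=lambda e: e["cpu_cores"],
        reverse=True,
    )
    needed = max(1, -(-row_count // rows_per_channel))  # ceiling division
    channels = eligible[:needed]
    chosen = {e["name"] for e in channels}
    for e in envs:
        # Turn the switch on for channel environments and off for all others.
        e["access_enabled"] = e["name"] in chosen
    return channels


def split_task(task, channels):
    """Split one cleaning task into one acquisition subtask per channel,
    sizing each share in proportion to the channel's CPU cores."""
    total_cores = sum(c["cpu_cores"] for c in channels)
    subtasks, start = [], 0
    for i, c in enumerate(channels):
        if i == len(channels) - 1:
            end = task["row_count"]  # the last channel takes the remainder
        else:
            end = start + task["row_count"] * c["cpu_cores"] // total_cores
        # Each subtask records the environment it is associated with.
        subtasks.append({"table": task["table"], "env": c["name"], "rows": (start, end)})
        start = end
    return subtasks
```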
Further, the data acquisition execution module 14 further includes:
the initial time submodule is used for acquiring the earliest data operation time in the data operation time index as initial time;
the first time range submodule is used for taking the initial time as the starting time of a first time range and taking the sum of the starting time of the first time range and a preset time span as the ending time of the first time range;
the first data acquisition sub-module is used for acquiring data with data operation time within the first time range from the target data table;
and the data acquisition cycle sub-module is used for adding the preset time span to both the starting time and the ending time of the first time range after each successful data acquisition so as to update the first time range, and then acquiring the data in the target data table by using the updated first time range, until all the data in the target data table has been acquired.
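As a non-limiting illustration, the cyclic acquisition over a sliding first time range may be sketched as follows. The sketch assumes in-memory rows with a hypothetical field `op_time` standing in for the target data table, and treats the time range as half-open so that no row is acquired twice; the embodiment does not state whether the ending time is inclusive.

```python
from datetime import datetime, timedelta


def acquire_in_windows(rows, span):
    """Yield (start, end, batch) for successive time ranges of width `span`."""
    if not rows:
        return
    start = min(r["op_time"] for r in rows)   # initial time: earliest operation time
    latest = max(r["op_time"] for r in rows)
    end = start + span                        # ending time = starting time + span
    while start <= latest:
        batch = [r for r in rows if start <= r["op_time"] < end]
        yield start, end, batch
        # Slide both bounds forward by the preset time span.
        start, end = start + span, end + span
```

Each iteration touches only the rows inside one bounded time range, which is what limits the load placed on the database by any single query.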
Further, the first data obtaining sub-module further includes:
the first time consumption recording subunit is used for recording first time consumption of data, the data operation time of which is within the first time range, acquired from the target data table;
and the preset time span adjusting subunit is used for judging whether the first consumed time is within a preset time adjusting range, and if so, adjusting the preset time span according to a preset time adjusting rule.
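As a non-limiting illustration, adjusting the preset time span according to the recorded first consumed time may be sketched as follows. The adjustment ranges, the scaling factors and the minimum and maximum spans are assumptions introduced for illustration; the embodiment does not specify the preset time adjustment rule.

```python
from datetime import timedelta

# Hypothetical rule: (lower bound in s, upper bound in s, factor applied to the span).
ADJUSTMENT_RULES = [
    (0.0, 0.5, 2.0),            # query finished quickly: widen the time span
    (5.0, float("inf"), 0.5),   # query was slow: narrow the time span
]


def adjust_span(span, elapsed_seconds, rules=ADJUSTMENT_RULES,
                min_span=timedelta(minutes=1), max_span=timedelta(days=1)):
    """Return the new time span given the time the last acquisition consumed."""
    for low, high, factor in rules:
        if low <= elapsed_seconds < high:
            # Inside an adjustment range: scale the span and keep it within bounds.
            return max(min_span, min(max_span, span * factor))
    return span  # outside every adjustment range: leave the span unchanged
```

Under these assumed values, a one-hour span becomes two hours after a 0.2-second query and thirty minutes after an 8-second query, and stays at one hour after a 2-second query.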
Wherein the meaning of "first" and "second" in the above modules/units is only to distinguish different modules/units, and is not used to define which module/unit has higher priority or other defining meaning. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division and may be implemented in a practical application in a further manner.
For specific limitations of the data cleansing apparatus, reference may be made to the above limitations of the data cleansing method, which are not described herein again. The modules in the data cleaning device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data involved in the data cleansing method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data cleansing method.
In one embodiment, a computer device is provided, which includes a memory, a processor and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, the steps of the data cleaning method in the above embodiments are implemented, such as steps S101 to S107 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the data cleaning apparatus in the above-described embodiments, such as the functions of the modules 11 to 17 shown in fig. 3. To avoid repetition, further description is omitted here.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, the processor being the control center of the computer apparatus, various interfaces and lines connecting the various parts of the overall computer apparatus.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the computer device, etc.
The memory may be integrated in the processor or may be provided separately from the processor.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the data cleaning method in the above-described embodiments, such as steps S101 to S107 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units of the data cleaning apparatus in the above-described embodiments, such as the functions of the modules 11 to 17 shown in fig. 3. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the above described functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present application, and they should be construed as being included in the present application.

Claims (10)

1. A method for data cleansing, comprising:
acquiring a target data cleaning task, wherein the data cleaning task comprises a target database of a target system and a target data table of the target database;
acquiring all deployment environments of the target system, wherein each deployment environment comprises the target database and the target data table;
generating a data acquisition task list, wherein the data acquisition task list is obtained by splitting the target data cleaning task according to the acquired hardware resource performance of the deployment environment;
executing the data acquisition tasks in the data acquisition task list according to a preset data acquisition rule to acquire the target data table, and adding the acquired target data table into a data table set to be processed;
generating a data processing task list, wherein the data processing task list is generated according to different processing modes of the target data table in the target data cleaning task;
executing the data processing tasks in the data processing task list, and adding the processed target data table serving as a data table to be updated into a data table set to be updated;
and acquiring the data table to be updated in the data table set to be updated according to a preset data updating rule, and updating the data table to be updated to all deployment environments of the target system.
2. The data cleansing method of claim 1, wherein the obtaining the target data cleansing task is preceded by:
acquiring a data set contained in the target system, wherein the data set contains the target database and a target data table of the target database;
partitioning the data set according to the data updating time;
and generating a data operation time index of the data set after partition processing.
3. The data cleansing method according to claim 1, wherein the acquiring all deployment environments of the target system comprises:
acquiring a first environment quantity of a formal environment of the target system and a second environment quantity of a gray level environment of the target system, wherein the deployment environment comprises the formal environment and the gray level environment;
setting a first access frequency according to the hardware resource performance of the formal environment, and setting a second access frequency according to the hardware resource performance of the gray-scale environment;
and configuring an access switch of the formal environment and the gray scale environment, wherein the access switch is used for judging whether the data in the target data table can be acquired from the formal environment or the gray scale environment corresponding to the access switch.
4. The data cleansing method of claim 3, wherein the generating a data acquisition task list comprises:
respectively acquiring hardware performance of the formal environment in the first environment quantity and hardware performance of the gray-scale environment in the second environment quantity;
acquiring the formal environment and/or the gray environment meeting hardware performance as a data acquisition channel environment according to a historical data acquisition task execution record, the data amount of the target data table contained in the target data cleaning task and a preset data acquisition channel setting rule;
opening the access switch of the data acquisition channel environment, and closing the access switches of the formal environment and/or the gray level environment which are not in the data acquisition channel environment;
correspondingly splitting the target data cleaning task into data acquisition subtasks according to the preset data acquisition channel setting rule;
and associating the data acquisition subtask with the data acquisition channel environment, and adding the data acquisition subtask to the data processing task list.
5. The data cleaning method according to claim 2, wherein the executing the data obtaining task in the data obtaining task list according to the preset data obtaining rule to obtain the target data table comprises:
acquiring the earliest data operation time in the data operation time index as initial time;
taking the initial time as the starting time of a first time range, and taking the sum of the starting time of the first time range and a preset time span as the ending time of the first time range;
acquiring data with data operation time within the first time range from the target data table;
and after data are successfully acquired each time, adding the preset time span to the starting time and the ending time of the first time range respectively to update the first time range, and then acquiring the data in the target data table by using the updated first time range until all the data in the target data table has been acquired.
6. The data cleansing method of claim 5, wherein the obtaining data from the target data table with a data operation time within the first time range further comprises:
recording first consumed time of acquiring data with data operation time within the first time range from the target data table;
and judging whether the first consumed time is within a preset time adjustment range, and if so, adjusting the preset time span according to a preset time adjustment rule.
7. A data cleansing apparatus, comprising:
the data cleaning task acquisition module is used for acquiring a target data cleaning task, wherein the data cleaning task comprises a target database of a target system and a target data table of the target database;
a deployment environment obtaining module, configured to obtain all deployment environments of the target system, where each deployment environment includes the target database and the target data table;
the data acquisition task module is used for generating a data acquisition task list, and the data acquisition task list is obtained by splitting the target data cleaning task according to the acquired hardware resource performance of the deployment environment;
the data acquisition execution module is used for executing the data acquisition tasks in the data acquisition task list according to a preset data acquisition rule to acquire the target data table and adding the acquired target data table into a data table set to be processed;
the data processing task module is used for generating a data processing task list, and the data processing task list is generated according to different processing modes of the target data table in the target data cleaning task;
the data processing execution module is used for executing the data processing tasks in the data processing task list and adding the processed target data table serving as a data table to be updated into a data table set to be updated;
and the data updating module is used for acquiring the data table to be updated in the data table set to be updated according to a preset data updating rule and updating the data table to be updated to all the deployment environments of the target system.
8. The data cleansing apparatus according to claim 7, wherein the data cleaning task acquisition module further comprises:
a data set obtaining sub-module, configured to obtain a data set included in the target system, where the data set includes the target database and a target data table of the target database;
the data partition processing submodule is used for performing partition processing on the data set according to the data updating time;
and the data index generation submodule is used for generating a data operation time index of the data set after the data set is subjected to partition processing.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the data cleansing method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the data cleansing method according to any one of claims 1 to 6.
CN202211123879.XA 2022-09-15 2022-09-15 Data cleaning method and device, computer equipment and storage medium Pending CN115481114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211123879.XA CN115481114A (en) 2022-09-15 2022-09-15 Data cleaning method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211123879.XA CN115481114A (en) 2022-09-15 2022-09-15 Data cleaning method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115481114A true CN115481114A (en) 2022-12-16

Family

ID=84392584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211123879.XA Pending CN115481114A (en) 2022-09-15 2022-09-15 Data cleaning method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115481114A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115840742A (en) * 2023-02-13 2023-03-24 每日互动股份有限公司 Data cleaning method, device, equipment and medium
CN115840742B (en) * 2023-02-13 2023-05-12 每日互动股份有限公司 Data cleaning method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination