CN107688592B - Data cleaning method and terminal - Google Patents


Info

Publication number
CN107688592B
Authority
CN
China
Prior art keywords
policy information
concurrent
cleaning
data
configuration table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710221427.8A
Other languages
Chinese (zh)
Other versions
CN107688592A (en
Inventor
李治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201710221427.8A (CN107688592B)
Priority to PCT/CN2018/074858 (WO2018184418A1)
Publication of CN107688592A
Application granted
Publication of CN107688592B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of computers and provides a data cleaning method and a terminal. The method comprises the following steps: when a cleaning task for traditional bonus data is received, inserting the cleaning task into a preset task execution table and setting an execution time corresponding to the cleaning task; when the execution time is reached, acquiring a scheduling packet through an Oracle database, and performing data cleaning on policy information in a multi-concurrency mode according to the scheduling packet; and submitting the cleaned policy information in batches and storing it in the Oracle database. The method removes the step of converting between file formats, addresses the complex process, redundant steps and low efficiency of cleaning traditional bonus data in the prior art, improves the efficiency of data cleaning and reduces its total time consumption.

Description

Data cleaning method and terminal
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data cleaning method and a terminal.
Background
When traditional bonus data is cleaned in the prior art, a user first downloads the data to be cleaned from a database, generates txt files from it, and then uses Prophet software for the calculation. The files generated by the Prophet software can be uploaded to the database only after being converted into txt files again. In addition, the computing efficiency of the Prophet software is low, and the data cleaning step usually takes more than 12 hours. The traditional bonus data cleaning process is therefore complex, has redundant steps, and is very inefficient.
Therefore, there is a need to provide a new technical solution to solve the above technical problems.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data cleaning method and a terminal to solve the problems of complex process, redundant steps and low efficiency when traditional bonus data is cleaned in the prior art.
In a first aspect, a method for data cleansing is provided, the method comprising:
when a cleaning task for traditional bonus data is received, inserting the cleaning task into a preset task execution table, and setting an execution time corresponding to the cleaning task;
when the execution time is reached, acquiring a scheduling packet through an Oracle database, and performing data cleaning on policy information in a multi-concurrency mode according to the scheduling packet;
and submitting the cleaned policy information in batches and storing the policy information in the Oracle database.
Further, the data cleaning of the policy information in a multi-concurrent manner includes:
reading a concurrent configuration table, and starting a plurality of concurrent processes according to the concurrent configuration table;
obtaining the remainder of dividing the last digit of the policy number corresponding to the policy information by the total number of the concurrent processes, and distributing the policy information to the concurrent process corresponding to the remainder;
and reading the policy information to be processed by using a cursor, caching the read policy information into a first preset array, submitting the policy information to the distributed concurrent processes in batches, and performing data cleaning on the policy information by the concurrent processes according to a preset data cleaning algorithm.
Further, submitting the cleaned policy information in batches and storing it in the Oracle database includes:
reading the policy information cleaned by the concurrent process into a second preset array, and submitting the policy information in the second preset array to an Oracle database in batches by adopting a commit command;
and storing the policy information in each batch into a corresponding result table in an Oracle database according to the execution time of the policy information in each batch and the process number corresponding to the concurrent process.
Further, after starting a number of concurrent processes according to the concurrent configuration table, the method further includes:
acquiring the state information of the cleaning task from a log table;
if the state information of the cleaning task indicates successful execution, the cleaning task is not executed again;
and if the state information of the cleaning task is execution failure, deleting the processed data in the plurality of concurrent processes, and re-executing the cleaning task.
Further, before reading the concurrent configuration table, the method further includes:
acquiring a binary configuration table and a policy information basic configuration table, and screening policy information to be cleaned from the Oracle database according to the binary configuration table and the policy information basic configuration table;
and reading a switch configuration table, and removing invalid data in the policy information to be cleaned according to the switch configuration table.
In a second aspect, a terminal is provided, which includes:
the task receiving module is used for inserting a cleaning task into a preset task execution table when the cleaning task for traditional bonus data is received, and setting the execution time corresponding to the cleaning task;
the concurrent cleaning module is used for acquiring a scheduling packet through an Oracle database when the execution time is up, and cleaning the data of the policy information in a multi-concurrent mode according to the scheduling packet;
and the storage module is used for submitting the cleaned policy information in batches and storing the policy information in the Oracle database.
Compared with the prior art, the embodiment of the invention inserts a cleaning task into a preset task execution table when the cleaning task for traditional bonus data is received, and sets the execution time corresponding to the cleaning task; when the execution time is reached, it acquires a scheduling packet through an Oracle database and performs data cleaning on policy information in a multi-concurrency mode according to the scheduling packet; finally, it submits the cleaned policy information in batches and stores it in the Oracle database. This removes the step of converting between file formats, improves the efficiency of data cleaning and reduces its total time consumption.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an implementation of a method for data cleansing according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a specific implementation of step S102 in the method for data cleansing according to the first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a specific implementation of step S103 in the method for data cleansing according to the first embodiment of the present invention;
FIG. 4 is a schematic block diagram of a terminal provided in a second embodiment of the present invention;
FIG. 5 is a schematic block diagram of a terminal provided by a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the embodiments of the present invention, when a cleaning task for traditional bonus data is received, the cleaning task is inserted into a preset task execution table and an execution time corresponding to the cleaning task is set; when the execution time is reached, a scheduling packet is acquired through an Oracle database, and data cleaning is performed on policy information in a multi-concurrency mode according to the scheduling packet; finally, the cleaned policy information is submitted in batches and stored in the Oracle database. This removes the step of converting between file formats, improves the efficiency of data cleaning and reduces its total time consumption. The embodiments of the present invention also provide a corresponding terminal. Each is described in detail below.
Fig. 1 is a flow chart of implementing a method for data cleansing according to a first embodiment of the present invention.
In the embodiment of the invention, the method for data cleaning is applied to a terminal, and the terminal comprises but is not limited to a computer, a server and the like. Referring to fig. 1, the method of data cleansing includes:
In step S101, when a cleaning task for traditional bonus data is received, the cleaning task is inserted into a preset task execution table, and an execution time corresponding to the cleaning task is set.
In the embodiment of the invention, the terminal acquires the cleaning task for the traditional bonus data according to a trigger operation of the user on a page, inserts the cleaning task into a preset task execution table, and sets the execution time of the task according to the user operation. For example, the task execution table may be a pala_batch_plan table, which includes a scheduled start date field for controlling the execution time. When the cleaning task is inserted into the pala_batch_plan table, the value of the scheduled start date field is modified to set the execution time of the cleaning task. After the execution time is set, the cleaning task is allowed to start only when the current time is greater than or equal to the execution time, which makes it convenient for the user to arrange cleaning tasks and allows CPU resources to be used reasonably.
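The scheduling gate described above can be sketched as follows. This is only an illustration under stated assumptions: the task execution table is modelled as an in-memory Python list rather than a database table, and the names `CleaningTask`, `schedule_task` and `is_due` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CleaningTask:
    """One row of a hypothetical in-memory task execution table."""
    task_id: int
    scheduled_start_date: datetime  # models the field that controls the execution time


def schedule_task(table: list, task_id: int, start: datetime) -> CleaningTask:
    """Insert a cleaning task into the table and set its execution time."""
    task = CleaningTask(task_id=task_id, scheduled_start_date=start)
    table.append(task)
    return task


def is_due(task: CleaningTask, now: datetime) -> bool:
    """A task may start only when the current time has reached its execution time."""
    return now >= task.scheduled_start_date
```

A scheduler loop would call `is_due` for each row of the table and start only the tasks for which it returns `True`.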
In step S102, when the execution time is reached, a scheduling packet is obtained through the Oracle database, and data cleaning is performed on policy information in a multi-concurrency mode according to the scheduling packet.
When the execution time is reached, the embodiment of the invention obtains a scheduling packet through the Oracle database. The scheduling packet includes parameter information, flow information and other information related to the execution of the cleaning task. A plurality of concurrent processes are then started according to the information in the scheduling packet, and data cleaning is performed on the policy information in the traditional bonus data through the concurrent processes.
For example, continuing the example above, the Oracle database automatically acquires the scheduling packet according to the execution time in the scheduled start date field, and starts a plurality of concurrent processes to execute the cleaning task automatically when the execution time is reached.
Compared with the prior art, the embodiment of the invention optimizes the cleaning process: data is cleaned in a multi-concurrency mode, and the policy information to be processed by each concurrent process is screened out automatically, which ensures that no policy information is processed repeatedly.
Optionally, fig. 2 shows a specific implementation flow of data cleansing for policy information in a multi-concurrency manner according to the first embodiment of the present invention. Referring to fig. 2, the data cleansing of policy information in a multi-concurrency manner includes:
In step S201, a concurrent configuration table is read, and a plurality of concurrent processes are started according to the concurrent configuration table.
In the embodiment of the present invention, the concurrent configuration table is used by the service engineer to configure the number of concurrent processes. The terminal starts the corresponding number of concurrent processes according to the concurrent configuration table to prepare for data cleaning.
Optionally, the preparation before reading the concurrent configuration table further includes screening the policy information, and the method further includes:
acquiring a binary configuration table and a policy information basic configuration table, and screening policy information to be cleaned from the Oracle database according to the binary configuration table and the policy information basic configuration table;
and reading a switch configuration table, and removing invalid data in the policy information to be cleaned according to the switch configuration table.
In the embodiment of the invention, different policy information may be generated by different sales organizations, and the sales organizations are divided into primary organizations and secondary organizations according to administrative divisions. The binary configuration table is used to distinguish which sales organization the policy information belongs to. The policy information basic configuration table records which insurance types in the policy information need data cleaning. By combining the binary configuration table and the policy information basic configuration table, the embodiment of the invention makes a preliminary selection of the policy information to be cleaned, so as to reduce data cleaning operations on invalid policy information. The binary configuration table and the policy information basic configuration table also serve as the basis for subsequent data cleaning.
The switch configuration table is dynamically configured by service personnel and records, for each kind of policy information, the data that the service personnel regard as invalid. According to the switch configuration table, the terminal screens out of the policy information the invalid data that the service personnel are not concerned with and inserts it into a data statistics table to await further operation by the service personnel. For example, a certain type of policy information includes a name field, an age field, a gender field, a telephone field and an education information field, and the switch configuration table provides a switch option for each of these fields. If the service personnel consider the education information field to be irrelevant data, they can turn off the option for that field in the switch configuration table, and the terminal then reads only the name, age, gender and telephone fields of the policy information according to the switch configuration table.
Here, the policy information to be processed is screened out based on the binary configuration table, the policy information basic configuration table and the switch configuration table, so that policy information and irrelevant data that do not need cleaning are excluded in the cleaning preparation stage. This reduces the cleaning workload and further improves the efficiency of data cleaning.
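The field-level filtering driven by the switch configuration table can be illustrated with a small sketch. It assumes that a policy record and the switch configuration are both plain dictionaries; the function name `apply_switches`, the field keys, and the rule that unlisted fields are kept are all hypothetical choices made for this example, not details given in the patent.

```python
def apply_switches(record: dict, switches: dict) -> tuple:
    """Split a policy record into kept fields and switched-off fields.

    `switches` maps a field name to True (read the field) or False (treat it as
    invalid data). Fields missing from `switches` are kept; this default is an
    assumption made for the sketch.
    """
    kept = {k: v for k, v in record.items() if switches.get(k, True)}
    dropped = {k: v for k, v in record.items() if not switches.get(k, True)}
    return kept, dropped


# Example: the last field is switched off, as in the example in the text.
record = {"name": "A", "age": 30, "gender": "F", "phone": "123", "education": "BSc"}
switches = {"name": True, "age": True, "gender": True, "phone": True, "education": False}
kept, dropped = apply_switches(record, switches)
```

In this sketch `kept` is what would go on to cleaning, while `dropped` corresponds to the invalid data that the text says is set aside in a statistics table.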
Optionally, the preparation action after starting several concurrent processes according to the concurrent configuration table may further include state judgment, and the method further includes:
acquiring the state information of the cleaning task from a log table;
if the state information of the cleaning task indicates successful execution, the cleaning task is not executed again;
and if the state information of the cleaning task is execution failure, deleting the processed data in the plurality of concurrent processes, and re-executing the cleaning task.
Here, the embodiment of the present invention checks the state information of the cleaning task corresponding to the current execution time. If the cleaning task has been executed successfully, it is not executed again; if the execution of the cleaning task failed, the data already processed by the concurrent processes is deleted and the flow jumps to step S202 to execute the cleaning task again; if the cleaning task has not been executed, the flow jumps to step S202 to execute it. Repeated execution of the cleaning task is thereby avoided, which reduces time and CPU resource consumption.
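The three-way state check can be summarised in a short sketch. The enum `TaskState`, the functions `decide_action` and `prepare_run`, and the use of `None` for a task that has no log entry yet are hypothetical; the patent only states that the state is read from a log table.

```python
from enum import Enum
from typing import Optional


class TaskState(Enum):
    SUCCESS = "success"
    FAILED = "failed"


def decide_action(state: Optional[TaskState]) -> str:
    """Map the logged state of a cleaning task to the next action."""
    if state is TaskState.SUCCESS:
        return "skip"              # already cleaned, do not run again
    if state is TaskState.FAILED:
        return "delete_and_rerun"  # discard partial results, then run again
    return "run"                   # no log entry yet: first execution


def prepare_run(state: Optional[TaskState], processed: list) -> bool:
    """Clear partial results after a failure; return True if the task should run."""
    action = decide_action(state)
    if action == "delete_and_rerun":
        processed.clear()
    return action != "skip"
```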
In step S202, the remainder of dividing the last digit of the policy number corresponding to the policy information by the total number of concurrent processes is obtained, and the policy information is assigned to the concurrent process corresponding to the remainder.
The embodiment of the invention sets a process number for each concurrent process. After the policy information to be processed has been screened out, a processing process is allocated to each piece of policy information according to its policy number. First, the policy information to be processed and its policy number are acquired. Then, the remainder of dividing the last digit of the policy number by the total number of concurrent processes is obtained. Finally, the policy information is assigned to the concurrent process whose process number equals the remainder; that concurrent process is the processing process of the policy information. For example, suppose the policy number currently to be processed is 201702008 and three concurrent processes have been started, numbered 0, 1 and 2. The remainder of dividing the last digit 8 of the policy number by the total number of concurrent processes 3 is 2, so the policy information for policy number 201702008 is assigned to the concurrent process with process number 2. Processing processes are allocated in the same way for all policy information to be processed, so that each piece of policy information has exactly one processing process. This prevents policy information from being cleaned repeatedly and improves the efficiency of data cleaning.
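The assignment rule can be written as a one-line modulo operation. The sketch below assumes that the policy number is available as a string ending in a decimal digit; the function name `assign_process` is hypothetical.

```python
def assign_process(policy_number: str, total_processes: int) -> int:
    """Return the process number that should clean this policy.

    The process number is the last digit of the policy number modulo the
    total number of concurrent processes.
    """
    if total_processes <= 0:
        raise ValueError("total_processes must be positive")
    last_digit = int(policy_number[-1])
    return last_digit % total_processes


# The example from the text: last digit 8, three processes, so process number 2.
process_no = assign_process("201702008", 3)
```

Because only the last digit is used, at most ten distinct remainders can occur, so in this sketch any process numbered 10 or higher would never receive work.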
In step S203, the policy information to be processed is read by using a cursor, the read policy information is cached in a first preset array and submitted in batches to the allocated concurrent processes, and the concurrent processes perform data cleaning on the policy information according to a preset data cleaning algorithm.
Here, embodiments of the present invention use a cursor to read the policy information to be processed from the Oracle database. When the number of records read reaches a specified threshold, the records are submitted as one batch to the corresponding concurrent process, which performs data cleaning according to a preset data cleaning algorithm. Optionally, the specified threshold may be 5000 records per batch. The corresponding code is as follows:
FETCH c_pol_ind BULK COLLECT INTO v_pol_ind LIMIT 5000;
After 5000 records are read, they are submitted from the first preset array to a concurrent process for cleaning, which helps reduce I/O overhead.
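The effect of the batch limit in the fetch statement above can be imitated outside the database with a generator that hands out rows in fixed-size chunks. This is a Python analogy rather than the patent's PL/SQL; the function name `fetch_in_batches` is hypothetical, and any iterable stands in for the cursor.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fetch_in_batches(rows: Iterable, limit: int = 5000) -> Iterator[List]:
    """Yield successive lists of at most `limit` rows from an iterable cursor."""
    it = iter(rows)
    while True:
        batch = list(islice(it, limit))  # plays the role of the first preset array
        if not batch:
            return
        yield batch


# 12000 rows with a limit of 5000 give batches of 5000, 5000 and 2000 rows.
sizes = [len(batch) for batch in fetch_in_batches(range(12000))]
```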
In step S103, the cleaned policy information is submitted in batches and stored in the Oracle database.
Optionally, fig. 3 shows a specific implementation flow of step S103 in the method for data cleansing according to the first embodiment of the present invention. Referring to fig. 3, the step S103 includes:
In step S301, the policy information cleaned by the concurrent process is read out to a second preset array, and the policy information in the second preset array is submitted to the Oracle database in batches by using a commit command.
After a concurrent process finishes cleaning the data, the embodiment of the invention reads the cleaned data out of the concurrent process and caches it in a second preset array. Similarly, when the number of records read reaches a specified threshold, the records in the second preset array are submitted as one batch to the Oracle database for storage. Optionally, the specified threshold may be 5000 records per batch. After 5000 records have been read in a loop, the 5000 records are written from the second preset array to the Oracle database for storage, which further reduces I/O overhead.
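Committing once per batch instead of once per record can be sketched with Python's built-in `sqlite3` module, which stands in here for the Oracle database purely for illustration. The table name `result`, its columns, and the functions `make_demo_db` and `store_in_batches` are hypothetical.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical result table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE result (policy_no TEXT, cleaned_value REAL)")
    return conn


def store_in_batches(conn: sqlite3.Connection, cleaned_rows, batch_size: int = 5000) -> int:
    """Insert cleaned rows and commit once per batch; return the number of commits."""
    insert_sql = "INSERT INTO result (policy_no, cleaned_value) VALUES (?, ?)"
    buffer = []  # plays the role of the second preset array
    commits = 0
    for row in cleaned_rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            conn.executemany(insert_sql, buffer)
            conn.commit()
            commits += 1
            buffer = []
    if buffer:  # flush the final, possibly smaller, batch
        conn.executemany(insert_sql, buffer)
        conn.commit()
        commits += 1
    return commits
```

With a batch size of 5000, storing 12000 cleaned rows in this sketch takes three commits rather than 12000.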
In step S302, the policy information in each batch is stored in the corresponding result table in the Oracle database according to the execution time of the policy information in the batch and the process number of the corresponding concurrent process.
In the embodiment of the invention, the Oracle database comprises primary partitions divided according to execution time and secondary partitions divided according to process number. The result table includes two fields, a proc date field and an Order num field, which determine the partition to which the data belongs: the proc date field represents the execution time, and the Order num field represents the process number. Each record read from a concurrent process carries these two attributes. When the commit command submits one batch of records to the Oracle database, each record can be stored accurately in the corresponding result table according to its proc date field and Order num field. Storing the cleaned policy information by partition makes it easier to delete policy information from historical periods and improves the efficiency of querying the policy information cleaned in the current run.
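The two-level partitioning can be pictured as grouping records by a composite key. The sketch below assumes each cleaned record is a dictionary with hypothetical keys `proc_date` and `order_num`, standing for the two fields described above; in the actual embodiment the database's own partitioning does this routing, not application code.

```python
from collections import defaultdict


def route_to_partitions(records: list) -> dict:
    """Group cleaned records by (execution date, process number)."""
    partitions = defaultdict(list)
    for rec in records:
        key = (rec["proc_date"], rec["order_num"])  # primary key part, secondary key part
        partitions[key].append(rec)
    return dict(partitions)
```

Deleting the data of one historical run then amounts to dropping every group whose first key component is that run's date, which is the convenience the text attributes to partitioned storage.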
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the order in which the steps are executed should be determined by their functions and inherent logic, and should not limit the implementation of the embodiments of the present invention.
Fig. 4 shows a schematic block diagram of a terminal provided in a second embodiment of the present invention, and only shows portions related to the embodiment of the present invention for convenience of explanation.
In an embodiment of the present invention, the terminal is used to implement the data cleaning method of any one of the embodiments of FIG. 1 to FIG. 3, and may be a software unit, a hardware unit, or a unit combining software and hardware. The terminal includes but is not limited to a computer, a server, and the like.
Referring to fig. 4, the terminal includes:
the task receiving module 41 is configured to, when a cleaning task for traditional bonus data is received, insert the cleaning task into a preset task execution table, and set an execution time corresponding to the cleaning task;
the concurrent cleaning module 42 is used for acquiring a scheduling packet through an Oracle database when the execution time is reached, and performing data cleaning on the policy information in a multi-concurrent mode according to the scheduling packet;
and the storage module 43 is configured to batch submit the cleaned policy information and store the policy information in the Oracle database.
In the embodiment of the present invention, the task receiving module 41 obtains a cleaning task for the traditional bonus data according to a trigger operation of a user on a page, and inserts the cleaning task into a preset task execution table to set the execution time of the task. For example, the task execution table may be a pala_batch_plan table, which includes a scheduled start date field for controlling the execution time. When the cleaning task is inserted into the pala_batch_plan table, the value of the scheduled start date field is modified according to the user operation to set the execution time of the cleaning task. After the execution time is set, the cleaning task is allowed to start only when the current time is greater than or equal to the execution time, which makes it convenient for the user to arrange cleaning tasks and allows CPU resources to be used reasonably.
When the execution time is reached, the concurrent cleaning module 42 obtains the scheduling packet through the Oracle database. The scheduling packet includes parameter information, flow information and other information related to the execution of the cleaning task. And then starting a plurality of concurrent processes according to the related information in the scheduling packet, and performing data cleaning on the policy information through the concurrent processes.
For example, continuing the example above, the concurrent cleaning module 42 may automatically obtain the scheduling packet according to the execution time in the scheduled start date field, and start several concurrent processes to execute the cleaning task automatically when the execution time is reached.
Compared with the prior art, the embodiment of the invention optimizes the cleaning process: data is cleaned in a multi-concurrency mode, and the policy information to be processed by each concurrent process is screened out automatically, which ensures that no policy information is processed repeatedly. The concurrent cleaning module 42 further includes:
a starting unit 421, configured to read a concurrent configuration table, and start a plurality of concurrent processes according to the concurrent configuration table;
the allocating unit 422 is configured to obtain a remainder between a last digit of a policy number corresponding to policy information and a total number of concurrent processes, and allocate the policy information to the concurrent processes corresponding to the remainder;
the cleaning unit 423 is configured to read policy information to be processed by using the cursor, cache the read policy information in a first preset array, submit the policy information to the allocated concurrent processes in batches, and perform data cleaning on the policy information by the concurrent processes according to a preset data cleaning algorithm.
In the embodiment of the present invention, the concurrency configuration table is used for the service engineer to configure the number of concurrent processes. The terminal can start a corresponding number of concurrent processes according to the concurrent configuration table to prepare for data cleaning.
Optionally, the preparation action before reading the concurrent configuration table in the embodiment of the present invention may further include screening policy information, and the terminal further includes:
the screening module 44 is configured to obtain a binary configuration table and a policy information basic configuration table before reading the concurrent configuration table, and screen policy information to be cleaned from the Oracle database according to the binary configuration table and the policy information basic configuration table; and to read a switch configuration table, and remove invalid data in the policy information to be cleaned according to the switch configuration table.
In the embodiment of the invention, different policy information may be generated by different sales organizations, which are divided into primary organizations and secondary organizations according to administrative divisions. The binary configuration table is used to identify which sales organization a piece of policy information belongs to. The policy information basic configuration table records which insurance types in the policy information need data cleaning. Screening the policy information to be cleaned against these two tables first reduces the cleaning operations spent on invalid policy information. Both tables also serve as the basis for subsequent data cleaning.
The switch configuration table is dynamically configured by service personnel and is used to record invalid data. According to the switch configuration table, the terminal screens out of the policy information the invalid data that service personnel are not concerned with, and inserts it into the data statistical table to await their next operation. Because the policy information to be processed is screened on the basis of the binary configuration table, the policy information basic configuration table and the switch configuration table, policy information that does not need cleaning is excluded during the cleaning preparation stage, which reduces the cleaning workload and further improves the efficiency of data cleaning.
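The screening step can be sketched as follows. This is only an illustration under stated assumptions: the patent does not disclose the layout of the three configuration tables, so the record keys (`org`, `type`, `policy_no`) and the idea that the switch configuration table lists policy numbers are hypothetical.

```python
def screen_policies(policies, org_codes, insurance_types, switched_off):
    """Split policies into those to clean and those excluded as invalid.

    org_codes       -- organizations in scope (stand-in for the binary configuration table)
    insurance_types -- types that need cleaning (stand-in for the basic configuration table)
    switched_off    -- policy numbers marked invalid (stand-in for the switch configuration table)
    All three layouts are assumptions made for this sketch.
    """
    to_clean, excluded = [], []
    for policy in policies:
        # Preliminary screening: skip anything outside the configured scope.
        if policy["org"] not in org_codes or policy["type"] not in insurance_types:
            continue
        if policy["policy_no"] in switched_off:
            # Invalid data is set aside (the text inserts it into a data statistical table).
            excluded.append(policy)
        else:
            to_clean.append(policy)
    return to_clean, excluded
```

Only the records returned in `to_clean` would go on to process allocation, which is how the preparation stage reduces the cleaning workload.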
Optionally, the preparation action after starting several concurrent processes according to the concurrent configuration table may further include state judgment, and the terminal further includes:
the state identification module 45 is configured to obtain state information of the cleaning task from a log table after a plurality of concurrent processes are started according to the concurrent configuration table; if the state information of the cleaning task is successful, the cleaning task is not executed any more; and if the state information of the cleaning task is execution failure, deleting the processed data in the plurality of concurrent processes, and re-executing the cleaning task.
Here, the embodiment of the present invention determines the state information of the cleaning task corresponding to the current execution time. If the cleaning task has been executed successfully, it is not executed again; if its execution failed, the data already processed by the concurrent processes is deleted and control jumps to the allocating unit 422 to execute the cleaning task again; if the cleaning task has not yet been executed, control jumps to the allocating unit 422 to execute it. Repeated execution of the cleaning task is thereby avoided, which reduces time consumption and CPU resource consumption.
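The three-way decision described above can be sketched as follows. The state names and the two callbacks are hypothetical; the patent states only that the state is read from a log table and does not give its encoding.

```python
from enum import Enum


class TaskState(Enum):
    # Hypothetical encoding of the states recorded in the log table.
    SUCCESS = "success"
    FAILED = "failed"
    NOT_EXECUTED = "not_executed"


def handle_cleaning_task(state, delete_processed_data, run_cleaning):
    """Apply the state rule; return True if the cleaning task was (re)run."""
    if state is TaskState.SUCCESS:
        return False  # already cleaned, do nothing
    if state is TaskState.FAILED:
        delete_processed_data()  # discard partial results before retrying
    run_cleaning()  # reached for both FAILED and NOT_EXECUTED
    return True
```

Deleting partial results before a rerun keeps a failed run from leaving duplicate rows behind.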
For the policy information to be processed, in the embodiment of the present invention, the allocating unit 422 assigns a processing process to each piece of policy information according to its policy number. First, the allocating unit 422 obtains the policy information to be processed and the corresponding policy number. It then computes the remainder of the last digit of the policy number divided by the total number of concurrent processes. Finally, it allocates the policy information to the concurrent process whose process number equals that remainder; this concurrent process is the processing process for the policy information. For example, suppose the policy number currently to be processed is 201702008 and 3 concurrent processes have been started, numbered 0, 1 and 2. The remainder of the last digit 8 divided by the total number of concurrent processes 3 is 2, so the policy information with policy number 201702008 is allocated to the concurrent process with process number 2. Processing processes are allocated in the same way for all policy information to be processed, so that each piece of policy information has exactly one corresponding processing process, policy information is not cleaned repeatedly, and the efficiency of data cleaning is improved.
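The allocation rule can be written out as a short sketch. The function names are illustrative, and the sketch assumes that every policy number ends in a decimal digit.

```python
def assign_process(policy_number: str, total_processes: int) -> int:
    """Return the process number: last digit of the policy number mod the process count."""
    if total_processes <= 0:
        raise ValueError("total_processes must be positive")
    last_digit = int(policy_number[-1])  # assumes the number ends in a decimal digit
    return last_digit % total_processes


def partition_policies(policy_numbers, total_processes):
    """Group policy numbers by the process number they are assigned to."""
    buckets = {n: [] for n in range(total_processes)}
    for number in policy_numbers:
        buckets[assign_process(number, total_processes)].append(number)
    return buckets
```

With the values from the example, `assign_process("201702008", 3)` returns 2. Because only the last digit (0 to 9) is used, the split is not perfectly even: with 3 processes, digits 0, 3, 6 and 9 map to process 0, while processes 1 and 2 each receive three digits. Any process numbered 10 or higher would never receive work.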
After process allocation is complete, the cleaning unit 423 reads the policy information to be processed from the Oracle database using a cursor. When the number of records read reaches a specified threshold, the records read are submitted as one batch to the corresponding concurrent process, which cleans them according to a preset data cleaning algorithm. Optionally, the specified threshold may be 5000 records per batch. The specific code is as follows:
FETCH c_pol_ind BULK COLLECT INTO v_pol_ind LIMIT 5000;
After 5000 records have been read, they are submitted from the first preset array to a concurrent process for cleaning, which reduces I/O overhead.
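The same batched cursor read can be illustrated outside PL/SQL. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the Oracle database; the table name `pol_ind` and the small demo batch size are assumptions chosen for the example, and `fetchmany` plays the role of `BULK COLLECT ... LIMIT`.

```python
import sqlite3


def fetch_in_batches(cursor, batch_size=5000):
    """Yield lists of rows, each holding at most batch_size rows."""
    while True:
        rows = cursor.fetchmany(batch_size)  # read one batch from the open cursor
        if not rows:
            break
        yield rows


def demo(total_rows=12, batch_size=5):
    """Build a throwaway in-memory table and return the size of each batch read."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pol_ind (policy_no TEXT)")
    conn.executemany(
        "INSERT INTO pol_ind VALUES (?)",
        [(str(201702000 + i),) for i in range(total_rows)],
    )
    cur = conn.execute("SELECT policy_no FROM pol_ind ORDER BY policy_no")
    sizes = [len(batch) for batch in fetch_in_batches(cur, batch_size)]
    conn.close()
    return sizes
```

Reading in fixed-size batches bounds the memory held by the array while avoiding a round trip per row.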
Further, the storage module 43 further includes:
the submitting unit 431 is used for reading the policy information cleaned by the concurrent process into a second preset array, and submitting the policy information in the second preset array to an Oracle database in batches by adopting a commit command;
the storage unit 432 is configured to store the policy information in each batch into a corresponding result table in the Oracle database according to the execution time of the policy information in each batch and the process number of the corresponding concurrent process.
After the concurrent process completes the cleaning of the data, the submitting unit 431 reads the cleaned data from the concurrent process and caches it in a second preset array. As before, when the number of records read reaches a specified threshold, the records read are submitted as one batch to the Oracle database for storage. Optionally, the specified threshold may be 5000 records per batch: each time 5000 records have been read in the loop, they are written from the second preset array to the Oracle database, which reduces I/O overhead.
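Batched submission can be sketched in the same way. Again `sqlite3` stands in for Oracle, and the `result` table with columns `policy_no`, `proc_date` and `order_num` is a hypothetical layout used only for illustration.

```python
import sqlite3


def make_result_table(conn):
    """Create the hypothetical result table used by this sketch."""
    conn.execute("CREATE TABLE result (policy_no TEXT, proc_date TEXT, order_num INTEGER)")


def store_in_batches(conn, rows, batch_size=5000):
    """Buffer cleaned rows and commit once per full batch; return the number of commits."""
    buffer, commits = [], 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            conn.executemany("INSERT INTO result VALUES (?, ?, ?)", buffer)
            conn.commit()  # one commit per batch, not per row
            commits += 1
            buffer.clear()
    if buffer:  # flush the final partial batch
        conn.executemany("INSERT INTO result VALUES (?, ?, ?)", buffer)
        conn.commit()
        commits += 1
    return commits
```

Committing once per batch rather than once per row is what keeps the number of write round trips low.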
In the embodiment of the invention, the Oracle database comprises primary partitions divided according to execution time and secondary partitions divided according to process number. The result table comprises two fields, a proc date field and an Order num field, which determine the partition to which the data belongs: the proc date field represents the execution time, and the Order num field represents the process number. Each record read from a concurrent process carries these two attributes. When the commit command submits a batch of data to the Oracle database, the storage unit 432 can store each record in the correct result table according to its proc date field and Order num field. Storing the cleaned policy information in partitions makes it easier to delete policy information from earlier execution times and improves the efficiency of querying the policy information cleaned in the current run.
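The effect of the two-level partitioning can be modeled in memory. This is a conceptual sketch only: the keys `proc_date` and `order_num` mirror the two fields named in the text, while the dictionary layout is an assumption and says nothing about how Oracle stores partitions.

```python
from collections import defaultdict
from datetime import date


def group_by_partition(rows):
    """Group rows by (execution date, process number), mirroring the two partition levels."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row["proc_date"], row["order_num"])].append(row)
    return dict(partitions)


def drop_history(partitions, keep_from):
    """Drop every partition whose execution date is earlier than keep_from."""
    return {key: rows for key, rows in partitions.items() if key[0] >= keep_from}
```

Because history is removed one whole partition at a time, old runs can be discarded without scanning individual rows, and a query for the current run touches only the partitions for that date.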
It should be noted that the terminal in the embodiment of the present invention may be configured to implement all technical solutions in the foregoing method embodiments, and the functions of each functional module may be implemented specifically according to the method in the foregoing method embodiments, and the specific implementation process may refer to the relevant description in the foregoing example, which is not described herein again.
In summary, in the embodiment of the present invention, when a cleaning task for traditional reddening data is received, the cleaning task is inserted into a preset task execution table and an execution time is set for it. When the execution time arrives, a scheduling packet is acquired through the Oracle database, and the policy information is cleaned in a multi-concurrency mode according to the scheduling packet. Finally, the cleaned policy information is submitted in batches and stored in the Oracle database. The conversion between file formats is thereby eliminated, which improves the efficiency of data cleaning and reduces its total time consumption.
In order to better implement the method embodiments of the present invention, the present invention further provides a related terminal for implementing the method embodiments. Fig. 5 is a schematic block diagram of a terminal according to a third embodiment of the present invention. The terminal as shown in the figure may include: one or more processors 501 (only one shown); one or more input devices 502 (only one shown), one or more output devices 503 (only one shown), and a memory 504. The processor 501, the input device 502, the output device 503, and the memory 504 are connected by a bus 506. The input device 502 is used for receiving a cleaning task of traditional reddening data; the memory 504 is used for storing program codes; the processor 501 is configured to execute the program code stored in the memory to perform the following operations:
when a cleaning task for traditional reddening data is received, inserting the cleaning task into a preset task execution table, and setting execution time corresponding to the cleaning task; when the execution time arrives, acquiring a scheduling packet through an Oracle database, and performing data cleaning on the policy information in a multi-concurrency mode according to the scheduling packet; and submitting the cleaned policy information in batches and storing the policy information in the Oracle database.
Further, the data cleaning of the policy information in a multi-concurrent manner includes:
reading a concurrent configuration table, and starting a plurality of concurrent processes according to the concurrent configuration table;
obtaining a remainder between the last digit of the policy number corresponding to the policy information and the total number of the concurrent processes, and distributing the policy information to the concurrent processes corresponding to the remainder;
and reading the policy information to be processed by using a cursor, caching the read policy information into a first preset array, submitting the policy information to the distributed concurrent processes in batches, and performing data cleaning on the policy information by the concurrent processes according to a preset data cleaning algorithm.
Further, the batch submission and storage of the washed policy information into the Oracle database includes:
reading the policy information cleaned by the concurrent process into a second preset array, and submitting the policy information in the second preset array to an Oracle database in batches by adopting a commit command;
and storing the policy information in each batch into a corresponding result table in an Oracle database according to the execution time of the policy information in each batch and the process number corresponding to the concurrent process.
Further, the processor 501 is further configured to:
after a plurality of concurrent processes are started according to the concurrent configuration table, acquiring the state information of the cleaning task from a log table;
if the state information of the cleaning task is successful, the cleaning task is not executed any more;
and if the state information of the cleaning task is execution failure, deleting the processed data in the plurality of concurrent processes, and re-executing the cleaning task.
Further, the processor 501 is further configured to:
before reading a concurrent configuration table, acquiring a binary configuration table and a policy information basic configuration table, and screening policy information to be cleaned from the Oracle database according to the binary configuration table and the policy information basic configuration table;
and reading a switch configuration table, and removing invalid data in the policy information to be cleaned according to the switch configuration table.
It should be understood that, in the embodiment of the present invention, the processor 501 may be a Central Processing Unit (CPU) and/or a Graphics Processing Unit (GPU), and may also be combined with other general-purpose processors, Digital Signal Processors (DSP), Application Specific Integrated Circuits (ASIC), Field-Programmable Gate Arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like.
The input device 502 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, a communication module (such as a Wi-Fi module, a 2G/3G/4G network module), a physical button, and the like.
The output device 503 may include a display (LCD, etc.), speakers, etc. The display may be used to display information entered by the user or provided to the user. The display may include a display panel, which may optionally be configured as a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch pad may be overlaid on the display; when the touch pad detects a touch operation on or near it, the touch operation is transmitted to the processor 501 to determine the type of the touch event, and the processor 501 then provides a corresponding visual output on the display according to the type of the touch event.
In a specific implementation, the processor 501, the input device 502, the output device 503, and the memory 504 described in the embodiments of the present invention may execute the implementation manner described in the embodiments of the method for data cleansing provided in the embodiments of the present invention, and details are not described here again.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed method and terminal can be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division into modules or units is only a logical division, and other divisions are possible in an actual implementation. A plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units and modules in the embodiments of the present invention may be integrated into one processing unit, or each unit and module may exist alone physically, or two or more units and modules may be integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method of data cleansing, the method comprising:
when a cleaning task for traditional reddening data is received, inserting the cleaning task into a preset task execution table, and setting execution time corresponding to the cleaning task;
when the execution time reaches, acquiring a scheduling packet through an Oracle database, and performing data cleaning on the policy information in a multi-concurrency mode according to the scheduling packet;
submitting the cleaned policy information in batches and storing the policy information in the Oracle database;
the data cleaning of the policy information by adopting a multi-concurrency mode comprises the following steps:
reading a concurrent configuration table, and starting a plurality of concurrent processes according to the concurrent configuration table;
obtaining a remainder between the last digit of the policy number corresponding to the policy information and the total number of the concurrent processes, and distributing the policy information to the concurrent processes corresponding to the remainder;
and reading the policy information to be processed by using a cursor, caching the read policy information into a first preset array, submitting the policy information to the distributed concurrent processes in batches, and performing data cleaning on the policy information by the concurrent processes according to a preset data cleaning algorithm.
2. The method of data cleansing of claim 1, wherein batch submitting and storing the cleansed policy information into the Oracle database comprises:
reading the policy information cleaned by the concurrent process into a second preset array, and submitting the policy information in the second preset array to an Oracle database in batches by adopting a commit command;
and storing the policy information in each batch into a corresponding result table in an Oracle database according to the execution time of the policy information in each batch and the process number corresponding to the concurrent process.
3. The method of data cleansing as claimed in claim 1, wherein after starting a number of concurrent processes according to the concurrent configuration table, the method further comprises:
acquiring the state information of the cleaning task from a log table;
if the state information of the cleaning task is successful, the cleaning task is not executed any more;
and if the state information of the cleaning task is execution failure, deleting the processed data in the plurality of concurrent processes, and re-executing the cleaning task.
4. The method of data cleansing of claim 1, wherein prior to reading the concurrent configuration table, the method further comprises:
acquiring a binary configuration table and a policy information basic configuration table, and screening policy information to be cleaned from the Oracle database according to the binary configuration table and the policy information basic configuration table;
and reading a switch configuration table, and removing invalid data in the policy information to be cleaned according to the switch configuration table.
5. A terminal, characterized in that the terminal comprises:
the task receiving module is used for inserting the cleaning task into a preset task execution table when the cleaning task of the traditional reddening data is received, and setting the execution time corresponding to the cleaning task;
the concurrent cleaning module is used for acquiring a scheduling packet through an Oracle database when the execution time is up, and cleaning the data of the policy information in a multi-concurrent mode according to the scheduling packet;
the storage module is used for submitting the cleaned policy information in batches and storing the policy information into the Oracle database;
the concurrent cleaning module includes:
the starting unit is used for reading the concurrent configuration table and starting a plurality of concurrent processes according to the concurrent configuration table;
the distribution unit is used for solving a remainder between the last digit of the policy number corresponding to the policy information and the total number of the concurrent processes, and distributing the policy information to the concurrent processes corresponding to the remainder;
and the cleaning unit is used for reading the policy information to be processed by using the cursor, caching the read policy information into a first preset array, submitting the policy information to the distributed concurrent processes in batches, and cleaning the policy information by the concurrent processes according to a preset data cleaning algorithm.
6. The terminal of claim 5, wherein the storage module comprises:
the submitting unit is used for reading the policy information cleaned by the concurrent process into a second preset array and submitting the policy information in the second preset array to an Oracle database in batches by adopting a commit command;
and the storage unit is used for storing the policy information in each batch into a corresponding result table in the Oracle database according to the execution time of the policy information in each batch and the process number corresponding to the concurrent process.
7. The terminal of claim 5, wherein the terminal further comprises:
the state identification module is used for acquiring the state information of the cleaning task from the log table after starting a plurality of concurrent processes according to the concurrent configuration table; if the state information of the cleaning task is successful, the cleaning task is not executed any more; and if the state information of the cleaning task is execution failure, deleting the processed data in the plurality of concurrent processes, and re-executing the cleaning task.
8. The terminal of claim 5, wherein the terminal further comprises:
the screening module is used for acquiring a binary configuration table and a policy information basic configuration table before reading the concurrent configuration table, and screening policy information to be cleaned from the Oracle database according to the binary configuration table and the policy information basic configuration table; and reading a switch configuration table, and removing invalid data in the policy information to be cleaned according to the switch configuration table.
CN201710221427.8A 2017-04-06 2017-04-06 Data cleaning method and terminal Active CN107688592B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710221427.8A CN107688592B (en) 2017-04-06 2017-04-06 Data cleaning method and terminal
PCT/CN2018/074858 WO2018184418A1 (en) 2017-04-06 2018-01-31 Data cleaning method, terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710221427.8A CN107688592B (en) 2017-04-06 2017-04-06 Data cleaning method and terminal

Publications (2)

Publication Number Publication Date
CN107688592A CN107688592A (en) 2018-02-13
CN107688592B true CN107688592B (en) 2020-03-17

Family

ID=61152355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710221427.8A Active CN107688592B (en) 2017-04-06 2017-04-06 Data cleaning method and terminal

Country Status (2)

Country Link
CN (1) CN107688592B (en)
WO (1) WO2018184418A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925772A (en) * 2019-12-06 2021-06-08 北京沃东天骏信息技术有限公司 Data dynamic splitting method and device
CN111597180A (en) * 2020-05-19 2020-08-28 山东汇贸电子口岸有限公司 Data cleaning method of OTRS system based on storage process
CN112800043A (en) * 2021-02-05 2021-05-14 凯通科技股份有限公司 Internet of things terminal information extraction method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060700B1 (en) * 2008-12-08 2011-11-15 Nvidia Corporation System, method and frame buffer logic for evicting dirty data from a cache using counters and data types
CN103593352A (en) * 2012-08-15 2014-02-19 阿里巴巴集团控股有限公司 Method and device for cleaning mass data
CN106202346A (en) * 2016-06-29 2016-12-07 浙江理工大学 A kind of data load and clean engine, dispatch and storage system
CN106294745A (en) * 2016-08-10 2017-01-04 东方网力科技股份有限公司 Big data cleaning method and device
CN106294492A (en) * 2015-06-08 2017-01-04 深圳中兴网信科技有限公司 Data cleaning method and cleaning engine

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514205A (en) * 2012-06-27 2014-01-15 中国电信股份有限公司 Mass data processing method and system
CN103942104A (en) * 2014-04-23 2014-07-23 北京金山网络科技有限公司 Task managing method and device
CN105205105B (en) * 2015-08-27 2019-04-16 浪潮集团有限公司 A kind of ETL process system and processing method based on storm
CN105787008A (en) * 2016-02-23 2016-07-20 浪潮通用软件有限公司 Data deduplicating and cleaning method for large data volume
CN106484915B (en) * 2016-11-03 2019-10-11 国家电网公司信息通信分公司 A kind of cleaning method and system of mass data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于软件总线模型的数据清洗系统的研究与实现";赵鹏;《中国优秀硕士学位论文全文数据库 信息科技辑》;20111215;论文正文第3章、第4章 *

Also Published As

Publication number Publication date
CN107688592A (en) 2018-02-13
WO2018184418A1 (en) 2018-10-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant