CN107688592A - Method and terminal for data cleansing

Publication number
CN107688592A
Authority
CN
China
Prior art keywords
policy information
concurrent
cleaning
cleaning task
information
Legal status
Granted
Application number
CN201710221427.8A
Other languages
Chinese (zh)
Other versions
CN107688592B (en)
Inventor
李治
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201710221427.8A
Priority to PCT/CN2018/074858 (WO2018184418A1)
Publication of CN107688592A
Application granted
Publication of CN107688592B
Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention is applicable to the field of computer technology and provides a method and terminal for data cleansing. The method includes: when a cleaning task for traditional payout data is received, inserting the cleaning task into a preset task execution table and setting an execution time for the cleaning task; when the execution time is reached, obtaining a scheduling package through an Oracle database and, according to the scheduling package, cleansing policy information using multiple concurrent processes; and committing the cleansed policy information in batches and storing it in the Oracle database. The invention eliminates the steps of converting between file formats and thereby addresses the complex process, redundant steps, and low efficiency of prior-art cleansing of traditional payout data. It improves the efficiency of data cleansing and reduces its overall time cost.

Description

Method and terminal for data cleansing
Technical field
The invention belongs to the field of computer technology, and in particular relates to a method and terminal for data cleansing.
Background
When cleansing traditional payout data, the prior art requires the user to download the data to be cleansed from the database in advance, generate a txt file from that data, and then run the calculation in the Prophet software. The files generated by the Prophet software must in turn be converted back into txt files before they can be uploaded to the database. In addition, the Prophet software runs slowly, and the cleansing step alone often takes more than 12 hours. The process of cleansing traditional payout data is therefore complex, its steps are redundant, and its efficiency is very low.
A new technical solution is therefore needed to solve the above technical problem.
Summary of the invention
In view of this, embodiments of the present invention provide a method and terminal for data cleansing, to solve the prior-art problems of a complex process, redundant steps, and low efficiency when cleansing traditional payout data.
In a first aspect, a method of data cleansing is provided, the method including:
when a cleaning task for traditional payout data is received, inserting the cleaning task into a preset task execution table and setting an execution time for the cleaning task;
when the execution time is reached, obtaining a scheduling package through an Oracle database and, according to the scheduling package, cleansing policy information using multiple concurrent processes;
committing the cleansed policy information in batches and storing it in the Oracle database.
Further, cleansing the policy information using multiple concurrent processes includes:
reading a concurrency configuration table and starting a number of concurrent processes according to the concurrency configuration table;
computing the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes, and assigning the policy information to the concurrent process corresponding to the remainder;
reading the pending policy information with a cursor, caching the policy information read into a first preset array, and submitting it in batches to the assigned concurrent process, which cleanses the policy information according to a preset data cleansing algorithm.
Further, committing the cleansed policy information in batches and storing it in the Oracle database includes:
reading the policy information cleansed by the concurrent processes into a second preset array, and using the commit command to commit the policy information in the second preset array to the Oracle database in batches;
storing the policy information of each batch in the corresponding result table in the Oracle database, according to the execution time of the policy information in the batch and the process number of the corresponding concurrent process.
Further, after the concurrent processes are started according to the concurrency configuration table, the method also includes:
obtaining the status information of the cleaning task from a log table;
if the status information of the cleaning task indicates successful execution, not executing this cleaning task again;
if the status information of the cleaning task indicates failed execution, deleting the data already processed by the concurrent processes and re-executing this cleaning task.
Further, before the concurrency configuration table is read, the method also includes:
obtaining a first- and second-level organization configuration table and a policy information basic configuration table, and screening the policy information to be cleansed out of the Oracle database according to these two tables;
reading a switch configuration table, and removing invalid data from the policy information to be cleansed according to the switch configuration table.
In a second aspect, a terminal is provided, the terminal including:
a task receiving module, configured to, when a cleaning task for traditional payout data is received, insert the cleaning task into a preset task execution table and set an execution time for the cleaning task;
a concurrent cleaning module, configured to, when the execution time is reached, obtain a scheduling package through an Oracle database and, according to the scheduling package, cleanse policy information using multiple concurrent processes;
a storage module, configured to commit the cleansed policy information in batches and store it in the Oracle database.
Compared with the prior art, the embodiments of the present invention insert a cleaning task for traditional payout data into a preset task execution table when the task is received, and set an execution time for it. When the execution time is reached, a scheduling package is obtained through the Oracle database and, according to the scheduling package, the policy information is cleansed using multiple concurrent processes. Finally, the cleansed policy information is committed in batches and stored in the Oracle database. This eliminates the steps of converting between file formats, improves the efficiency of data cleansing, and reduces its overall time cost.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the data cleansing method provided by the first embodiment of the present invention;
Fig. 2 is a detailed flowchart of step S102 of the data cleansing method provided by the first embodiment of the present invention;
Fig. 3 is a detailed flowchart of step S103 of the data cleansing method provided by the first embodiment of the present invention;
Fig. 4 is a schematic block diagram of the terminal provided by the second embodiment of the present invention;
Fig. 5 is a schematic block diagram of the terminal provided by the third embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
In the embodiments of the present invention, when a cleaning task for traditional payout data is received, the cleaning task is inserted into a preset task execution table and an execution time is set for it. When the execution time is reached, a scheduling package is obtained through the Oracle database and, according to the scheduling package, the policy information is cleansed using multiple concurrent processes. Finally, the cleansed policy information is committed in batches and stored in the Oracle database. This eliminates the steps of converting between file formats, improves the efficiency of data cleansing, and reduces its overall time cost. The embodiments of the present invention also provide a corresponding terminal. Each is described in detail below.
Fig. 1 shows the flow of the data cleansing method provided by the first embodiment of the present invention.
In the embodiments of the present invention, the data cleansing method is applied in a terminal, which includes but is not limited to a computer, a server, and the like. Referring to Fig. 1, the data cleansing method includes:
In step S101, when a cleaning task for traditional payout data is received, the cleaning task is inserted into a preset task execution table and an execution time is set for the cleaning task.
In the embodiments of the present invention, the terminal obtains the cleaning task for traditional payout data from a trigger action performed by the user on a page, inserts the cleaning task into the preset task execution table, and sets the execution time of the task according to the user's operation. By way of example, the task execution table may be the pala_batch_plan table, which includes a scheduled start date field that controls the execution time. While inserting the cleaning task into the pala_batch_plan table, the embodiment modifies the value of the scheduled start date field to set the execution time of the cleaning task. Once the execution time is set, the cleaning task is allowed to start only when the current time is greater than or equal to the execution time. This makes it easier for the user to schedule cleaning tasks and helps make reasonable use of CPU resources.
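The start condition in step S101 can be illustrated with a minimal Python sketch. It is not the patented implementation: the in-memory list stands in for the task execution table, and the names `insert_cleaning_task`, `may_start`, and the `scheduled_start` key are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for the task execution table.
task_execution_table = []

def insert_cleaning_task(task_name, scheduled_start):
    """Insert a task row together with its execution time."""
    row = {"task": task_name, "scheduled_start": scheduled_start}
    task_execution_table.append(row)
    return row

def may_start(row, now):
    """A task may start only once the current time has reached its execution time."""
    return now >= row["scheduled_start"]

row = insert_cleaning_task("clean_payout_data", datetime(2017, 4, 6, 22, 0))
print(may_start(row, datetime(2017, 4, 6, 21, 59)))  # one minute early
print(may_start(row, datetime(2017, 4, 6, 22, 0)))   # exactly at the execution time
```

The comparison is inclusive, matching the "greater than or equal to" condition stated above.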
In step S102, when the execution time is reached, a scheduling package is obtained through the Oracle database and, according to the scheduling package, the policy information is cleansed using multiple concurrent processes.
When the execution time is reached, the embodiment obtains the scheduling package through the Oracle database. The scheduling package contains information related to executing the cleaning task, such as parameter information and procedure information. Then, according to the information in the scheduling package, several concurrent processes are started, and these concurrent processes cleanse the policy information in the traditional payout data.
By way of example, as described above, the Oracle database uses the execution time in the scheduled start date field: when the execution time is reached, it automatically obtains the scheduling package, starts multiple concurrent processes, and executes the cleaning task automatically.
Compared with the prior art, the embodiments of the present invention optimize the cleansing process. This includes cleansing the data with multiple concurrent processes, where each concurrent process automatically selects the policy information it is to process, which ensures that no policy information is processed more than once.
Optionally, Fig. 2 shows the detailed flow of cleansing the policy information using multiple concurrent processes, as provided by the first embodiment of the present invention. Referring to Fig. 2, cleansing the policy information using multiple concurrent processes includes:
In step S201, a concurrency configuration table is read, and a number of concurrent processes are started according to the concurrency configuration table.
In the embodiments of the present invention, business personnel use the concurrency configuration table to configure the number of concurrent processes. The terminal starts the corresponding number of concurrent processes according to the concurrency configuration table, in preparation for data cleansing.
Optionally, the preparatory actions before the concurrency configuration table is read also include screening the policy information. The method may also include:
obtaining a first- and second-level organization configuration table and a policy information basic configuration table, and screening the policy information to be cleansed out of the Oracle database according to these two tables;
reading a switch configuration table, and removing invalid data from the policy information to be cleansed according to the switch configuration table.
In the embodiments of the present invention, different policy information may be produced by different marketing organizations, which are divided into first-level and second-level organizations according to administrative division. The first- and second-level organization configuration table is used to identify which marketing organization a piece of policy information belongs to. The policy information basic configuration table records which insurance types in the policy information need to be cleansed. The embodiment combines the two tables to make a preliminary selection of the policy information to be cleansed, which reduces cleansing operations on policy information that does not need them. Both tables also serve as the basis for the subsequent data cleansing.
The switch configuration table is configured dynamically by business personnel and records which data in each type of policy information they consider invalid. According to the switch configuration table, the terminal filters out of the policy information the invalid data that business personnel are not concerned with and inserts the result into a data statistics table, for business personnel to use in the next step. For example, suppose a certain type of policy information contains a policyholder's name field, an age field, a sex field, a phone field, and an education field, and the switch configuration table provides a switch option for each of these fields. If business personnel consider the education field irrelevant, they can turn off the option for that field in the switch configuration table. The terminal then reads only the policyholder's name, age, sex, and phone fields of that type of policy information, according to the switch configuration table.
Here, the embodiment screens out the pending policy information based on the first- and second-level organization configuration table, the policy information basic configuration table, and the switch configuration table. Policy information that does not need cleansing and irrelevant data are excluded in advance, during the preparation stage, which reduces the cleansing workload and further improves the efficiency of data cleansing.
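The two screening stages described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's code: the record layout (`org`, `insurance_type`, `fields`), the function names, and the rule that a field with no switch entry is dropped are all hypothetical.

```python
def screen_policies(policies, org_codes, cleanable_types):
    """Keep policies from configured organizations whose insurance type needs cleansing."""
    return [p for p in policies
            if p["org"] in org_codes and p["insurance_type"] in cleanable_types]

def apply_switches(fields, switches):
    """Keep only fields whose switch is on (unlisted fields are dropped: an assumption)."""
    return {name: value for name, value in fields.items() if switches.get(name, False)}

policies = [
    {"org": "A01", "insurance_type": "life",
     "fields": {"name": "Zhang", "age": 40, "education": "BSc"}},
    {"org": "B02", "insurance_type": "auto",
     "fields": {"name": "Li", "age": 35, "education": "MSc"}},
]
# Business personnel have switched the education field off.
switches = {"name": True, "age": True, "education": False}

to_clean = screen_policies(policies, org_codes={"A01"}, cleanable_types={"life"})
cleaned_input = [apply_switches(p["fields"], switches) for p in to_clean]
print(cleaned_input)
```

Only the first policy passes the organization and insurance-type screen, and its education field is removed before cleansing.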
Optionally, the preparatory actions after the concurrent processes are started according to the concurrency configuration table may also include a status check. The method also includes:
obtaining the status information of the cleaning task from a log table;
if the status information of the cleaning task indicates successful execution, not executing this cleaning task again;
if the status information of the cleaning task indicates failed execution, deleting the data already processed by the concurrent processes and re-executing this cleaning task.
Here, the embodiment checks the status information of the cleaning task corresponding to the current execution time. If the cleaning task has already succeeded, it is not executed again. If the cleaning task failed, the data already processed by the concurrent processes is deleted, and the flow jumps to step S202 to re-execute the cleaning task. If the cleaning task has not been executed, the flow jumps to step S202 to execute it. This avoids executing a cleaning task repeatedly and reduces the time spent and the CPU resources consumed.
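The three-way status check can be sketched as a small Python function. The status strings and the function name `prepare_run` are assumptions; the patent only states that the status is read from a log table.

```python
def prepare_run(status, processed_rows):
    """Return True if cleansing should proceed to the assignment step (S202).

    status: hypothetical log-table value, "success", "failed", or None (never run).
    processed_rows: partial results left by the concurrent processes.
    """
    if status == "success":
        return False            # already done: do not run the task again
    if status == "failed":
        processed_rows.clear()  # discard partial results before re-running
    return True                 # failed or never run: execute the task

partial = ["row-1", "row-2"]
print(prepare_run("failed", partial), partial)
print(prepare_run("success", []))
```

Clearing the partial results before a re-run keeps a failed attempt from leaving duplicate rows behind.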
In step S202, the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes is computed, and the policy information is assigned to the concurrent process corresponding to the remainder.
The embodiment assigns each concurrent process a process number. After the pending policy information has been screened out, the embodiment assigns each piece of policy information to a processing process according to its policy number. First, the pending policy information and its policy number are obtained. Then the remainder of the last digit of the policy number divided by the total number of concurrent processes is computed. Finally, the policy information is assigned to the concurrent process whose process number equals the remainder; that concurrent process is the processing process for the policy information. By way of example, suppose the current pending policy number is 201702008 and 3 concurrent processes have been started, numbered 0, 1, and 2. The remainder of the last digit 8 divided by the total 3 is 2, so the policy information with policy number 201702008 is assigned to the concurrent process numbered 2. All pending policy information is assigned a processing process in the same way. Each piece of policy information therefore has exactly one processing process, which prevents it from being cleansed repeatedly and helps improve the efficiency of data cleansing.
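The assignment rule in step S202 reduces to one modulo operation. A minimal Python sketch follows; the function name `assign_process` is an assumption, and the example reproduces the policy number used above.

```python
def assign_process(policy_number, process_count):
    """Return the process number: last digit of the policy number modulo the process count."""
    last_digit = int(str(policy_number)[-1])
    return last_digit % process_count

print(assign_process("201702008", 3))  # 8 % 3 -> 2
```

Because the rule uses only the last digit (0 to 9), the remainder is always below 10, so under this rule processes numbered 10 or higher would never receive work if more than ten were configured.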
In step S203, the pending policy information is read with a cursor, the policy information read is cached into a first preset array and submitted in batches to the assigned concurrent process, and the concurrent process cleanses the policy information according to a preset data cleansing algorithm.
Here, the embodiment uses a cursor to read the pending policy information from the Oracle database. Each record read is first cached in the first preset array. When the number of records read reaches a specified threshold, the records read are submitted together, as one batch, to the corresponding concurrent process, which cleanses them according to the preset data cleansing algorithm. Optionally, the specified threshold may be 5,000 records per batch. The specific code is as follows:
FETCH c_pol_ind BULK COLLECT INTO v_pol_ind LIMIT 5000;
After 5,000 records have been read, they are submitted together from the first preset array to the concurrent process for cleansing, which helps reduce I/O overhead.
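The batching behavior of the bulk fetch can be mimicked in Python for illustration. This sketch does not talk to Oracle; `fetch_batches` is a hypothetical helper that groups any iterable of rows into batches of at most `limit` rows, in the spirit of the LIMIT 5000 clause.

```python
from itertools import islice

def fetch_batches(rows, limit=5000):
    """Yield successive lists of at most `limit` rows until the source is exhausted."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, limit))
        if not batch:
            break
        yield batch

sizes = [len(batch) for batch in fetch_batches(range(12000))]
print(sizes)  # [5000, 5000, 2000]
```

The last batch may be smaller than the limit, so a consumer should process every batch it receives rather than stop when a batch is not full.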
In step S103, the cleansed policy information is committed in batches and stored in the Oracle database.
Optionally, Fig. 3 shows the detailed flow of step S103 of the data cleansing method provided by the first embodiment of the present invention. Referring to Fig. 3, step S103 includes:
In step S301, the policy information cleansed by the concurrent processes is read into a second preset array, and the commit command is used to commit the policy information in the second preset array to the Oracle database in batches.
After a concurrent process finishes cleansing the data, the embodiment reads the cleansed records from the concurrent process and caches them in the second preset array. As before, when the number of records read reaches the specified threshold, the records in the second preset array are committed to the Oracle database for storage as one batch. Optionally, the specified threshold may be 5,000 records per batch. After 5,000 records have been read in a loop, they are written together from the second preset array to the Oracle database for storage, which further reduces I/O overhead.
In step S302, the policy information of each batch is stored in the corresponding result table in the Oracle database, according to the execution time of the policy information in the batch and the process number of the corresponding concurrent process.
In the embodiments of the present invention, the Oracle database includes first-level partitions divided by execution time and second-level partitions divided by process number. The result table includes two fields that determine the partition a record belongs to, the proc date field and the Order num field. The proc date field represents the execution time, and the Order num field represents the process number. Every record read from a concurrent process carries both attributes. When a batch of records is committed to the Oracle database with the commit command, each record can be stored accurately in the corresponding result table according to its proc date field and Order num field. Because the cleansed policy information is stored by partition, historical policy information can be deleted conveniently by time, and queries on the policy information of the current cleansing run are more efficient.
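The two-level partition routing can be sketched in Python with a dictionary keyed by execution date and process number. This is only an analogy for the database partitions: the key names `proc_date` and `order_num` are hypothetical spellings of the two fields named above, and `store_batch` and `drop_date` are illustrative helpers.

```python
from collections import defaultdict

# Hypothetical stand-in for the partitioned result table:
# key = (execution date, process number), value = records in that partition.
result_partitions = defaultdict(list)

def store_batch(batch):
    """Route each record to the partition given by its two partitioning fields."""
    for record in batch:
        result_partitions[(record["proc_date"], record["order_num"])].append(record)

def drop_date(proc_date):
    """Delete all partitions for one execution date, e.g. to purge historical data."""
    for key in [k for k in result_partitions if k[0] == proc_date]:
        del result_partitions[key]

store_batch([
    {"proc_date": "2017-04-05", "order_num": 0, "policy_no": "201702000"},
    {"proc_date": "2017-04-06", "order_num": 2, "policy_no": "201702008"},
])
drop_date("2017-04-05")
print(sorted(result_partitions))
```

Purging by date touches only the partitions for that date, which mirrors the convenience of deleting historical data described above.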
It should be understood that, in the above embodiments, the sequence numbers of the steps do not imply an order of execution. The order in which the steps are executed should be determined by their functions and internal logic, and does not limit the implementation of the embodiments of the present invention in any way.
Fig. 4 shows a schematic block diagram of the terminal provided by the second embodiment of the present invention. For ease of description, only the parts related to the embodiment are shown.
In the embodiments of the present invention, the terminal is used to implement the data cleansing method described in any of the embodiments of Fig. 1 to Fig. 3, and its units may be software units, hardware units, or units combining software and hardware. The terminal includes but is not limited to a computer, a server, and the like.
Referring to Fig. 4, the terminal includes:
a task receiving module 41, configured to, when a cleaning task for traditional payout data is received, insert the cleaning task into a preset task execution table and set an execution time for the cleaning task;
a concurrent cleaning module 42, configured to, when the execution time is reached, obtain a scheduling package through the Oracle database and, according to the scheduling package, cleanse policy information using multiple concurrent processes;
a storage module 43, configured to commit the cleansed policy information in batches and store it in the Oracle database.
In the embodiments of the present invention, the task receiving module 41 obtains the cleaning task for traditional payout data from a trigger action performed by the user on a page, inserts the cleaning task into the preset task execution table, and sets the execution time of the task. By way of example, the task execution table may be the pala_batch_plan table, which includes a scheduled start date field that controls the execution time. While inserting the cleaning task into the pala_batch_plan table, the embodiment modifies the value of the scheduled start date field according to the user's operation to set the execution time of the cleaning task. Once the execution time is set, the cleaning task is allowed to start only when the current time is greater than or equal to the execution time. This makes it easier for the user to schedule cleaning tasks and helps make reasonable use of CPU resources.
When the execution time is reached, the concurrent cleaning module 42 obtains the scheduling package through the Oracle database. The scheduling package contains information related to executing the cleaning task, such as parameter information and procedure information. Then, according to the information in the scheduling package, several concurrent processes are started, and these concurrent processes cleanse the policy information.
By way of example, as described above, the concurrent cleaning module 42 can use the execution time in the scheduled start date field: when the execution time is reached, it automatically obtains the scheduling package, starts several concurrent processes, and executes the cleaning task automatically.
Compared with the prior art, the embodiments of the present invention optimize the cleansing process. This includes cleansing the data with multiple concurrent processes, where each concurrent process automatically selects the policy information it is to process, which ensures that no policy information is processed more than once. The concurrent cleaning module 42 also includes:
a start unit 421, configured to read the concurrency configuration table and start a number of concurrent processes according to the concurrency configuration table;
an allocation unit 422, configured to compute the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes, and to assign the policy information to the concurrent process corresponding to the remainder;
a cleaning unit 423, configured to read the pending policy information with a cursor, cache the policy information read into a first preset array, and submit it in batches to the assigned concurrent process, which cleanses the policy information according to a preset data cleansing algorithm.
In the embodiments of the present invention, business personnel use the concurrency configuration table to configure the number of concurrent processes. The terminal starts the corresponding number of concurrent processes according to the concurrency configuration table, in preparation for data cleansing.
Optionally, the preparatory actions performed before the concurrent allocation list is read may also include screening the policy information. The terminal further includes:
Screening module 44, configured to, before the concurrent allocation list is read, obtain the first- and second-level allocation list and the policy information basic configuration table, and screen out the policy information to be cleaned from the Oracle database according to these two tables; and to read the switch allocation list and remove the invalid data from the policy information to be cleaned according to the switch allocation list.
In this embodiment, different policy information may be produced by different marketing organizations, which are divided into first-level and second-level institutions according to administrative division. The first- and second-level allocation list is used to identify which marketing organization a piece of policy information belongs to. The policy information basic configuration table records which insurance types in the policy information need data cleaning. The embodiment performs a preliminary screening of the policy information to be cleaned based on these two tables, which reduces data cleaning operations on invalid policy information. The two tables also serve as the basis for the subsequent data cleaning.
The switch allocation list is configured dynamically by business personnel and lets them specify which data is invalid. According to the switch allocation list, the terminal filters out of the policy information the invalid data that business personnel are not concerned with and inserts it into the data statistics, so that business personnel can carry out the next step. By screening the pending policy information based on the first- and second-level allocation list, the policy information basic configuration table, and the switch allocation list, the embodiment excludes policy information that does not need cleaning during the cleaning preparation stage. This reduces the cleaning workload and further improves the efficiency of data cleaning.
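The two-stage screening described above can be sketched as follows. This is only an illustrative sketch in Python, not the patent's implementation: the `Policy` fields, the function name `screen_policies`, and the idea of expressing the switch configuration as a set of status values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    # All field names are hypothetical; the patent does not name them.
    policy_no: str
    org_code: str         # marketing organization that produced the policy
    insurance_type: str
    status: str


def screen_policies(policies, org_codes, cleanable_types, invalid_statuses):
    """Return (to_clean, excluded) after the two screening stages."""
    to_clean, excluded = [], []
    for p in policies:
        # Stage 1: keep only configured organizations and insurance types.
        if p.org_code not in org_codes or p.insurance_type not in cleanable_types:
            continue
        # Stage 2: the switch configuration marks data the business side
        # treats as invalid; it is set aside for statistics, not cleaned.
        if p.status in invalid_statuses:
            excluded.append(p)
        else:
            to_clean.append(p)
    return to_clean, excluded
```

Keeping the excluded records in a separate list mirrors the text's statement that invalid data is recorded for business personnel rather than silently dropped.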
Optionally, the preparatory actions performed after several concurrent processes are started according to the concurrent allocation list may also include a status check. The terminal further includes:
State recognition module 45, configured to, after several concurrent processes are started according to the concurrent allocation list, obtain the status information of the cleaning task from the log table; if the status information of the cleaning task is "executed successfully", the cleaning task is not executed again; if the status information of the cleaning task is "execution failed", the data already processed by the concurrent processes is deleted and the cleaning task is re-executed.
Here, the embodiment checks the status information of the cleaning task corresponding to the current execution time. If the cleaning task executed successfully, it is not executed again. If the cleaning task failed, the data already processed by the concurrent processes is deleted and control jumps to the allocation unit 422 to re-execute the cleaning task. If the cleaning task has not been executed, control jumps to the allocation unit 422 to execute it. This avoids repeating the cleaning task and reduces both elapsed time and CPU consumption.
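The three-way status check can be written as a small decision function. The sketch below is in Python and only illustrates the branching; the enum names and the use of `None` for "no log entry yet" are assumptions, since the patent does not specify how the log table encodes status.

```python
from enum import Enum
from typing import Optional


class TaskStatus(Enum):
    SUCCESS = "success"
    FAILED = "failed"


class Action(Enum):
    SKIP = "skip"                              # already done, do nothing
    ROLLBACK_AND_RERUN = "rollback_and_rerun"  # delete partial output, run again
    RUN = "run"                                # never executed, run now


def decide_action(status: Optional[TaskStatus]) -> Action:
    """Map the logged status of a cleaning task to the next action."""
    if status is TaskStatus.SUCCESS:
        return Action.SKIP
    if status is TaskStatus.FAILED:
        return Action.ROLLBACK_AND_RERUN
    # No status recorded: the task has not been executed yet.
    return Action.RUN
```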
For the pending policy information, the allocation unit 422 assigns each piece of policy information to a processing process according to its policy number. First, the allocation unit 422 obtains the pending policy information and its corresponding policy number. It then computes the remainder of the last digit of the policy number divided by the total number of concurrent processes. Finally, it assigns the policy information to the concurrent process whose process number equals that remainder; this concurrent process is the processing process for that policy information. For example, if the current pending policy number is 201702008 and three concurrent processes have been started, numbered 0, 1, and 2, then the remainder of the last digit 8 divided by the process total 3 is 2, so the policy information with policy number 201702008 is assigned to the concurrent process numbered 2. In the same way, a processing process is assigned to every piece of pending policy information. This ensures that each piece of policy information has exactly one corresponding processing process, avoids policy information being cleaned repeatedly, and improves the efficiency of data cleaning.
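The remainder rule can be expressed in a few lines. The following Python sketch reproduces the worked example from the text; the function name `assign_process` and the choice to pass the policy number as a string are assumptions made for illustration.

```python
def assign_process(policy_no: str, process_count: int) -> int:
    """Return the process number for a policy: last digit mod process count."""
    if process_count <= 0:
        raise ValueError("process_count must be positive")
    last_digit = int(policy_no[-1])  # raises ValueError if not a digit
    return last_digit % process_count


# Worked example from the text: policy 201702008 with 3 processes (0, 1, 2).
print(assign_process("201702008", 3))  # 8 % 3 == 2
```

Because only the last digit is used, the mapping is deterministic and needs no shared state, but it can address at most ten distinct processes and the load is not perfectly even: with three processes, the digits 0, 3, 6, and 9 all map to process 0, while processes 1 and 2 each receive three digits.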
After process allocation is complete, the cleaning unit 423 uses a cursor to read the pending policy information from the Oracle database. Each record read is first cached in a first preset array. When the number of records read reaches a specified threshold, the records read so far are submitted together as one batch to the corresponding concurrent process, which performs data cleaning according to a preset data cleaning algorithm. Optionally, the specified threshold may be 5000 records per batch. The specific code is as follows:
FETCH c_pol_ind BULK COLLECT INTO v_pol_ind LIMIT 5000;
After 5000 records have been read, they are submitted together from the first preset array to the concurrent process for cleaning, which reduces I/O consumption.
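The buffering behaviour of the bulk fetch can be illustrated outside the database as well. The Python sketch below is an analogue of the batch-of-5000 read, not the patent's PL/SQL; the generator name `read_in_batches` is an assumption, and any iterable of rows stands in for the database cursor.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def read_in_batches(rows: Iterable[T], batch_size: int = 5000) -> Iterator[List[T]]:
    """Buffer rows and yield them in batches of at most batch_size."""
    buffer: List[T] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == batch_size:
            yield buffer      # hand the full batch to the worker
            buffer = []       # start a fresh buffer for the next batch
    if buffer:
        yield buffer          # final, possibly smaller, batch


# 12001 rows produce two full batches and one remainder batch.
sizes = [len(batch) for batch in read_in_batches(range(12001))]
print(sizes)  # [5000, 5000, 2001]
```

Handing over whole batches rather than single rows is what reduces the number of round trips, which is the I/O saving the text refers to.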
Further, the storage module 43 also includes:
Submission unit 431, configured to read the policy information cleaned by the concurrent processes into a second preset array, and to commit the policy information in the second preset array to the Oracle database in batches using the commit command;
Storage unit 432, configured to store the policy information of each batch in the corresponding result table in the Oracle database according to the execution time of the policy information in the batch and the process number of the corresponding concurrent process.
After a concurrent process finishes cleaning the data, the submission unit 431 reads the cleaned records from the concurrent process and caches them in the second preset array. Likewise, when the number of records read reaches the specified threshold, the records read so far are committed as one batch to the Oracle database for storage. Optionally, the specified threshold may be 5000 records per batch. After 5000 records have been read in a loop, they are written together from the second preset array to the Oracle database for storage, which reduces I/O consumption.
In embodiments of the present invention, the Oracle database includes first-level partitions divided by execution time and second-level partitions divided by process number. The result table includes two fields used to determine the partition to which a record belongs, the proc date field and the Order num field: the proc date field represents the execution time, and the Order num field represents the process number. Every record read from a concurrent process carries both the proc date and the Order num attributes. When a batch of records is committed to the Oracle database with the commit command, the storage unit 432 can store each record accurately in the corresponding result table according to its proc date and Order num fields. Storing the cleaned policy information in partitions both makes it easy to delete historical policy information and improves the efficiency of querying the policy information of the current cleaning run.
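The two-level routing by execution time and process number can be sketched as a grouping step. The Python below is an illustration only: in the patent the database's partitioning performs this routing, and the dictionary keys `proc_date` and `order_num` are assumed spellings of the two fields named in the text.

```python
from collections import defaultdict
from datetime import date


def group_by_partition(records):
    """Group cleaned records by (execution date, process number)."""
    partitions = defaultdict(list)
    for rec in records:
        # First-level key: execution date; second-level key: process number.
        key = (rec["proc_date"], rec["order_num"])
        partitions[key].append(rec)
    return dict(partitions)


batch = [
    {"policy_no": "201702008", "proc_date": date(2017, 4, 6), "order_num": 2},
    {"policy_no": "201702003", "proc_date": date(2017, 4, 6), "order_num": 0},
    {"policy_no": "201702005", "proc_date": date(2017, 4, 6), "order_num": 2},
]
for key, rows in group_by_partition(batch).items():
    print(key, [r["policy_no"] for r in rows])
```

With this layout, discarding an old run means dropping everything under one execution date, and a query for the current run touches only that date's partitions.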
It should be noted that the terminal in this embodiment can be used to implement all the technical solutions in the above method embodiments. The functions of its functional modules can be implemented according to the methods in the above method embodiments; for the specific implementation process, refer to the related descriptions in the above embodiments, which are not repeated here.
In summary, in embodiments of the present invention, when a cleaning task for traditional payout data is received, the cleaning task is inserted into a preset task execution table and an execution time is set for the cleaning task. When the execution time arrives, a scheduling package is obtained through the Oracle database and, according to the scheduling package, data cleaning is performed on the policy information in a multi-concurrent manner. Finally, the cleaned policy information is committed in batches and stored in the Oracle database. This eliminates the step of converting between file formats, effectively improves the efficiency of data cleaning, and reduces its overall time consumption.
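The first step of this flow, recording a task with an execution time and later picking up the tasks that are due, can be sketched as below. This Python sketch uses an in-memory list in place of the task execution table; the class and method names are assumptions, and the patent's actual scheduling is performed through a package in the Oracle database.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class CleaningTask:
    task_id: int
    execute_at: datetime
    done: bool = False


class TaskExecutionTable:
    """In-memory stand-in for the preset task execution table."""

    def __init__(self) -> None:
        self._tasks: List[CleaningTask] = []

    def insert(self, task_id: int, execute_at: datetime) -> CleaningTask:
        # Record the task together with its scheduled execution time.
        task = CleaningTask(task_id, execute_at)
        self._tasks.append(task)
        return task

    def due_tasks(self, now: datetime) -> List[CleaningTask]:
        # A task is due once its execution time has arrived and it is not done.
        return [t for t in self._tasks if not t.done and t.execute_at <= now]
```

A scheduler would call `due_tasks` periodically and start the concurrent cleaning for each task returned.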
To better implement the above method embodiments, the present invention also provides a related terminal for carrying them out. Fig. 5 is a schematic block diagram of the terminal provided by the third embodiment of the present invention. The terminal shown may include: one or more processors 501 (only one is shown in the figure); one or more input devices 502 (only one is shown in the figure); one or more output devices 503 (only one is shown in the figure); and a memory 504. The processor 501, input device 502, output device 503, and memory 504 are connected by a bus 506. The input device 502 is used to receive the cleaning task for traditional payout data; the memory 504 is used to store program code; the processor 501 is used to execute the program code stored in the memory to perform the following operations:
When a cleaning task for traditional payout data is received, inserting the cleaning task into a preset task execution table and setting an execution time for the cleaning task; when the execution time arrives, obtaining a scheduling package through the Oracle database and, according to the scheduling package, performing data cleaning on the policy information in a multi-concurrent manner; committing the cleaned policy information in batches and storing it in the Oracle database.
Further, performing data cleaning on the policy information in a multi-concurrent manner includes:
Reading the concurrent allocation list, and starting several concurrent processes according to the concurrent allocation list;
Computing the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes, and assigning the policy information to the concurrent process corresponding to the remainder;
Reading the pending policy information using a cursor, caching the policy information read in a first preset array, and submitting it in batches to the assigned concurrent process, which performs data cleaning on the policy information according to a preset data cleaning algorithm.
Further, committing the cleaned policy information in batches and storing it in the Oracle database includes:
Reading the policy information cleaned by the concurrent processes into a second preset array, and committing the policy information in the second preset array to the Oracle database in batches using the commit command;
Storing the policy information of each batch in the corresponding result table in the Oracle database according to the execution time of the policy information in the batch and the process number of the corresponding concurrent process.
Further, the processor 501 is also used to:
After several concurrent processes are started according to the concurrent allocation list, obtain the status information of the cleaning task from the log table;
If the status information of the cleaning task is "executed successfully", not execute the cleaning task again;
If the status information of the cleaning task is "execution failed", delete the data already processed by the concurrent processes and re-execute the cleaning task.
Further, the processor 501 is also used to:
Before the concurrent allocation list is read, obtain the first- and second-level allocation list and the policy information basic configuration table, and screen out the policy information to be cleaned from the Oracle database according to these two tables;
Read the switch allocation list, and remove the invalid data from the policy information to be cleaned according to the switch allocation list.
It should be understood that in embodiments of the present invention, the processor 501 may be a central processing unit (Central Processing Unit, CPU) and/or a graphics processing unit (Graphic Processing Unit, GPU), and may further be combined with other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like.
The input device 502 may include a touchpad, a fingerprint sensor (used to collect the user's fingerprint information and fingerprint direction information), a microphone, a communication module (such as a Wi-Fi module or a 2G/3G/4G network module), physical buttons, and the like.
The output device 503 may include a display (LCD, etc.), a loudspeaker, and the like. The display may be used to show information entered by the user or information provided to the user. The display may include a display panel, which may optionally be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED) display, or the like. Further, the touchpad may cover the display. When the touchpad detects a touch operation on or near it, it passes the operation to the processor 501 to determine the type of touch event, and the processor 501 then provides the corresponding visual output on the display according to the type of touch event.
In a specific implementation, the processor 501, input device 502, output device 503, and memory 504 described in this embodiment can carry out the implementations described in the data cleaning method embodiments provided by the present invention, which are not repeated here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed method and terminal may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules and units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units and modules in the embodiments of the present invention may be integrated into one processing unit, each unit or module may exist physically on its own, or two or more units or modules may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

  1. A data cleaning method, characterized in that the method comprises:
    when a cleaning task for traditional payout data is received, inserting the cleaning task into a preset task execution table, and setting an execution time for the cleaning task;
    when the execution time arrives, obtaining a scheduling package through the Oracle database and, according to the scheduling package, performing data cleaning on policy information in a multi-concurrent manner;
    committing the cleaned policy information in batches and storing it in the Oracle database.
  2. The data cleaning method according to claim 1, characterized in that performing data cleaning on policy information in a multi-concurrent manner comprises:
    reading a concurrent allocation list, and starting several concurrent processes according to the concurrent allocation list;
    computing the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes, and assigning the policy information to the concurrent process corresponding to the remainder;
    reading the pending policy information using a cursor, caching the policy information read in a first preset array, and submitting it in batches to the assigned concurrent process, which performs data cleaning on the policy information according to a preset data cleaning algorithm.
  3. The data cleaning method according to claim 1, characterized in that committing the cleaned policy information in batches and storing it in the Oracle database comprises:
    reading the policy information cleaned by the concurrent processes into a second preset array, and committing the policy information in the second preset array to the Oracle database in batches using the commit command;
    storing the policy information of each batch in the corresponding result table in the Oracle database according to the execution time of the policy information in the batch and the process number of the corresponding concurrent process.
  4. The data cleaning method according to claim 2, characterized in that, after several concurrent processes are started according to the concurrent allocation list, the method further comprises:
    obtaining the status information of the cleaning task from a log table;
    if the status information of the cleaning task is "executed successfully", not executing the cleaning task again;
    if the status information of the cleaning task is "execution failed", deleting the data already processed by the concurrent processes, and re-executing the cleaning task.
  5. The data cleaning method according to claim 2, characterized in that, before the concurrent allocation list is read, the method further comprises:
    obtaining a first- and second-level allocation list and a policy information basic configuration table, and screening out the policy information to be cleaned from the Oracle database according to the first- and second-level allocation list and the policy information basic configuration table;
    reading a switch allocation list, and removing the invalid data from the policy information to be cleaned according to the switch allocation list.
  6. A terminal, characterized in that the terminal comprises:
    a task receiving module, configured to, when a cleaning task for traditional payout data is received, insert the cleaning task into a preset task execution table and set an execution time for the cleaning task;
    a concurrent cleaning module, configured to, when the execution time arrives, obtain a scheduling package through the Oracle database and, according to the scheduling package, perform data cleaning on policy information in a multi-concurrent manner;
    a storage module, configured to commit the cleaned policy information in batches and store it in the Oracle database.
  7. The terminal according to claim 6, characterized in that the concurrent cleaning module comprises:
    a start unit, configured to read a concurrent allocation list and start several concurrent processes according to the concurrent allocation list;
    an allocation unit, configured to compute the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes, and to assign the policy information to the concurrent process corresponding to the remainder;
    a cleaning unit, configured to read the pending policy information using a cursor, cache the policy information read in a first preset array, and submit it in batches to the assigned concurrent process, which performs data cleaning on the policy information according to a preset data cleaning algorithm.
  8. The terminal according to claim 6, characterized in that the storage module comprises:
    a submission unit, configured to read the policy information cleaned by the concurrent processes into a second preset array, and to commit the policy information in the second preset array to the Oracle database in batches using the commit command;
    a storage unit, configured to store the policy information of each batch in the corresponding result table in the Oracle database according to the execution time of the policy information in the batch and the process number of the corresponding concurrent process.
  9. The terminal according to claim 7, characterized in that the terminal further comprises:
    a state recognition module, configured to, after several concurrent processes are started according to the concurrent allocation list, obtain the status information of the cleaning task from a log table; if the status information of the cleaning task is "executed successfully", not execute the cleaning task again; if the status information of the cleaning task is "execution failed", delete the data already processed by the concurrent processes and re-execute the cleaning task.
  10. The terminal according to claim 7, characterized in that the terminal further comprises:
    a screening module, configured to, before the concurrent allocation list is read, obtain a first- and second-level allocation list and a policy information basic configuration table, and screen out the policy information to be cleaned from the Oracle database according to the first- and second-level allocation list and the policy information basic configuration table; and to read a switch allocation list and remove the invalid data from the policy information to be cleaned according to the switch allocation list.
CN201710221427.8A 2017-04-06 2017-04-06 Data cleaning method and terminal Active CN107688592B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710221427.8A CN107688592B (en) 2017-04-06 2017-04-06 Data cleaning method and terminal
PCT/CN2018/074858 WO2018184418A1 (en) 2017-04-06 2018-01-31 Data cleaning method, terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710221427.8A CN107688592B (en) 2017-04-06 2017-04-06 Data cleaning method and terminal

Publications (2)

Publication Number Publication Date
CN107688592A CN107688592A (en) 2018-02-13
CN107688592B CN107688592B (en) 2020-03-17

Family

ID=61152355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710221427.8A Active CN107688592B (en) 2017-04-06 2017-04-06 Data cleaning method and terminal

Country Status (2)

Country Link
CN (1) CN107688592B (en)
WO (1) WO2018184418A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597180A (en) * 2020-05-19 2020-08-28 山东汇贸电子口岸有限公司 Data cleaning method of OTRS system based on storage process
CN112925772A (en) * 2019-12-06 2021-06-08 北京沃东天骏信息技术有限公司 Data dynamic splitting method and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800043A (en) * 2021-02-05 2021-05-14 凯通科技股份有限公司 Internet of things terminal information extraction method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060700B1 (en) * 2008-12-08 2011-11-15 Nvidia Corporation System, method and frame buffer logic for evicting dirty data from a cache using counters and data types
CN103593352A (en) * 2012-08-15 2014-02-19 阿里巴巴集团控股有限公司 Method and device for cleaning mass data
CN106202346A (en) * 2016-06-29 2016-12-07 浙江理工大学 A kind of data load and clean engine, dispatch and storage system
CN106294492A (en) * 2015-06-08 2017-01-04 深圳中兴网信科技有限公司 Data cleaning method and cleaning engine
CN106294745A (en) * 2016-08-10 2017-01-04 东方网力科技股份有限公司 Big data cleaning method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514205A (en) * 2012-06-27 2014-01-15 中国电信股份有限公司 Mass data processing method and system
CN103942104A (en) * 2014-04-23 2014-07-23 北京金山网络科技有限公司 Task managing method and device
CN105205105B (en) * 2015-08-27 2019-04-16 浪潮集团有限公司 A kind of ETL process system and processing method based on storm
CN105787008A (en) * 2016-02-23 2016-07-20 浪潮通用软件有限公司 Data deduplication cleaning method for large data volume
CN106484915B (en) * 2016-11-03 2019-10-11 国家电网公司信息通信分公司 A kind of cleaning method and system of mass data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵鹏: ""基于软件总线模型的数据清洗系统的研究与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
WO2018184418A1 (en) 2018-10-11
CN107688592B (en) 2020-03-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant