CN107688592A - The method and terminal of data cleansing - Google Patents
Method and terminal for data cleansing
- Publication number
- CN107688592A CN201710221427.8A
- Authority
- CN
- China
- Prior art keywords
- policy information
- concurrent
- cleaning
- cleaning task
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention is applicable to the field of computer technology and provides a method and terminal for data cleansing. The method includes: when a cleaning task for traditional payout data is received, inserting the cleaning task into a preset task execution table and setting an execution time for the cleaning task; when the execution time is reached, obtaining a scheduling package through the Oracle database and, according to the scheduling package, cleansing the policy information in a multi-concurrent manner; and committing the cleaned policy information in batches and storing it in the Oracle database. The invention eliminates the conversion between file formats, solves the problems of a complex process, redundant steps, and low efficiency that arise when the prior art cleans traditional payout data, effectively improves the efficiency of data cleansing, and reduces its overall time consumption.
Description
Technical field
The invention belongs to the field of computer technology, and in particular relates to a method and terminal for data cleansing.
Background
When cleaning traditional payout data, the prior art requires the user to download the data to be cleaned from the database in advance, generate a txt file from that data, and then run the calculation in the prophet software. The files generated by the prophet software must be converted back into txt files before they can be uploaded to the database. Moreover, the prophet software runs slowly, and the cleaning step alone often takes more than 12 hours. The cleaning process for traditional payout data is therefore complex, redundant, and very inefficient.
Therefore, a new technical solution is needed to solve the above technical problem.
Summary of the invention
In view of this, the embodiments of the present invention provide a method and terminal for data cleansing, to solve the problems of a complex process, redundant steps, and low efficiency that arise when the prior art cleans traditional payout data.
In a first aspect, a method of data cleansing is provided, the method including:
when a cleaning task for traditional payout data is received, inserting the cleaning task into a preset task execution table, and setting an execution time for the cleaning task;
when the execution time is reached, obtaining a scheduling package through the Oracle database and, according to the scheduling package, cleansing the policy information in a multi-concurrent manner;
committing the cleaned policy information in batches and storing it in the Oracle database.
Further, cleansing the policy information in a multi-concurrent manner includes:
reading a concurrency configuration table, and starting several concurrent processes according to the concurrency configuration table;
calculating the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes, and assigning the policy information to the concurrent process corresponding to the remainder;
reading the pending policy information with a cursor, caching the policy information read into a first preset group, and submitting it in batches to the assigned concurrent process, which cleanses the policy information according to a preset data cleansing algorithm.
Further, committing the cleaned policy information in batches and storing it in the Oracle database includes:
reading the policy information cleaned by the concurrent processes into a second preset group, and using the commit command to commit the policy information in the second preset group to the Oracle database in batches;
storing the policy information of each batch in the corresponding result table in the Oracle database according to the execution time of the policy information in that batch and the process number of the corresponding concurrent process.
Further, after several concurrent processes are started according to the concurrency configuration table, the method also includes:
obtaining the status information of the cleaning task from a log table;
if the status information of the cleaning task indicates successful execution, not executing this cleaning task again;
if the status information of the cleaning task indicates failed execution, deleting the data already processed by the concurrent processes, and re-executing this cleaning task.
Further, before the concurrency configuration table is read, the method also includes:
obtaining a first/second-level institution configuration table and a policy information basic configuration table, and filtering the policy information to be cleaned out of the Oracle database according to these two tables;
reading a switch configuration table, and removing the invalid data from the policy information to be cleaned according to the switch configuration table.
In a second aspect, a terminal is provided, the terminal including:
a task receiving module, configured to, when a cleaning task for traditional payout data is received, insert the cleaning task into a preset task execution table and set an execution time for the cleaning task;
a concurrent cleaning module, configured to, when the execution time is reached, obtain a scheduling package through the Oracle database and, according to the scheduling package, cleanse the policy information in a multi-concurrent manner;
a storage module, configured to commit the cleaned policy information in batches and store it in the Oracle database.
Compared with the prior art, the embodiment of the present invention, when a cleaning task for traditional payout data is received, inserts the cleaning task into a preset task execution table and sets an execution time for it; when the execution time is reached, it obtains a scheduling package through the Oracle database and, according to the scheduling package, cleanses the policy information in a multi-concurrent manner; finally, it commits the cleaned policy information in batches and stores it in the Oracle database. This eliminates the conversion between file formats, effectively improves the efficiency of data cleansing, and reduces its overall time consumption.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the data cleansing method provided by the first embodiment of the present invention;
Fig. 2 is a flowchart of the specific implementation of step S102 in the data cleansing method provided by the first embodiment of the present invention;
Fig. 3 is a flowchart of the specific implementation of step S103 in the data cleansing method provided by the first embodiment of the present invention;
Fig. 4 is a schematic block diagram of the terminal provided by the second embodiment of the present invention;
Fig. 5 is a schematic block diagram of the terminal provided by the third embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
In the embodiment of the present invention, when a cleaning task for traditional payout data is received, the cleaning task is inserted into a preset task execution table and an execution time is set for it; when the execution time is reached, a scheduling package is obtained through the Oracle database and, according to the scheduling package, the policy information is cleansed in a multi-concurrent manner; finally, the cleaned policy information is committed in batches and stored in the Oracle database. This eliminates the conversion between file formats, effectively improves the efficiency of data cleansing, and reduces its overall time consumption. The embodiments of the present invention also provide a corresponding terminal. Each is described in detail below.
Fig. 1 shows the implementation flow of the data cleansing method provided by the first embodiment of the present invention.
In the embodiment of the present invention, the data cleansing method is applied in a terminal, which includes but is not limited to a computer, a server, and the like. Referring to Fig. 1, the data cleansing method includes:
In step S101, when a cleaning task for traditional payout data is received, the cleaning task is inserted into a preset task execution table, and an execution time is set for the cleaning task.
In the embodiment of the present invention, the terminal obtains the cleaning task for traditional payout data from the user's trigger operation on the page, inserts the cleaning task into the preset task execution table, and sets the execution time of the task according to the user's operation. By way of example, the task execution table may be the pala_batch_plan table, which includes a scheduled start date field used to control the execution time. While inserting the cleaning task into the pala_batch_plan table, the embodiment of the present invention modifies the value of the scheduled start date field to set the execution time of the cleaning task. Once the execution time is set, the cleaning task is allowed to start only when the current time is greater than or equal to the execution time. This makes it convenient for the user to schedule cleaning tasks and helps make reasonable use of CPU resources.
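The gating rule of step S101 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the in-memory list stands in for the pala_batch_plan table, the key name scheduled_start_date is an assumed spelling of the scheduled start date field, and the function names are hypothetical.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for the pala_batch_plan task execution table.
task_execution_table = []

def insert_cleaning_task(task_id, scheduled_start_date):
    """Insert a cleaning task together with its execution time."""
    task_execution_table.append(
        {"task_id": task_id, "scheduled_start_date": scheduled_start_date}
    )

def due_tasks(now):
    """Return the tasks allowed to start: current time >= execution time."""
    return [t for t in task_execution_table if now >= t["scheduled_start_date"]]

insert_cleaning_task("clean-001", datetime(2017, 4, 6, 2, 0))
print(due_tasks(datetime(2017, 4, 6, 1, 59)))                         # one minute early
print([t["task_id"] for t in due_tasks(datetime(2017, 4, 6, 2, 0))])  # at the execution time
```

Storing the execution time next to the task means the scheduler only needs one comparison per task to decide whether it may start.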
In step S102, when the execution time is reached, a scheduling package is obtained through the Oracle database and, according to the scheduling package, the policy information is cleansed in a multi-concurrent manner.
When the execution time is reached, the embodiment of the present invention obtains the scheduling package through the Oracle database. The scheduling package contains information related to the execution of the cleaning task, such as parameter information and procedure information. Several concurrent processes are then started according to the information in the scheduling package, and these concurrent processes cleanse the policy information in the traditional payout data.
By way of example, as described above, the Oracle database uses the execution time in the scheduled start date field: when the execution time is reached, it automatically obtains the scheduling package and starts multiple concurrent processes, which execute the cleaning task automatically.
Compared with the prior art, the embodiment of the present invention optimizes the cleaning process: data cleansing is carried out in a multi-concurrent manner, and each concurrent process automatically selects the policy information it is to process, which ensures that no policy information is processed more than once.
Optionally, Fig. 2 shows the specific implementation flow of cleansing the policy information in a multi-concurrent manner, as provided by the first embodiment of the present invention. Referring to Fig. 2, cleansing the policy information in a multi-concurrent manner includes:
In step S201, the concurrency configuration table is read, and several concurrent processes are started according to the concurrency configuration table.
In the embodiment of the present invention, the concurrency configuration table is used by business personnel to configure the number of concurrent processes. The terminal starts the corresponding number of concurrent processes according to the concurrency configuration table, in preparation for data cleansing.
Optionally, the preparation before the concurrency configuration table is read also includes screening the policy information, and the method may also include:
obtaining a first/second-level institution configuration table and a policy information basic configuration table, and filtering the policy information to be cleaned out of the Oracle database according to these two tables;
reading a switch configuration table, and removing the invalid data from the policy information to be cleaned according to the switch configuration table.
In the embodiment of the present invention, different policy information may be produced by different marketing institutions, which are divided into first-level institutions and second-level institutions according to administrative division. The first/second-level institution configuration table is used to distinguish which marketing institution a piece of policy information belongs to. The policy information basic configuration table records which insurance types in the policy information need data cleansing. The embodiment of the present invention combines the two tables to perform a preliminary screening of the policy information to be cleaned, so as to reduce cleansing operations on invalid policy information. The two tables also serve as the basis for the subsequent data cleansing.
The switch configuration table is configured dynamically by business personnel, who use it to mark the invalid data in each type of policy information. According to the switch configuration table, the terminal filters the invalid data that business personnel are not concerned with out of the policy information and inserts the result into a data statistics table, so that business personnel can carry out the next operation. For example, suppose a certain type of policy information includes a policyholder name field, an age field, a sex field, a phone field, and an education field, and the switch configuration table provides a switch option for each of these fields. If business personnel consider the education field irrelevant, they can turn off the option for that field in the switch configuration table; the terminal then reads only the policyholder name, age, sex, and phone fields of that type of policy information.
Here, the embodiment of the present invention filters out the pending policy information based on the first/second-level institution configuration table, the policy information basic configuration table, and the switch configuration table. Policy information that does not need cleaning and irrelevant data are excluded in advance, during the preparation stage, which reduces the cleaning workload and helps further improve the efficiency of data cleansing.
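The screening described above can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the three configuration tables are modelled as a set and a dictionary, and the record keys (institution, insurance_type, education, and so on) are hypothetical names, not the actual column names.

```python
def screen_policies(policies, institutions, types_to_clean, switches):
    """Select policies to clean and drop fields that are switched off.

    institutions   -- institutions listed in the institution configuration table
    types_to_clean -- insurance types flagged in the basic configuration table
    switches       -- field name -> True (keep) / False (drop), as in the
                      switch configuration table; unlisted fields are kept
    """
    selected = []
    for policy in policies:
        if policy["institution"] not in institutions:
            continue  # policy belongs to an institution that is not configured
        if policy["insurance_type"] not in types_to_clean:
            continue  # this insurance type does not need cleansing
        selected.append({k: v for k, v in policy.items() if switches.get(k, True)})
    return selected

policies = [
    {"policy_no": "201702008", "institution": "A", "insurance_type": "life",
     "name": "Li", "age": 40, "education": "BSc"},
    {"policy_no": "201702009", "institution": "B", "insurance_type": "life",
     "name": "Wu", "age": 35, "education": "MSc"},
]
result = screen_policies(policies, {"A"}, {"life"}, {"education": False})
print(result)
```

Filtering before the concurrent processes start keeps irrelevant rows and fields out of every later step, which is the workload reduction the paragraph above describes.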
Optionally, after several concurrent processes are started according to the concurrency configuration table, the preparation may also include a status check, and the method also includes:
obtaining the status information of the cleaning task from a log table;
if the status information of the cleaning task indicates successful execution, not executing this cleaning task again;
if the status information of the cleaning task indicates failed execution, deleting the data already processed by the concurrent processes, and re-executing this cleaning task.
Here, the embodiment of the present invention checks the status information of the cleaning task corresponding to the current execution time. If the cleaning task has executed successfully, it is not executed again. If the cleaning task has failed, the data already processed by the concurrent processes is deleted and the flow jumps to step S202 to re-execute the cleaning task. If the cleaning task has not yet been executed, the flow jumps to step S202 to execute it. This avoids repeating the cleaning task and helps reduce the time and CPU resources consumed.
In step S202, the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes is calculated, and the policy information is assigned to the concurrent process corresponding to the remainder.
The embodiment of the present invention assigns a process number to each concurrent process. After the pending policy information has been filtered out, the embodiment of the present invention assigns each piece of policy information to a processing process according to its policy number. First, the pending policy information and its policy number are obtained. Then the remainder of the last digit of the policy number divided by the total number of concurrent processes is calculated. Finally, the policy information is assigned to the concurrent process whose process number equals the remainder; that concurrent process is the processing process for the policy information. By way of example, suppose the pending policy number is 201702008 and 3 concurrent processes have been started, numbered 0, 1, and 2. The remainder of the last digit 8 divided by the total of 3 is 2, so the policy information with policy number 201702008 is assigned to the concurrent process numbered 2. Processing processes are assigned to all pending policy information in the same way. This ensures that every piece of policy information has exactly one processing process, avoids cleaning the same policy information repeatedly, and helps improve the efficiency of data cleansing.
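The assignment rule of step S202 amounts to a single modulo operation. A minimal Python sketch, with a hypothetical function name and the policy number treated as a string of digits:

```python
def assign_process(policy_number: str, process_count: int) -> int:
    """Return the number of the concurrent process that handles this policy.

    The rule from step S202: last digit of the policy number modulo the
    total number of concurrent processes.
    """
    last_digit = int(policy_number[-1])
    return last_digit % process_count

# Worked example from the description: 8 mod 3 = 2.
print(assign_process("201702008", 3))  # 2
```

Because the rule is deterministic, each policy maps to exactly one process. One consequence of using only the last digit is that at most ten processes can ever receive work, and with 3 processes the digits 0, 3, 6, and 9 all map to process 0 while processes 1 and 2 each receive three digits.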
In step S203, the pending policy information is read with a cursor, the policy information read is cached into the first preset group and submitted in batches to the assigned concurrent process, and the concurrent process cleanses the policy information according to a preset data cleansing algorithm.
Here, the embodiment of the present invention uses a cursor to read the pending policy information from the Oracle database. Each record read is first cached in the first preset group. When the number of records read reaches a specified threshold, the records read are submitted together, as one batch, to the corresponding concurrent process, which cleanses them according to the preset data cleansing algorithm. Optionally, the specified threshold may be 5000 records per batch. The specific code is as follows:
FETCH c_pol_ind BULK COLLECT INTO v_pol_ind LIMIT 5000;
After 5000 records have been read, they are submitted together from the first preset group to the concurrent process for cleaning, which helps reduce I/O overhead.
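The batching behaviour of the FETCH ... BULK COLLECT ... LIMIT statement can be mimicked in Python for illustration. This sketch is not the PL/SQL used by the embodiment: the generator name is hypothetical, and a plain range of integers stands in for the rows returned by the cursor.

```python
from itertools import islice

def fetch_batches(cursor_rows, limit=5000):
    """Yield lists of at most `limit` rows, like repeated bulk fetches."""
    iterator = iter(cursor_rows)
    while True:
        batch = list(islice(iterator, limit))  # plays the role of the first preset group
        if not batch:
            break                              # cursor exhausted
        yield batch

# 12000 rows are handed over in three batches.
sizes = [len(batch) for batch in fetch_batches(range(12000))]
print(sizes)  # [5000, 5000, 2000]
```

Handing rows over in batches rather than one at a time means the number of hand-offs grows with the number of batches, not the number of rows.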
In step S103, the cleaned policy information is committed in batches and stored in the Oracle database.
Optionally, Fig. 3 shows the specific implementation flow of step S103 in the data cleansing method provided by the first embodiment of the present invention. Referring to Fig. 3, step S103 includes:
In step S301, the policy information cleaned by the concurrent processes is read into the second preset group, and the commit command is used to commit the policy information in the second preset group to the Oracle database in batches.
After a concurrent process has finished cleaning the records, the embodiment of the present invention reads the cleaned records from the concurrent process and caches them in the second preset group. Similarly, when the number of records read reaches the specified threshold, the records in the second preset group are committed, as one batch, to the Oracle database for storage. Optionally, the specified threshold may be 5000 records per batch. After 5000 records have been read in a loop, they are written together from the second preset group to the Oracle database for storage, which further reduces I/O overhead.
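The batch commit of step S301 can be sketched in Python. As a loudly stated assumption, the standard-library sqlite3 module stands in for the Oracle database so that the example is self-contained, and the table name, column names, and function name are hypothetical.

```python
import sqlite3

def store_in_batches(conn, rows, batch_size=5000):
    """Insert rows and commit once per batch; return the number of commits."""
    commits = 0
    buffer = []  # plays the role of the second preset group
    for row in rows:
        buffer.append(row)
        if len(buffer) == batch_size:
            conn.executemany("INSERT INTO result (policy_no, value) VALUES (?, ?)", buffer)
            conn.commit()
            commits += 1
            buffer.clear()
    if buffer:  # final partial batch
        conn.executemany("INSERT INTO result (policy_no, value) VALUES (?, ?)", buffer)
        conn.commit()
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (policy_no TEXT, value TEXT)")
commits = store_in_batches(conn, ((str(i), "cleaned") for i in range(12000)))
stored = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(commits, stored)  # 3 12000
```

Committing once per batch instead of once per row is the I/O saving the paragraph above refers to.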
In step S302, the policy information of each batch is stored in the corresponding result table in the Oracle database according to the execution time of the policy information in that batch and the process number of the corresponding concurrent process.
In the embodiment of the present invention, the Oracle database includes first-level partitions divided by execution time and second-level partitions divided by process number. The result table includes two fields that determine the partition to which a record belongs, the proc date field and the Order num field: the proc date field represents the execution time, and the Order num field represents the process number. Every record read from a concurrent process carries both the proc date and the Order num attributes. When a batch of records is committed to the Oracle database with the commit command, each record can be stored accurately in the corresponding result table according to its proc date and Order num fields. By storing the cleaned policy information in partitions, the embodiment of the present invention makes it easy to delete policy information from earlier execution times and also improves the efficiency of querying the policy information of the current cleaning run.
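The two-level partitioning can be pictured with a small Python sketch that groups records by the pair (execution time, process number). The dictionary is only an in-memory illustration of the partition layout, not Oracle partitioning itself, and the key names proc_date and order_num are assumed spellings of the proc date and Order num fields.

```python
from collections import defaultdict

def route_to_partitions(records):
    """Group records by (proc_date, order_num), one bucket per sub-partition."""
    partitions = defaultdict(list)
    for record in records:
        partitions[(record["proc_date"], record["order_num"])].append(record)
    return partitions

records = [
    {"policy_no": "201702008", "proc_date": "2017-04-06", "order_num": 2},
    {"policy_no": "201702005", "proc_date": "2017-04-06", "order_num": 2},
    {"policy_no": "201702007", "proc_date": "2017-04-06", "order_num": 1},
]
partitions = route_to_partitions(records)
print(sorted(partitions))  # [('2017-04-06', 1), ('2017-04-06', 2)]

# Dropping one execution date removes all of its sub-partitions at once.
remaining = {k: v for k, v in partitions.items() if k[0] != "2017-04-06"}
```

Keying storage on the execution time first is what makes deleting an old run cheap: everything belonging to that run sits under one first-level key.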
It should be understood that, in the above embodiments, the sequence numbers of the steps do not imply an order of execution. The order in which the steps are executed should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
Fig. 4 shows a schematic block diagram of the terminal provided by the second embodiment of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown.
In the embodiment of the present invention, the terminal is used to implement the data cleansing method described in any of the embodiments of Fig. 1 to Fig. 3 above, and may be a software unit, a hardware unit, or a unit combining software and hardware. The terminal includes but is not limited to a computer, a server, and the like.
Referring to Fig. 4, the terminal includes:
a task receiving module 41, configured to, when a cleaning task for traditional payout data is received, insert the cleaning task into a preset task execution table and set an execution time for the cleaning task;
a concurrent cleaning module 42, configured to, when the execution time is reached, obtain a scheduling package through the Oracle database and, according to the scheduling package, cleanse the policy information in a multi-concurrent manner;
a storage module 43, configured to commit the cleaned policy information in batches and store it in the Oracle database.
In the embodiment of the present invention, the task receiving module 41 obtains the cleaning task for traditional payout data from the user's trigger operation on the page, inserts the cleaning task into the preset task execution table, and sets the execution time of the task. By way of example, the task execution table may be the pala_batch_plan table, which includes a scheduled start date field used to control the execution time. While inserting the cleaning task into the pala_batch_plan table, the embodiment of the present invention modifies the value of the scheduled start date field according to the user's operation to set the execution time of the cleaning task. Once the execution time is set, the cleaning task is allowed to start only when the current time is greater than or equal to the execution time. This makes it convenient for the user to schedule cleaning tasks and helps make reasonable use of CPU resources.
When the execution time is reached, the concurrent cleaning module 42 obtains the scheduling package through the Oracle database. The scheduling package contains information related to the execution of the cleaning task, such as parameter information and procedure information. Several concurrent processes are then started according to the information in the scheduling package, and these concurrent processes cleanse the policy information.
By way of example, as described above, the concurrent cleaning module 42 can use the execution time in the scheduled start date field: when the execution time is reached, it automatically obtains the scheduling package and starts several concurrent processes, which execute the cleaning task automatically.
Compared with the prior art, the embodiment of the present invention optimizes the cleaning process: data cleansing is carried out in a multi-concurrent manner, and each concurrent process automatically selects the policy information it is to process, which ensures that no policy information is processed more than once. The concurrent cleaning module 42 also includes:
a start unit 421, configured to read the concurrency configuration table and start several concurrent processes according to the concurrency configuration table;
an allocation unit 422, configured to calculate the remainder of the last digit of the policy number corresponding to the policy information divided by the total number of concurrent processes, and to assign the policy information to the concurrent process corresponding to the remainder;
a cleaning unit 423, configured to read the pending policy information with a cursor, cache the policy information read into the first preset group, and submit it in batches to the assigned concurrent process, which cleanses the policy information according to a preset data cleansing algorithm.
In the embodiment of the present invention, the concurrency configuration table is used by business personnel to configure the number of concurrent processes. The terminal starts the corresponding number of concurrent processes according to the concurrency configuration table, in preparation for data cleansing.
Alternatively, warming-up exercise of the embodiment of the present invention before concurrent allocation list is read can also include to policy information
Screening, the terminal also includes:
Screening module 44, for before concurrent allocation list is read, one binary allocation list of acquisition and policy information to be matched somebody with somebody substantially
Table is put, is filtered out according to the binary allocation list and policy information basic configuration table from the oracle database to be cleaned
Policy information;Read switch allocation list, it is invalid in the policy information to be cleaned to be removed according to the switchgear distribution table
Data.
In the embodiment of the present invention, different policy informations may be produced by different marketing organizations, in the embodiment of the present invention
The marketing organization is divided for first-level machine structure and secondary facility according to administrative division.The one binary allocation list, which is used to distinguish, to be protected
Which marketing organization is single information belong to.The declaration form essential information allocation list then needs for recording which of policy information insurance kind
Carry out data cleansing.The embodiment of the present invention is based on the binary allocation list and policy information basic configuration table preliminary screening goes out
Policy information to be cleaned, advantageously reduce the data cleansing operation to invalid policy information.The one binary allocation list and guarantor
Single information basic configuration table also serves as the basis of follow-up data cleaning.
The switch configuration table is dynamically configured by business personnel, who use it to specify which data is invalid. According to this table, the terminal filters the invalid data that business personnel are not concerned with out of the policy information and inserts it into a data statistics table, so that business personnel can carry out the next step. By screening the pending policy information against the first- and second-level organization configuration table, the policy information basic configuration table and the switch configuration table, the embodiment excludes, during the cleaning preparation stage, policy information that does not need to be cleaned. This reduces the cleaning workload and further improves the efficiency of data cleansing.
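The two-stage screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Policy` fields, the function name `screen_policies`, and the representation of the three configuration tables as in-memory sets and a dictionary are all hypothetical, since the source does not give the table layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    policy_no: str
    org_code: str      # marketing organization that produced the policy (assumed field)
    product_code: str  # type of insurance (assumed field)
    status: str        # business status checked against the switch table (assumed field)


def screen_policies(policies, org_config, product_config, switch_config):
    """Return (to_clean, excluded) after the two screening stages.

    org_config:     set of organization codes taken from the organization table
    product_config: set of insurance types that require cleansing
    switch_config:  dict mapping a status to True when business staff
                    have marked that status as invalid for this run
    """
    to_clean, excluded = [], []
    for p in policies:
        # Stage 1: preliminary screening by organization and insurance type.
        if p.org_code not in org_config or p.product_code not in product_config:
            continue
        # Stage 2: the switch table separates out invalid data, which is kept
        # aside (the source inserts it into a statistics table).
        if switch_config.get(p.status, False):
            excluded.append(p)
        else:
            to_clean.append(p)
    return to_clean, excluded
```

Only the records in `to_clean` would go on to the concurrent cleaning stage; `excluded` stands in for the rows recorded for business personnel.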
Optionally, the preparatory actions performed after several concurrent processes are started according to the concurrency configuration table may also include a status check, in which case the terminal further includes:
State recognition module 45, configured to: after several concurrent processes are started according to the concurrency configuration table, obtain the status information of the cleaning task from a log table; if the status information indicates that execution succeeded, not execute this cleaning task again; and if the status information indicates that execution failed, delete the data already processed in the concurrent processes and re-execute this cleaning task.
Here, the embodiment checks the status information of the cleaning task corresponding to the current execution time. If the cleaning task has executed successfully, it is not executed again. If the cleaning task failed, the data that the concurrent threads have already run is deleted, and control jumps to allocation unit 422 to re-execute the cleaning task. If the cleaning task has not been executed, control jumps to allocation unit 422 to execute it. Repeated execution of the cleaning task is thereby avoided, which reduces the time and CPU resources consumed.
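The three-way status check above amounts to an idempotent task runner. The following Python sketch shows one way it could look; the `TaskStatus` values, the dictionaries standing in for the log table and the result data, and the function names are assumptions made for illustration, not details from the source.

```python
from enum import Enum


class TaskStatus(Enum):
    NOT_RUN = "not_run"
    SUCCESS = "success"
    FAILED = "failed"


def decide_action(status):
    """Map the logged status to the action described in the text."""
    if status is TaskStatus.SUCCESS:
        return "skip"
    if status is TaskStatus.FAILED:
        return "cleanup_and_rerun"
    return "run"


def run_cleaning_task(task_id, log_table, results, clean):
    """Run `clean` at most once per successful task_id.

    log_table: dict task_id -> TaskStatus (stand-in for the log table)
    results:   dict task_id -> cleaned data (stand-in for processed data)
    clean:     zero-argument callable that performs the cleaning
    """
    action = decide_action(log_table.get(task_id, TaskStatus.NOT_RUN))
    if action == "skip":
        return action
    if action == "cleanup_and_rerun":
        # Delete whatever the failed run had already written.
        results.pop(task_id, None)
    try:
        results[task_id] = clean()
        log_table[task_id] = TaskStatus.SUCCESS
    except Exception:
        log_table[task_id] = TaskStatus.FAILED
        raise
    return action
```

Recording the outcome in the same log that is consulted at the start is what keeps a second invocation for the same execution time from repeating the work.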
For the pending policy information, allocation unit 422 assigns a processing process to each piece of policy information according to its policy number. First, allocation unit 422 obtains the pending policy information and the corresponding policy number. It then computes the remainder of dividing the last digit of the policy number by the total number of concurrent processes. Finally, it assigns the policy information to the concurrent process whose process number equals that remainder; this concurrent process is the processing process for the policy information. For example, if the current pending policy number is 201702008 and three concurrent processes have been started, numbered 0, 1 and 2, then the remainder of the last digit 8 divided by the process total 3 is 2, so the policy information with policy number 201702008 is assigned to the concurrent process numbered 2. Processing processes are assigned to all pending policy information in the same way. Each piece of policy information thus has a single corresponding processing process, which prevents policy information from being cleaned repeatedly and helps improve the efficiency of data cleansing.
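The remainder rule in the preceding paragraph is small enough to state directly in code. The Python sketch below reproduces the worked example from the text; the function name `assign_process` and the choice to pass the policy number as a string are assumptions for illustration.

```python
def assign_process(policy_no: str, process_count: int) -> int:
    """Return the number of the concurrent process that handles this policy.

    The process number is the last digit of the policy number modulo the
    total number of concurrent processes, as described in the text.
    """
    if process_count <= 0:
        raise ValueError("process_count must be positive")
    last_digit = int(policy_no[-1])
    return last_digit % process_count


# Worked example from the text: policy 201702008 with 3 processes (0, 1, 2).
example = assign_process("201702008", 3)  # last digit 8, and 8 % 3 == 2
```

Because the mapping is a pure function of the policy number, every policy lands in one process only. One consequence worth noting is that a single digit takes only ten values, so with this rule no more than ten processes can receive work, and the load is not perfectly even (with three processes, digits 0, 3, 6 and 9 all map to process 0).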
After process allocation is complete, cleaning unit 423 uses a cursor to read the pending policy information from the Oracle database. Each record read is first cached in a first preset array. When the number of records read reaches a specified threshold, the records read so far are submitted together, as one batch, to the corresponding concurrent process, which cleans them according to a preset data cleansing algorithm. Optionally, the specified threshold may be 5000 records per batch. The specific code is as follows:
FETCH c_pol_ind BULK COLLECT INTO v_pol_ind LIMIT 5000;
After 5000 records have been read, they are submitted together from the first preset array to the concurrent process for cleaning, which reduces I/O overhead.
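The buffering behaviour of the `FETCH ... BULK COLLECT ... LIMIT 5000` statement can be mirrored outside PL/SQL. The Python sketch below is an analogy, not the patent's code: `read_in_batches` is a hypothetical helper, any iterable stands in for the cursor, and handing over a final partial batch is an assumption, since the source only describes full batches.

```python
from itertools import islice


def read_in_batches(cursor_rows, batch_size=5000):
    """Yield lists of at most `batch_size` rows from an iterable of rows.

    Each yielded list plays the role of the first preset array once it has
    filled up to the threshold; the last list may be shorter.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    rows = iter(cursor_rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch
```

A caller would loop over `read_in_batches(rows)` and pass each batch to the assigned concurrent process, so the handover happens once per 5000 records rather than once per record.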
Further, the memory module 43 also includes:
Submission unit 431, configured to read the policy information cleaned by the concurrent processes into a second preset array, and to submit the policy information in the second preset array to the Oracle database in batches using the commit command;
Storage unit 432, configured to store the policy information in each batch into the corresponding result table in the Oracle database according to the execution time of the policy information in that batch and the process number of the corresponding concurrent process.
After the concurrent processes finish cleaning the data, submission unit 431 reads the cleaned data from the concurrent processes and caches it in the second preset array. Likewise, when the number of records read reaches the specified threshold, the records read so far are submitted as one batch to the Oracle database for storage. Optionally, the specified threshold may be 5000 records per batch: once 5000 records have been read in the loop, they are written together from the second preset array to the Oracle database for storage, which reduces I/O overhead.
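The commit-per-batch pattern described above can be tried out with any transactional database. The Python sketch below uses the standard-library `sqlite3` module purely as a stand-in for Oracle; the table name `result`, its columns, and the helper names are assumptions for illustration, and flushing a final partial batch is likewise an assumption.

```python
import sqlite3


def create_result_table(conn):
    """Create a hypothetical result table for the sketch."""
    conn.execute(
        "CREATE TABLE result (policy_no TEXT, proc_date TEXT, order_num INTEGER)"
    )


def store_in_batches(conn, rows, batch_size=5000):
    """Insert cleaned rows, committing once per batch; return the commit count."""
    insert_sql = "INSERT INTO result (policy_no, proc_date, order_num) VALUES (?, ?, ?)"
    commits = 0
    buffer = []  # plays the role of the second preset array

    def flush():
        nonlocal commits
        conn.executemany(insert_sql, buffer)
        conn.commit()
        commits += 1
        buffer.clear()

    for row in rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            flush()
    if buffer:
        flush()  # write any rows left over after the last full batch
    return commits
```

Committing once per batch instead of once per row is what cuts down the number of round trips to the database.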
In the embodiment of the present invention, the Oracle database includes first-level partitions divided by execution time and second-level partitions divided by process number. The result table includes two fields used to determine the partition to which a record belongs, the proc date field and the Order num field; the proc date field represents the execution time, and the Order num field represents the process number. Every record read from a concurrent process carries both the proc date and Order num attributes. When a batch of records is submitted to the Oracle database by the commit command, storage unit 432 can store each record accurately in the corresponding result table according to its proc date and Order num fields. Storing the cleaned policy information in partitions both makes it convenient to delete policy information for historical periods and improves the efficiency of querying the policy information cleaned in the current run.
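To make the two-level partitioning concrete, the following Python sketch groups records by the pair (execution time, process number) and shows why dropping historical data becomes cheap. It models partitions as dictionary keys and is not Oracle partitioning itself; the key names `proc_date` and `order_num` follow the two fields named in the text, while the ISO date strings and function names are assumptions.

```python
from collections import defaultdict


def partition_key(row):
    """First level: execution date; second level: process number."""
    return (row["proc_date"], row["order_num"])


def route_to_partitions(rows):
    """Group rows by their (proc_date, order_num) partition key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return dict(partitions)


def drop_partitions_before(partitions, cutoff_date):
    """Discard whole partitions older than cutoff_date (ISO 'YYYY-MM-DD').

    ISO-formatted date strings compare correctly as plain strings.
    """
    return {key: rows for key, rows in partitions.items() if key[0] >= cutoff_date}
```

Historical data is removed by discarding whole partitions rather than scanning rows, and a query for the current run only has to look at the partitions whose first-level key matches the current execution date.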
It should be noted that the terminal in the embodiment of the present invention can be used to implement all the technical solutions in the foregoing method embodiments. The functions of its functional modules can be implemented according to the methods in those embodiments; for the specific implementation process, refer to the related descriptions above, which are not repeated here.
In summary, in the embodiment of the present invention, when a cleaning task for traditional payout data is received, the cleaning task is inserted into a preset task execution table and an execution time is set for it. When the execution time arrives, a scheduling package is obtained through the Oracle database and, according to that package, data cleansing is performed on the policy information in a multi-concurrent manner. Finally, the cleaned policy information is submitted and stored in the Oracle database in batches. This eliminates the step of converting between file formats, effectively improves the efficiency of data cleansing, and reduces its overall time consumption.
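The first two steps of this summary, registering a task with an execution time and running it once that time arrives, can be sketched as follows. This Python illustration is hedged throughout: a dictionary stands in for the task execution table, the `state` values and function names are hypothetical, and the Oracle scheduling package itself is not modelled.

```python
from datetime import datetime


def register_task(task_table, task_id, run_at):
    """Insert a cleaning task with its execution time into the task table."""
    task_table[task_id] = {"run_at": run_at, "state": "pending"}


def due_tasks(task_table, now):
    """Return the ids of pending tasks whose execution time has arrived."""
    return sorted(
        task_id
        for task_id, entry in task_table.items()
        if entry["state"] == "pending" and entry["run_at"] <= now
    )


def run_due(task_table, now, run_cleaning):
    """Run every due task through `run_cleaning` and mark it done."""
    executed = []
    for task_id in due_tasks(task_table, now):
        run_cleaning(task_id)
        task_table[task_id]["state"] = "done"
        executed.append(task_id)
    return executed
```

In the patent's setting the polling and dispatch would be handled inside the database by the scheduling package; the sketch only shows the bookkeeping that the task execution table implies.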
To better implement the foregoing method embodiments, the present invention also provides a related terminal for carrying them out. Fig. 5 is a schematic block diagram of the terminal provided by the third embodiment of the present invention. The terminal shown may include: one or more processors 501 (only one is shown in the figure); one or more input devices 502 (only one is shown in the figure); one or more output devices 503 (only one is shown in the figure); and a memory 504. The processor 501, input device 502, output device 503 and memory 504 are connected by a bus 506. The input device 502 is used to receive the cleaning task for traditional payout data; the memory 504 is used to store program code; and the processor 501 is used to execute the program code stored in the memory to perform the following operations:
When a cleaning task for traditional payout data is received, insert the cleaning task into a preset task execution table and set the execution time corresponding to the cleaning task; when the execution time arrives, obtain a scheduling package through the Oracle database and, according to the scheduling package, perform data cleansing on the policy information in a multi-concurrent manner; and submit and store the cleaned policy information in the Oracle database in batches.
Further, performing data cleansing on the policy information in a multi-concurrent manner includes:
Reading the concurrency configuration table, and starting several concurrent processes according to the concurrency configuration table;
Computing the remainder of dividing the last digit of the policy number corresponding to the policy information by the total number of concurrent processes, and assigning the policy information to the concurrent process corresponding to that remainder;
Reading the pending policy information using a cursor, caching the policy information read in a first preset array, and submitting it in batches to the assigned concurrent process, which performs data cleansing on the policy information according to a preset data cleansing algorithm.
Further, submitting and storing the cleaned policy information in the Oracle database in batches includes:
Reading the policy information cleaned by the concurrent processes into a second preset array, and submitting the policy information in the second preset array to the Oracle database in batches using the commit command;
Storing the policy information in each batch into the corresponding result table in the Oracle database according to the execution time of the policy information in that batch and the process number of the corresponding concurrent process.
Further, the processor 501 is also configured to:
After several concurrent processes are started according to the concurrency configuration table, obtain the status information of the cleaning task from a log table;
If the status information of the cleaning task indicates that execution succeeded, not execute this cleaning task again;
If the status information of the cleaning task indicates that execution failed, delete the data already processed in the concurrent processes and re-execute this cleaning task.
Further, the processor 501 is also configured to:
Before the concurrency configuration table is read, obtain a first- and second-level organization configuration table and a policy information basic configuration table, and screen out the policy information to be cleaned from the Oracle database according to these two tables;
Read a switch configuration table, and remove the invalid data from the policy information to be cleaned according to the switch configuration table.
It should be understood that, in the embodiment of the present invention, the processor 501 may be a central processing unit (CPU) and/or a graphics processing unit (GPU), and may additionally be combined with other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components and the like.
Input device 502 may include a touchpad, a fingerprint sensor (used to collect the user's fingerprint information and fingerprint direction information), a microphone, a communication module (such as a Wi-Fi module or a 2G/3G/4G network module), physical buttons and the like.
Output device 503 may include a display (such as an LCD), a loudspeaker and the like. The display may be used to show information entered by the user or information provided to the user. The display may include a display panel, which may optionally be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display or the like. Further, the touchpad described above may cover the display: when the touchpad detects a touch operation on or near it, it passes the operation to the processor 501 to determine the type of touch event, and the processor 501 then provides a corresponding visual output on the display according to that type.
In a specific implementation, the processor 501, input device 502, output device 503 and memory 504 described in the embodiment of the present invention can carry out the implementations described in the embodiments of the data cleansing method provided by the present invention, which are not repeated here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the devices and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed method and terminal may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules and units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units and modules in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units or modules may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
- 1. A data cleansing method, characterized in that the method comprises: when a cleaning task for traditional payout data is received, inserting the cleaning task into a preset task execution table and setting an execution time corresponding to the cleaning task; when the execution time arrives, obtaining a scheduling package through the Oracle database and, according to the scheduling package, performing data cleansing on policy information in a multi-concurrent manner; and submitting and storing the cleaned policy information in the Oracle database in batches.
- 2. The data cleansing method according to claim 1, characterized in that performing data cleansing on the policy information in a multi-concurrent manner comprises: reading a concurrency configuration table, and starting several concurrent processes according to the concurrency configuration table; computing the remainder of dividing the last digit of the policy number corresponding to the policy information by the total number of concurrent processes, and assigning the policy information to the concurrent process corresponding to the remainder; and reading the pending policy information using a cursor, caching the policy information read in a first preset array, and submitting it in batches to the assigned concurrent process, which performs data cleansing on the policy information according to a preset data cleansing algorithm.
- 3. The data cleansing method according to claim 1, characterized in that submitting and storing the cleaned policy information in the Oracle database in batches comprises: reading the policy information cleaned by the concurrent processes into a second preset array, and submitting the policy information in the second preset array to the Oracle database in batches using the commit command; and storing the policy information in each batch into the corresponding result table in the Oracle database according to the execution time of the policy information in that batch and the process number of the corresponding concurrent process.
- 4. The data cleansing method according to claim 2, characterized in that, after several concurrent processes are started according to the concurrency configuration table, the method further comprises: obtaining the status information of the cleaning task from a log table; if the status information of the cleaning task indicates that execution succeeded, not executing this cleaning task again; and if the status information of the cleaning task indicates that execution failed, deleting the data already processed in the concurrent processes and re-executing this cleaning task.
- 5. The data cleansing method according to claim 2, characterized in that, before the concurrency configuration table is read, the method further comprises: obtaining a first- and second-level organization configuration table and a policy information basic configuration table, and screening out the policy information to be cleaned from the Oracle database according to these two tables; and reading a switch configuration table, and removing the invalid data from the policy information to be cleaned according to the switch configuration table.
- 6. A terminal, characterized in that the terminal comprises: a task receiving module, configured to, when a cleaning task for traditional payout data is received, insert the cleaning task into a preset task execution table and set an execution time corresponding to the cleaning task; a concurrent cleaning module, configured to, when the execution time arrives, obtain a scheduling package through the Oracle database and, according to the scheduling package, perform data cleansing on policy information in a multi-concurrent manner; and a storage module, configured to submit and store the cleaned policy information in the Oracle database in batches.
- 7. The terminal according to claim 6, characterized in that the concurrent cleaning module comprises: a start unit, configured to read a concurrency configuration table and start several concurrent processes according to the concurrency configuration table; an allocation unit, configured to compute the remainder of dividing the last digit of the policy number corresponding to the policy information by the total number of concurrent processes, and assign the policy information to the concurrent process corresponding to the remainder; and a cleaning unit, configured to read the pending policy information using a cursor, cache the policy information read in a first preset array, and submit it in batches to the assigned concurrent process, which performs data cleansing on the policy information according to a preset data cleansing algorithm.
- 8. The terminal according to claim 6, characterized in that the storage module comprises: a submission unit, configured to read the policy information cleaned by the concurrent processes into a second preset array, and submit the policy information in the second preset array to the Oracle database in batches using the commit command; and a storage unit, configured to store the policy information in each batch into the corresponding result table in the Oracle database according to the execution time of the policy information in that batch and the process number of the corresponding concurrent process.
- 9. The terminal according to claim 7, characterized in that the terminal further comprises: a state recognition module, configured to, after several concurrent processes are started according to the concurrency configuration table, obtain the status information of the cleaning task from a log table; if the status information of the cleaning task indicates that execution succeeded, not execute this cleaning task again; and if the status information of the cleaning task indicates that execution failed, delete the data already processed in the concurrent processes and re-execute this cleaning task.
- 10. The terminal according to claim 7, characterized in that the terminal further comprises: a screening module, configured to, before the concurrency configuration table is read, obtain a first- and second-level organization configuration table and a policy information basic configuration table, and screen out the policy information to be cleaned from the Oracle database according to these two tables; and read a switch configuration table, and remove the invalid data from the policy information to be cleaned according to the switch configuration table.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710221427.8A CN107688592B (en) | 2017-04-06 | 2017-04-06 | Data cleaning method and terminal |
PCT/CN2018/074858 WO2018184418A1 (en) | 2017-04-06 | 2018-01-31 | Data cleaning method, terminal and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710221427.8A CN107688592B (en) | 2017-04-06 | 2017-04-06 | Data cleaning method and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107688592A true CN107688592A (en) | 2018-02-13 |
CN107688592B CN107688592B (en) | 2020-03-17 |
Family
ID=61152355
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710221427.8A Active CN107688592B (en) | 2017-04-06 | 2017-04-06 | Data cleaning method and terminal |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107688592B (en) |
WO (1) | WO2018184418A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111597180A (en) * | 2020-05-19 | 2020-08-28 | 山东汇贸电子口岸有限公司 | Data cleaning method of OTRS system based on storage process |
CN112925772A (en) * | 2019-12-06 | 2021-06-08 | 北京沃东天骏信息技术有限公司 | Data dynamic splitting method and device |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112800043A (en) * | 2021-02-05 | 2021-05-14 | 凯通科技股份有限公司 | Internet of things terminal information extraction method, device, equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8060700B1 (en) * | 2008-12-08 | 2011-11-15 | Nvidia Corporation | System, method and frame buffer logic for evicting dirty data from a cache using counters and data types |
CN103593352A (en) * | 2012-08-15 | 2014-02-19 | 阿里巴巴集团控股有限公司 | Method and device for cleaning mass data |
CN106202346A (en) * | 2016-06-29 | 2016-12-07 | 浙江理工大学 | A kind of data load and clean engine, dispatch and storage system |
CN106294492A (en) * | 2015-06-08 | 2017-01-04 | 深圳中兴网信科技有限公司 | Data cleaning method and cleaning engine |
CN106294745A (en) * | 2016-08-10 | 2017-01-04 | 东方网力科技股份有限公司 | Big data cleaning method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514205A (en) * | 2012-06-27 | 2014-01-15 | 中国电信股份有限公司 | Mass data processing method and system |
CN103942104A (en) * | 2014-04-23 | 2014-07-23 | 北京金山网络科技有限公司 | Task managing method and device |
CN105205105B (en) * | 2015-08-27 | 2019-04-16 | 浪潮集团有限公司 | A kind of ETL process system and processing method based on storm |
CN105787008A (en) * | 2016-02-23 | 2016-07-20 | 浪潮通用软件有限公司 | Data deduplication cleaning method for large data volume |
CN106484915B (en) * | 2016-11-03 | 2019-10-11 | 国家电网公司信息通信分公司 | A kind of cleaning method and system of mass data |
- 2017-04-06: CN application CN201710221427.8A, patent CN107688592B (en), status Active
- 2018-01-31: WO application PCT/CN2018/074858, publication WO2018184418A1 (en), status Application Filing
Non-Patent Citations (1)
Title |
---|
赵鹏: ""基于软件总线模型的数据清洗系统的研究与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Also Published As
Publication number | Publication date |
---|---|
WO2018184418A1 (en) | 2018-10-11 |
CN107688592B (en) | 2020-03-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107231264A (en) | For the method and apparatus for the capacity for managing Cloud Server | |
CN109285076A (en) | Intelligent core protects processing method, server and storage medium | |
CN107748696A (en) | The method and terminal device of a kind of task scheduling | |
CN109344153A (en) | The processing method and terminal device of business datum | |
CN107688592A (en) | The method and terminal of data cleansing | |
CN108874738A (en) | Distributed parallel operation method, device, computer equipment and storage medium | |
CN104503831B (en) | Equipment optimization method and device | |
CN109669933A (en) | Transaction data intelligent processing method, device and computer readable storage medium | |
CN108255607A (en) | Task processing method, device, electric terminal and readable storage medium storing program for executing | |
CN107784070A (en) | A kind of method, apparatus and equipment for improving data cleansing efficiency | |
CN108183933A (en) | Information push method, apparatus and system, electronic equipment and computer storage media | |
CN109309712A (en) | Data transmission method, server and the storage medium called based on interface asynchronous | |
CN106156998A (en) | The management method of a kind of Pending tasks and device | |
CN108376171A (en) | Method, apparatus, terminal device and the storage medium that big data quickly introduces | |
CN107516158A (en) | A kind of method for allocating tasks, device and terminal device | |
CN108804484A (en) | The data measures and procedures for the examination and approval, equipment and computer readable storage medium | |
CN108153877A (en) | Data dictionary methods of exhibiting, device, terminal device and storage medium | |
CN106639617A (en) | Electronic worshipping system for cemetery | |
CN109344296A (en) | Realize domain life cycle control method, system, server and the storage medium of the HASH key of Redis | |
CN109189790A (en) | Data managing method, device, computer equipment and storage medium | |
CN109086289A (en) | A kind of media data processing method, client, medium and equipment | |
CN110083457A (en) | A kind of data capture method, device and data analysing method, device | |
CN110533396A (en) | Material binding method, material binding device and terminal device | |
CN107993016A (en) | Take control management method and terminal device | |
CN111221650A (en) | System resource recovery method and device based on process type association |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||