CN101609466B - Method for duplicate checking of mass data and system thereof

Info

Publication number: CN101609466B
Application number: CN2009101085699A
Authority: CN
Inventors: 牛国扬
Original assignee: ZTE Corp
Priority date: 2009-07-01
Filing date: 2009-07-01
Publication date: 2012-11-28
Anticipated expiration: 2029-07-01
Also published as: CN101609466A

Abstract

The invention discloses a method and a system for duplicate checking of mass data. The method comprises the following steps: extracting data keywords from the mass data, where a data keyword is used to distinguish the record it belongs to from other records; partitioning the data keywords according to their first N+M letters, and putting data keywords with the same first N+M letters into the same file to obtain keyword data files, where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers; and performing duplicate checking on the data in each keyword data file to obtain the duplicate-checking result. The method makes it possible to carry out duplicate checking of mass data independently in a low-configuration environment.

Description

Method and system for duplicate checking of mass data

Technical field

The present invention relates to data processing technology, and in particular to a method and a system for duplicate checking of mass data.

Background technology

As operators such as China Telecom, China Mobile, China Unicom and China Netcom expand their business and adjust their management functions, data is imported and exported more and more frequently, both between an operator's internal systems and between operators. During import and export, checking the correctness of mass data becomes increasingly important, and this includes checking whether the mass data contains duplicates. Duplicate data can cause system exceptions, business processing failures, repeated charging of users and so on, and seriously affects the normal operation of the system.

Existing tools and methods for duplicate checking of mass data either occupy a huge amount of memory or need the support of a dedicated database, so the work cannot be done on an ordinary PC. In day-to-day work, however, duplicate checking of mass data often has to be carried out on an ordinary PC.

Summary of the invention

The objective of the invention is to solve the problem of completing duplicate checking of mass data independently on an ordinary PC. To that end it provides a method and a system for duplicate checking of mass data that work independently in a low-configuration environment.

The present invention provides a method for duplicate checking of mass data, the method comprising:

Extracting the data keywords from the mass data, where a data keyword is used to distinguish the record it belongs to from other records;

Partitioning the data keywords according to their first N+M letters, and putting data keywords whose first N+M letters are identical into the same file, to obtain keyword data files; where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers;

Performing duplicate checking on the data in each keyword data file separately, to obtain the duplicate-checking result.

The present invention also provides a system for duplicate checking of mass data, the system comprising:

A keyword unit, configured to extract the data keywords from the mass data, where a data keyword is used to distinguish the record it belongs to from other records;

A segmentation unit, configured to partition the data keywords extracted by the keyword unit according to their first N+M letters, and to put data keywords whose first N+M letters are identical into the same file, to obtain keyword data files; where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers;

A duplicate-checking unit, configured to perform duplicate checking on the data in each keyword data file obtained by the segmentation unit, to obtain the duplicate-checking result.

With the technical scheme of the invention, mass data can be partitioned by keyword into keyword data files that each hold a small amount of data, and duplicate checking is then performed on those files. The demands on the running environment are therefore low: neither a huge amount of memory nor a dedicated database is needed, and the scheme can run on an ordinary PC. Implementing the scheme costs little, and duplicate checking is fast, efficient and reliable. Because the requirements on the running environment are low, duplicate checking can be carried out independently in a low-configuration environment, and the scheme is easy to implement and easy to port.

Description of drawings

Fig. 1 is a schematic flow chart of the method for duplicate checking of mass data according to the invention;

Fig. 2 is an overall flow chart of an application of the method for duplicate checking of mass data according to the invention;

Fig. 3 shows the working principle of the data preprocessing module in Fig. 2;

Fig. 4 shows the working principle of the data segmentation module in Fig. 2;

Fig. 5 shows the working principle of the data duplicate-checking module in Fig. 2;

Fig. 6 is a structural diagram of the system for duplicate checking of mass data according to the invention.

Embodiment

Specific embodiments of the invention are described in detail below with reference to the accompanying drawings. In these embodiments, mass data means a huge, unprecedentedly large volume of data. Many business departments now have to work with mass data: planning departments hold planning data, water-conservancy departments hold water-conservancy data, meteorological departments hold meteorological data, and operators such as China Telecom, China Mobile, China Unicom and China Netcom hold data within and between their business systems. The volume of data these departments handle is very large, and it includes spatial data, report and statistical data, text, sound, images, hypertext and other kinds of environmental and cultural data.

Fig. 1 shows an embodiment of the method for duplicate checking of mass data according to the invention. Referring to Fig. 1, the method comprises:

101. Extract the data keywords from the mass data, where a data keyword is used to distinguish the record it belongs to from other records.

102. Partition the data keywords according to their first N+M letters, and put data keywords whose first N+M letters are identical into the same file, to obtain keyword data files; where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers.

Specifically, this step can comprise:

Putting data keywords whose first N+1 letters are identical into the same file, to obtain preprocessed data files, each preprocessed data file being named after those first N+1 letters;

Searching the preprocessed data files for files whose data volume is greater than a preset value; the files found serve as secondary-processing data files, and the remaining preprocessed data files serve as keyword data files;

Continuing to split the data in each secondary-processing data file, starting from the first N+1 letters and advancing P letters at a time, until the data volume in each resulting file is not greater than the preset value; the resulting files then serve as keyword data files; where P is a natural number.

The preset value is preferably 100,000, a value obtained through repeated tests. If the preset value is greater than 100,000, the workload of duplicate checking on each keyword data file in step 103 increases; if it is less than 100,000, the number of splitting passes over the secondary-processing data files increases. Taking all factors into account, using 100,000 as the preset value gives the highest overall efficiency for duplicate checking of mass data.

Further, for a secondary-processing data file, the method can:

Judge whether the data volume in the secondary-processing data file is greater than a threshold;

When the data volume in the secondary-processing data file is greater than the threshold, continue to split the data in the file, starting from the first N+1 letters and advancing 2 letters at a time, until the data volume in each resulting file is not greater than the preset value.

The threshold is preferably 10 million. When the data volume in a secondary-processing data file is greater than the threshold, advancing 2 letters at a time makes the continued splitting more efficient, and thus improves the overall efficiency of duplicate checking.

103. Perform duplicate checking on the data in each keyword data file separately, to obtain the duplicate-checking result.

Further, when duplicate checking of the data in each keyword data file yields duplicate keyword data, the mass data is traversed: if the keyword of a record in the mass data exists in the duplicate keyword data, the record is duplicate data; otherwise the record is normal data. And/or, when duplicate checking of the data in each keyword data file yields no duplicate keyword data, all of the mass data is normal data.

Fig. 2 is an overall flow chart of an application of the method for duplicate checking of mass data according to the invention. Fig. 3 shows the working principle of the data preprocessing module in Fig. 2; Fig. 4 shows the working principle of the data segmentation module in Fig. 2; Fig. 5 shows the working principle of the data duplicate-checking module in Fig. 2. In this application, steps 101 and 102 of the above embodiment are implemented by the data preprocessing module and the data segmentation module, and step 103 is implemented by the data duplicate-checking module, where:

The data preprocessing module implements two functions: first, extracting the keywords of the mass data; second, splitting the keywords.

Duplicate checking of data means duplicate checking of the data keywords. For example, checking mobile phone user information for duplicates mainly means checking whether the user keyword "user mobile phone number" repeats.

When the data preprocessing module extracts the keywords, it also splits the keyword data, which reduces the number of traversals the data segmentation module has to make and improves system efficiency.

The data segmentation module is mainly used to split the "keyword data files" obtained by the data preprocessing module according to a certain principle, to obtain small data files that are more convenient to process. (The "keyword data file" in this application differs from the keyword data file in the above embodiment: here it means a file produced by the data preprocessing module, whereas in the above embodiment it means a final file into which the mass data has been split.)

The concrete splitting procedure is as follows. For a keyword data file, let N be the length of its file name (excluding the extension). If the file holds more than 100,000 records, its data is split by the first (N+1) letters of each record, and each resulting group is stored in a file named after those first (N+1) letters. This is repeated until no file among the "keyword data files" holds more than 100,000 records.

Splitting principle: after splitting, every record in a file begins with that file's name. This guarantees that duplicate records can only appear within the same file, and that different files never contain duplicates of each other. When checking for duplicates, it is therefore enough to check each file on its own. Splitting turns the processing of one large data set into the processing of many small ones, which solves the problem of insufficient machine memory and improves duplicate-checking efficiency.
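The splitting principle can be illustrated with a short sketch. It is a minimal in-memory model under stated assumptions, not the patented implementation: Python dictionaries stand in for the prefix-named files, and the function name `split_bucket` and its parameters are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def split_bucket(prefix, keys, preset):
    """Split `keys` (all starting with `prefix`) into prefix-named buckets.

    A bucket is split one more character at a time until it holds at most
    `preset` keys, or until its keys are too short to be split any further.
    """
    cannot_extend = all(len(key) <= len(prefix) for key in keys)
    if len(keys) <= preset or cannot_extend:
        return {prefix: list(keys)}

    # Group the keys by one additional leading character.
    children = defaultdict(list)
    for key in keys:
        children[key[:len(prefix) + 1]].append(key)

    result = {}
    for child_prefix, child_keys in children.items():
        if child_prefix == prefix:
            # These keys equal the prefix itself and cannot be split further.
            result.setdefault(prefix, []).extend(child_keys)
        else:
            result.update(split_bucket(child_prefix, child_keys, preset))
    return result


if __name__ == "__main__":
    # Tiny preset so that the splitting is visible on a handful of keys.
    keys = ["1234", "1361", "1299", "2361", "2363", "2364"]
    for name, members in split_bucket("", keys, preset=2).items():
        print(name, members)
```

Because a key is only ever placed in a bucket whose name is a prefix of that key, two identical keys always follow the same path and end up in the same bucket, which is why each bucket can be checked on its own.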

The data duplicate-checking module is mainly used to: find the "duplicate keyword data" from the small keyword data files obtained by the data segmentation module; traverse the source file (that is, the mass data), compare the data in the source file with the "duplicate keyword data", and thereby divide the original data into "normal data" and "duplicate data"; and generate a "duplicate-checking report" from the duplicate-checking result.

Referring to Figs. 2, 3, 4 and 5, the main task of the data preprocessing module is to extract the keywords and to perform an initial split of the keyword data according to differences among the keywords.

The module is explained below using product information as an example.

The product information format is as follows:

Product name, product code, product description, product type, application time

The specific product data are:

Music30,2361, music service, COS, 20081230

Film012,1234, electric English channel, COS, 20081030

Music12,2363, music service, COS, 20080230

Myboy01,2364, educate English service, COS, 20071230

Music38,1361, music service, COS, 20051230

Suppose the product code is the unique primary key of a product. During keyword extraction, the second column, the product code, is extracted as the keyword; checking whether the product code repeats is enough to determine whether a product record is a duplicate.

The keywords are then split. Splitting by the first letter of the product code stores "1234, 1361" in file 1.txt and "2361, 2363, 2364" in file 2.txt. In this way, every record in a file begins with the file's name.

Why the data preprocessing module performs an initial split of the keyword data by leading letter: extracting the keywords already requires one traversal of the original data. Splitting the keyword data by first letter during that same traversal reduces the number of times the data segmentation module has to traverse the keywords, and improves the operating efficiency of the whole system.
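The product example can be reproduced with a short sketch that extracts the keyword and performs the initial split in a single pass over the records. This is an illustration only: the function names `preprocess` and `read_keys`, the comma-separated layout and the assumption that every prefix is a valid file name are not taken from the patent. The records are copied from the example above.

```python
import os
import tempfile

RECORDS = [
    "Music30,2361,music service,COS,20081230",
    "Film012,1234,electric English channel,COS,20081030",
    "Music12,2363,music service,COS,20080230",
    "Myboy01,2364,educate English service,COS,20071230",
    "Music38,1361,music service,COS,20051230",
]


def preprocess(records, out_dir, key_column=1, prefix_len=1):
    """Extract the keyword of each record and append it to '<prefix>.txt'.

    Extraction and the initial split happen in the same traversal.
    Returns the sorted list of prefixes for which a file was written.
    """
    handles = {}
    try:
        for record in records:
            key = record.split(",")[key_column].strip()
            prefix = key[:prefix_len]
            if prefix not in handles:
                path = os.path.join(out_dir, prefix + ".txt")
                handles[prefix] = open(path, "w", encoding="utf-8")
            handles[prefix].write(key + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


def read_keys(path):
    """Read back the keywords stored in one prefix file."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        for prefix in preprocess(RECORDS, out_dir):
            path = os.path.join(out_dir, prefix + ".txt")
            print(prefix + ".txt", read_keys(path))
```

Keeping one open handle per prefix avoids reopening files for every record; with single-character prefixes the number of simultaneously open files stays small.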

A tip for the data preprocessing module: if all data keywords share the same first letter, for example mobile phone numbers that all begin with 1, the data preprocessing module can split the keyword data by the first 2 letters of the keyword. More generally, if the first N letters of the data keywords are identical, the split can be made by the first (N+1) letters. For example, China Telecom phone numbers have the form 189XXXXXXXX, so in practice they can be split by the first 4 letters.
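A minimal sketch of this tip follows. It assumes the keywords are available as a list so that the shared prefix can be measured; in practice N would more likely be known in advance from the data format. The function names and the phone numbers are hypothetical.

```python
import os
from collections import defaultdict


def initial_prefix_length(keys):
    """Return N + 1, where N is the length of the prefix shared by all keys."""
    # os.path.commonprefix compares strings character by character.
    common = os.path.commonprefix(list(keys))
    return len(common) + 1


def split_by_prefix(keys, prefix_len):
    """Group keys by their first `prefix_len` characters."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[key[:prefix_len]].append(key)
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical numbers of the form 189XXXXXXXX, so N = 3.
    phones = ["18912345678", "18998765432", "18900001111"]
    length = initial_prefix_length(phones)
    print(length, split_by_prefix(phones, length))
```

Splitting on fewer than N+1 characters would put every keyword in one file, so the first pass would do no useful work.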

The data segmentation module is the core module of the system. Its main task is to split the "keyword data files" generated by the data preprocessing module into small data files of no more than 100,000 records each.

Splitting condition: only data files among the "keyword data files" that hold more than 100,000 records are split.

Splitting procedure: for a "keyword data file", let N be the length of its file name (excluding the extension). The data in the file is split by the first (N+1) letters of each record, and each resulting group is stored in a file named after those first (N+1) letters. This is repeated until no file among the "keyword data files" holds more than 100,000 records.

Splitting result: every record in each "keyword data file" has the file's name as its prefix, and no "keyword data file" holds more than 100,000 records.

Origin of the 100,000 figure: 100,000 is the data volume used to decide whether splitting continues, and it was obtained through repeated tests. If the decision value is greater than 100,000, the workload of the data duplicate-checking module increases; if it is less than 100,000, the number of splitting passes made by the data segmentation module increases. Taking all factors into account, using 100,000 as the condition for continued splitting gives the highest overall efficiency for duplicate checking of mass data.

Splitting tip: when splitting a "keyword data file" that holds more than 10 million records, the data can be split by the first (N+2) letters, advancing 2 letters at a time. Tests show that this is more efficient.
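The two limits described above can be combined into a small decision helper. This is a sketch under stated assumptions: the names `choose_step` and `split_once` are hypothetical, and the constants are the preferred values given in the text.

```python
from collections import defaultdict

PRESET = 100_000        # files holding more keys than this are split again
THRESHOLD = 10_000_000  # above this, advance two characters per pass


def choose_step(key_count, preset=PRESET, threshold=THRESHOLD):
    """Return how many characters to advance for a file with `key_count` keys.

    0 means the file is small enough and is left as it is.
    """
    if key_count <= preset:
        return 0
    return 2 if key_count > threshold else 1


def split_once(keys, prefix_len, step):
    """Split keys that share a prefix of length `prefix_len` by `step` more characters."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[key[:prefix_len + step]].append(key)
    return dict(buckets)
```

Advancing two characters at once replaces two passes over a very large file with a single pass, at the cost of creating more files per pass.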

The main task of the data duplicate-checking module is to check the "keyword data" generated by the data segmentation module for duplicates, obtain the "duplicate keyword data", and then traverse the original data. If the keyword of a record in the original data exists in the "duplicate keyword data", the record is duplicate data; if it does not, the record is normal data. Finally, a "duplicate-checking report" is produced from the duplicate-checking result.

The data duplicate-checking module first checks the many "keyword data files" of no more than 100,000 records one by one, and then obtains the overall "duplicate keyword data". Because every record in each file begins with that file's name, the duplicates found by checking each "keyword data file" separately are, taken together, exactly the duplicates in the original data. Because no "keyword data file" holds more than 100,000 records, the efficiency of the check is guaranteed.

If there is no "duplicate keyword data", the original data contains no duplicates. In that case the original data need not be traversed again, and the "duplicate-checking report" can be produced directly.
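The work of the data duplicate-checking module can be sketched as follows. It is a simplified in-memory illustration, not the patented implementation: the function names are hypothetical, the buckets stand in for the small keyword data files, and the record "Music99,2361" is an invented duplicate added for the demonstration.

```python
from collections import Counter


def find_duplicate_keys(buckets):
    """Check each bucket on its own and collect every key seen more than once."""
    duplicates = set()
    for keys in buckets.values():
        counts = Counter(keys)
        duplicates.update(key for key, count in counts.items() if count > 1)
    return duplicates


def classify(records, key_of, duplicates):
    """Divide the original records into (normal, repeated) lists."""
    if not duplicates:
        # No duplicate keys were found, so no second traversal is needed.
        return list(records), []
    normal, repeated = [], []
    for record in records:
        target = repeated if key_of(record) in duplicates else normal
        target.append(record)
    return normal, repeated


if __name__ == "__main__":
    records = ["Music30,2361", "Film012,1234", "Music12,2363", "Music99,2361"]
    buckets = {"1": ["1234"], "2": ["2361", "2363", "2361"]}
    duplicates = find_duplicate_keys(buckets)
    normal, repeated = classify(records, lambda r: r.split(",")[1], duplicates)
    print("normal:", normal)
    print("repeated:", repeated)
```

Only one bucket needs to be in memory at a time when the duplicate keys are collected, which is what keeps the memory requirement low.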

Fig. 6 shows an embodiment of the system for duplicate checking of mass data according to the invention. Referring to Fig. 6, the system comprises:

A keyword unit 601, configured to extract the data keywords from the mass data, where a data keyword is used to distinguish the record it belongs to from other records;

A segmentation unit 602, configured to partition the data keywords extracted by the keyword unit according to their first N+M letters, and to put data keywords whose first N+M letters are identical into the same file, to obtain keyword data files; where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers;

A duplicate-checking unit 603, configured to perform duplicate checking on the data in each keyword data file obtained by the segmentation unit, to obtain the duplicate-checking result.

The segmentation unit can comprise:

A preprocessing subunit, configured to put data keywords whose first N+1 letters are identical into the same file, to obtain preprocessed data files, each preprocessed data file being named after those first N+1 letters;

A search subunit, configured to search the preprocessed data files obtained by the preprocessing subunit for files whose data volume is greater than a preset value; the files found serve as secondary-processing data files, and the remaining preprocessed data files serve as keyword data files;

A reprocessing subunit, configured to continue splitting the data in each secondary-processing data file found by the search subunit, starting from the first N+1 letters and advancing P letters at a time, until the data volume in each resulting file is not greater than the preset value; the resulting files then serve as keyword data files; where P is a natural number.

The preprocessing subunit and the reprocessing subunit can be provided separately or integrated together.

Further, the reprocessing subunit can comprise:

A judging module, configured to judge whether the data volume in a secondary-processing data file is greater than a threshold;

A splitting module, configured to, when the judging module determines that the data volume in the secondary-processing data file is greater than the threshold, continue splitting the data in the file, starting from the first N+1 letters and advancing 2 letters at a time, until the data volume in each resulting file is not greater than the preset value.

Further, the system can also comprise:

A result unit, configured to, when the duplicate-checking unit checks the data in each keyword data file and obtains duplicate keyword data, traverse the mass data: if the keyword of a record in the mass data exists in the duplicate keyword data, the record is duplicate data; otherwise the record is normal data; and/or,

when duplicate checking of the data in each keyword data file yields no duplicate keyword data, determine that all of the mass data is normal data.

The system for duplicate checking of mass data in this embodiment can be provided independently or integrated into an ordinary PC.

The above is only one embodiment of the invention. It should be noted that those skilled in the art can make improvements and refinements without departing from the principle of the invention, and such improvements and refinements also fall within the scope of protection of the invention.

Claims

1. A method for duplicate checking of mass data, characterized in that the method comprises:

Partitioning said data keywords according to the first N+M characters of said data keywords, and putting data keywords whose first N+M characters are identical into the same file, to obtain keyword data files, the data volume in each said keyword data file being not greater than a preset value; wherein the first N characters of the data keywords in different keyword data files are identical, the first N+M characters differ, and N and M are positive integers;

2. The method according to claim 1, characterized in that said partitioning of said data keywords according to the first N+M characters of said data keywords, and putting data keywords whose first N+M characters are identical into the same file to obtain keyword data files, comprises:

Putting data keywords whose first N+1 characters are identical into the same file, to obtain preprocessed data files, each said preprocessed data file being named after those first N+1 characters;

Searching said preprocessed data files for files whose data volume is greater than the preset value, the files found serving as secondary-processing data files and the remaining preprocessed data files serving as keyword data files;

Continuing to split the data in each secondary-processing data file, starting from the first N+1 characters and advancing P further characters at a time, until the data volume in each resulting file is not greater than the preset value, the resulting files then serving as keyword data files; wherein P is a natural number.

3. The method according to claim 2, characterized in that said preset value is 100,000.

4. The method according to claim 2 or 3, characterized in that said P is 2, and said continuing to split the data in each secondary-processing data file, starting from the first N+1 characters and advancing P further characters at a time, until the data volume in each resulting file is not greater than the preset value, comprises:

Judging whether the data volume in said secondary-processing data file is greater than a threshold;

When the data volume in said secondary-processing data file is greater than the threshold, continuing to split the data in the secondary-processing data file, starting from the first N+1 characters and advancing 2 further characters at a time, until the data volume in each resulting file is not greater than the preset value.

5. The method according to claim 4, characterized in that said threshold is 10 million.

6. The method according to claim 1, characterized in that the method further comprises:

When duplicate checking of the data in each keyword data file yields duplicate keyword data, traversing said mass data; if a data keyword in said mass data exists in said duplicate keyword data, the mass data corresponding to that data keyword is duplicate data; otherwise that mass data is normal data; and/or,

When duplicate checking of the data in each keyword data file yields no duplicate keyword data, all of said mass data is normal data.

7. A system for duplicate checking of mass data, characterized in that the system comprises:

A segmentation unit, configured to partition said data keywords according to the first N+M characters of the data keywords extracted by said keyword unit, and to put data keywords whose first N+M characters are identical into the same file, to obtain keyword data files, the data volume in each said keyword data file being not greater than a preset value; wherein the first N characters of the data keywords in different keyword data files are identical, the first N+M characters differ, and N and M are positive integers;

8. The system according to claim 7, characterized in that said segmentation unit comprises:

A preprocessing subunit, configured to put data keywords whose first N+1 characters are identical into the same file, to obtain preprocessed data files, each said preprocessed data file being named after those first N+1 characters;

A search subunit, configured to search the preprocessed data files obtained by said preprocessing subunit for files whose data volume is greater than the preset value, the files found serving as secondary-processing data files and the remaining preprocessed data files serving as keyword data files;

A reprocessing subunit, configured to continue splitting the data in each secondary-processing data file found by said search subunit, starting from the first N+1 characters and advancing P further characters at a time, until the data volume in each resulting file is not greater than the preset value, the resulting files then serving as keyword data files; wherein P is a natural number.

9. The system according to claim 8, characterized in that said reprocessing subunit comprises:

A judging module, configured to judge whether the data volume in said secondary-processing data file is greater than a threshold;

A splitting module, configured to, when said judging module determines that the data volume in said secondary-processing data file is greater than the threshold, continue splitting the data in the secondary-processing data file, starting from the first N+1 characters and advancing 2 further characters at a time, until the data volume in each resulting file is not greater than the preset value.

10. The system according to claim 7, characterized in that the system further comprises:

A result unit, configured to, when said duplicate-checking unit checks the data in each keyword data file and obtains duplicate keyword data, traverse said mass data; if a data keyword in said mass data exists in said duplicate keyword data, the mass data corresponding to that data keyword is duplicate data; otherwise that mass data is normal data; and/or,