CN101609466B - Method for duplicate checking of mass data and system thereof - Google Patents

Method for duplicate checking of mass data and system thereof

Info

Publication number
CN101609466B
Authority
CN
China
Prior art keywords
data
file
key
character
key words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009101085699A
Other languages
Chinese (zh)
Other versions
CN101609466A (en)
Inventor
牛国扬
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN2009101085699A
Publication of CN101609466A
Application granted
Publication of CN101609466B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for duplicate checking of mass data. The method comprises: extracting data keywords from the mass data, where a data keyword distinguishes the data item it belongs to from other data; partitioning the data keywords by their first N+M letters and placing data keywords with the same first N+M letters in one file, to obtain keyword data files, where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers; and checking the data in each keyword data file for duplicates to obtain a duplicate-checking result. The method makes it possible to check mass data for duplicates independently in a low-configuration environment.

Description

Method and system for duplicate checking of mass data
Technical field
The present invention relates to data processing technology, and in particular to a method and system for duplicate checking of mass data.
Background
As operators such as China Telecom, China Mobile, China Unicom and China Netcom expand their business and adjust their management functions, data is imported and exported ever more frequently, both between an operator's internal systems and between operators. During import and export, checking the correctness of mass data becomes increasingly important, and this includes checking whether the data contains duplicates. Duplicate data can cause system exceptions, failed business processing, repeated charging of users and similar faults, and seriously affects the normal operation of a system.
Existing tools and methods for duplicate checking of mass data either occupy a very large amount of memory or require a dedicated database, so they cannot do the job on an ordinary PC. In routine work, however, duplicate checking of mass data often has to be carried out on an ordinary PC.
Summary of the invention
The object of the present invention is to solve the problem of completing duplicate checking of mass data independently on an ordinary PC. To that end it provides a method and a system for duplicate checking of mass data that work independently in a low-configuration environment.
The present invention provides a method for duplicate checking of mass data, the method comprising:
extracting data keywords from the mass data, where a data keyword distinguishes the data item it belongs to from other data;
partitioning the data keywords by their first N+M letters, and placing data keywords whose first N+M letters are identical in the same file, to obtain keyword data files; where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers;
checking the data in each keyword data file for duplicates separately, to obtain a duplicate-checking result.
The present invention also provides a system for duplicate checking of mass data, the system comprising:
a keyword unit, configured to extract data keywords from the mass data, where a data keyword distinguishes the data item it belongs to from other data;
a partitioning unit, configured to partition the data keywords extracted by the keyword unit by their first N+M letters, and to place data keywords whose first N+M letters are identical in the same file, to obtain keyword data files; where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers;
a duplicate-checking unit, configured to check the data in each keyword data file obtained by the partitioning unit for duplicates separately, to obtain a duplicate-checking result.
With the technical solution of the present invention, mass data is partitioned by keyword into keyword data files of small data volume, and duplicate checking is then performed on those files. The demands on the running environment are therefore low: neither a very large memory nor a dedicated database is needed, and the solution can run on an ordinary PC. Implementing the solution costs little, and duplicate checking is fast, efficient and reliable. Because of the low environment requirements, duplicate checking can be done independently in a low-configuration environment, and the solution is easy to implement and to port.
Description of drawings
Fig. 1 is a schematic flowchart of the method for duplicate checking of mass data according to the present invention;
Fig. 2 is a general flowchart of an application of the method for duplicate checking of mass data according to the present invention;
Fig. 3 shows the working principle of the data preprocessing module in Fig. 2;
Fig. 4 shows the working principle of the data partitioning module in Fig. 2;
Fig. 5 shows the working principle of the data duplicate-checking module in Fig. 2;
Fig. 6 is a schematic structural diagram of the system for duplicate checking of mass data according to the present invention.
Detailed description of the embodiments
Specific embodiments of the invention are described in detail below with reference to the drawings. In these embodiments, mass data means a very large volume of data. Many business departments now have to operate on mass data: planning departments hold planning data, water-conservancy departments hold water-conservancy data, meteorological departments hold meteorological data, and operators such as China Telecom, China Mobile, China Unicom and China Netcom hold data inside and between their business systems. The volume of data these departments handle is very large. It includes spatial data, report and statistical data, and environmental and cultural data such as text, sound, images and hypertext.
Fig. 1 shows an embodiment of the method of the present invention. Referring to Fig. 1, a method for duplicate checking of mass data comprises:
101. Extract data keywords from the mass data, where a data keyword distinguishes the data item it belongs to from other data.
102. Partition the data keywords by their first N+M letters, and place data keywords whose first N+M letters are identical in the same file, to obtain keyword data files; where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers.
Specifically, this step may comprise:
placing data keywords whose first N+1 letters are identical in the same file to obtain preprocessed data files, each preprocessed data file being named after those first N+1 letters;
finding, among the preprocessed data files, the files whose data volume exceeds a preset value; the files found become secondary-processing data files, and the remaining preprocessed data files become keyword data files;
continuing to partition the data in each secondary-processing data file, starting from the first N+1 letters and advancing P letters at a time, until the data volume of every resulting file is not greater than the preset value; the resulting files become keyword data files, where P is a natural number.
The preset value is preferably 100,000, a value obtained through repeated testing. A preset value above 100,000 increases the workload of checking each keyword data file for duplicates in step 103; a preset value below 100,000 increases the number of partitioning passes over the secondary-processing data files. Taking all factors into account, a preset value of 100,000 gives the highest overall efficiency for duplicate checking of mass data.
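The sub-steps of step 102 can be illustrated with a minimal Python sketch. It is not the patented implementation: it keeps the buckets in memory as dictionaries rather than files so that it stays short and testable, and the names `split_by_prefix` and `partition_keys` are hypothetical. It also assumes that a bucket whose keys are no longer than its prefix cannot be split further and is accepted as is.

```python
from collections import defaultdict


def split_by_prefix(keys, prefix_len):
    """Group keys by their first `prefix_len` characters."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[key[:prefix_len]].append(key)
    return dict(buckets)


def partition_keys(keys, n=0, preset=100_000, p=1):
    """Split keys into prefix buckets holding at most `preset` keys each.

    n      -- number of leading characters shared by all keys
    preset -- maximum bucket size (100,000 is the preferred value in the text)
    p      -- characters to advance on each further split
    """
    final = {}
    # First pass: bucket by the first N+1 characters (the "preprocessed" files).
    pending = list(split_by_prefix(keys, n + 1).items())
    while pending:
        prefix, bucket = pending.pop()
        cannot_split = all(len(key) <= len(prefix) for key in bucket)
        if len(bucket) <= preset or cannot_split:
            final[prefix] = bucket  # small enough: a final keyword bucket
        else:
            # Too large: a "secondary-processing" bucket, split on a longer prefix.
            pending.extend(split_by_prefix(bucket, len(prefix) + p).items())
    return final
```

For example, with a deliberately tiny `preset=2`, the keys `1234, 1361, 2361, 2363, 2364` end up in the buckets `1`, `2361`, `2363` and `2364`: the bucket for prefix `2` is too large and is split repeatedly, while the bucket for prefix `1` is kept.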
Further, for a secondary-processing data file, the method may:
determine whether the data volume of the secondary-processing data file exceeds a threshold;
when the data volume exceeds the threshold, continue to partition the data in the file, starting from the first N+1 letters and advancing 2 letters at a time, until the data volume of every resulting file is not greater than the preset value.
The threshold is preferably 10,000,000. When the data volume of a secondary-processing data file exceeds the threshold, advancing 2 letters at a time makes the continued partitioning more efficient and thus improves the overall efficiency of duplicate checking.
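The threshold rule amounts to choosing the step size from the size of the file being split. A minimal sketch follows; the function name is hypothetical, and the default step of 1 for files at or below the threshold is an assumption, since the text only requires P to be a natural number.

```python
def next_prefix_length(current_length, bucket_size, threshold=10_000_000):
    """Return the prefix length to use for the next split of a bucket.

    Buckets above `threshold` advance 2 characters at once; smaller buckets
    advance 1 character (an assumed default, not fixed by the text).
    """
    step = 2 if bucket_size > threshold else 1
    return current_length + step
```

A file named after a 1-letter prefix that holds 20,000,000 keys would thus be split on 3-letter prefixes, while one holding 500,000 keys would be split on 2-letter prefixes.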
103. Check the data in each keyword data file for duplicates separately, to obtain a duplicate-checking result.
Further, if checking each keyword data file yields duplicate keyword data, the mass data is traversed: a data item whose keyword is present in the duplicate keyword data is duplicate data, and any other item is normal data. And/or, if checking each keyword data file yields no duplicate keyword data, all of the mass data is normal data.
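Step 103 and the classification that follows it can be sketched as below. This is an illustration under assumptions: buckets are in-memory lists rather than files, and `duplicate_keys`, `classify` and the `key_of` callback are hypothetical names.

```python
from collections import Counter


def duplicate_keys(buckets):
    """Check each bucket on its own and collect every key seen more than once."""
    duplicates = set()
    for bucket in buckets.values():
        counts = Counter(bucket)
        duplicates.update(key for key, count in counts.items() if count > 1)
    return duplicates


def classify(records, key_of, duplicates):
    """Split records into (normal, repeated) using the duplicate key set."""
    if not duplicates:
        # No duplicate keys were found, so the source need not be traversed.
        return list(records), []
    normal, repeated = [], []
    for record in records:
        (repeated if key_of(record) in duplicates else normal).append(record)
    return normal, repeated
```

Only one bucket has to be counted at a time, which is what keeps the memory requirement low when the buckets are files on disk.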
Fig. 2 is a general flowchart of an application of the method. Fig. 3 shows the working principle of the data preprocessing module in Fig. 2; Fig. 4 shows the working principle of the data partitioning module in Fig. 2; Fig. 5 shows the working principle of the data duplicate-checking module in Fig. 2. In this application, steps 101 and 102 of the embodiment above are implemented by the data preprocessing module and the data partitioning module, and step 103 is implemented by the data duplicate-checking module, where:
The data preprocessing module performs two functions: extracting the keywords of the mass data, and making a first split of those keywords.
Checking data for duplicates means checking the data keywords for duplicates. For example, checking mobile phone user records for duplicates mainly means checking whether the user keyword "user mobile phone number" repeats.
Splitting the keyword data while the data preprocessing module extracts the keywords reduces the number of traversals the data partitioning module has to make, and so improves system efficiency.
The data partitioning module mainly partitions the "keyword data files" obtained by the data preprocessing module according to a fixed principle, to obtain small data files that are easier to process. (The "keyword data file" in this application differs from the keyword data file in the embodiment above: here it means a file produced by the data preprocessing module, whereas in the embodiment above it means a final file produced by partitioning the mass data.)
The partitioning procedure is as follows. For a keyword data file, let N be the length of its file name (excluding the extension). If the file holds more than 100,000 records, its data is partitioned by the first (N+1) letters of each item, and each item is stored in the file named after its first (N+1) letters. This is repeated until every file among the "keyword data files" holds fewer than 100,000 records.
Partitioning principle: after partitioning, every item in a file begins with that file's name. This guarantees that duplicates of an item can only be in the same file; different files cannot contain mutually duplicate data. Duplicate checking therefore only needs to check each file separately. Partitioning turns the processing of one large data set into the processing of many small ones, which solves the problem of insufficient machine memory and improves the efficiency of duplicate checking.
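The partitioning principle can be checked with a small Python sketch: because identical keys always share every prefix, they always land in the same bucket, so the union of the per-bucket duplicates equals the duplicates of the whole data set. The function names and the sample keys below are hypothetical.

```python
from collections import Counter, defaultdict


def global_duplicates(keys):
    """Duplicates found by counting the whole key list at once."""
    return {key for key, count in Counter(keys).items() if count > 1}


def bucketed_duplicates(keys, prefix_len):
    """Duplicates found by bucketing on a prefix and checking each bucket alone."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[key[:prefix_len]].append(key)
    found = set()
    for bucket in buckets.values():
        found |= global_duplicates(bucket)
    return found


sample = ["1234", "2361", "1234", "2363", "2361", "1361"]
# The two approaches agree for every prefix length.
agree = all(
    bucketed_duplicates(sample, length) == global_duplicates(sample)
    for length in range(1, 5)
)
```

The bucketed version never needs more than one bucket in memory at a time, which is the point of the scheme.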
The data duplicate-checking module mainly does the following: it finds the "duplicate keyword data" in the small keyword files obtained by the data partitioning module; it traverses the source file (that is, the mass data), compares the data in the source file with the "duplicate keyword data", and thereby divides the raw data into "normal data" and "duplicate data"; and it generates a "duplicate-checking report" from the result.
Referring to Figs. 2, 3, 4 and 5, the main task of the data preprocessing module is to extract the keywords and to make an initial split of the keyword data according to how the keywords differ.
The module is explained below using product information as an example.
The product information has the following format:
Name of product, product coding, product description, product type, application time
The specific product data are:
Music30,2361, music service, COS, 20081230
Film012,1234, electric English channel, COS, 20081030
Music12,2363, music service, COS, 20080230
Myboy01,2364, educate English service, COS, 20071230
Music38,1361, music service, COS, 20051230
Assume the product code is the unique primary key of a product. Keyword extraction therefore takes the second column, the product code, as the keyword: checking whether product codes repeat is enough to decide whether product records repeat.
The keywords are then split. Files are split by the first letter of the product code: "1234, 1361" are stored in file 1.txt and "2361, 2363, 2364" in file 2.txt. Every item in a file thus begins with the file's name.
Why the data preprocessing module makes an initial split of the keyword data by first letter: extracting the keywords already requires one traversal of the raw data. Splitting the keyword data by first letter during that same traversal reduces the number of traversals the data partitioning module has to make over the keywords, and improves the running efficiency of the whole system.
A technique for the data preprocessing module: if all data keywords share the same first letter, for example mobile phone numbers that all begin with 1, the module can split the keyword data by the first 2 letters of the keyword. In general, if the first N letters of the data keywords are identical, the split can use the first (N+1) letters. For example, telecom mobile phone numbers have the form 189XXXXXXXX, so the actual split can use the first 4 letters.
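The preprocessing pass described above, which extracts the keyword and makes the initial split in a single traversal, can be sketched in Python as follows. The function names are hypothetical, the phone numbers are made-up illustration values, and the sketch assumes comma-separated records and that the shared prefix length N is known or computed beforehand.

```python
import os
import tempfile


def common_prefix_len(keys):
    """Length N of the prefix shared by every key (0 if there is none)."""
    return len(os.path.commonprefix(list(keys)))


def extract_and_split(lines, key_column, out_dir, n=0):
    """Extract one key per record and write it to '<first n+1 characters>.txt'."""
    handles = {}
    try:
        for line in lines:
            key = line.split(",")[key_column].strip()
            if not key:
                continue  # skip records without a key
            name = key[: n + 1]
            if name not in handles:
                path = os.path.join(out_dir, name + ".txt")
                handles[name] = open(path, "w", encoding="utf-8")
            handles[name].write(key + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Product records from the example; the product code is the second column.
PRODUCT_LINES = [
    "Music30,2361, music service, COS, 20081230",
    "Film012,1234, electric English channel, COS, 20081030",
    "Music12,2363, music service, COS, 20080230",
    "Myboy01,2364, educate English service, COS, 20071230",
    "Music38,1361, music service, COS, 20051230",
]

# Hypothetical phone numbers that all begin with 189, so N is 3.
PHONE_KEYS = ["18912345678", "18923456789", "18912340000"]
```

Calling `extract_and_split(PRODUCT_LINES, 1, some_directory)` writes `1.txt` and `2.txt` as in the example. For the phone numbers, `common_prefix_len(PHONE_KEYS)` gives 3, so passing `n=3` splits on the first 4 characters.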
The data partitioning module is the core module of the system. Its main task is to partition the "keyword data files" generated by the data preprocessing module into small data files of fewer than 100,000 records each.
Partitioning condition: only "keyword data files" holding more than 100,000 records are partitioned.
Partitioning procedure: for a "keyword data file", let N be the length of its file name (excluding the extension). The data of the file is partitioned by the first (N+1) letters of each item, and each item is stored in the file named after its first (N+1) letters. This is repeated until every file among the "keyword data files" holds fewer than 100,000 records.
Partitioning result: the data in every "keyword data file" has that file's name as its prefix, and every "keyword data file" holds fewer than 100,000 records.
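The file-based loop of the data partitioning module might look like the following Python sketch. It is an illustration under assumptions, not the patented code: the function names are hypothetical, files are assumed to be named `<prefix>.txt` with one key per line, the limit is taken as "not greater than `preset`", and for brevity the keys of the file being split are grouped in memory before being written out, which a real low-memory implementation would avoid.

```python
import os
import tempfile
from collections import defaultdict


def count_lines(path):
    """Number of keys stored in a file (one key per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def resplit_directory(directory, preset=100_000):
    """Re-split every '<prefix>.txt' file holding more than `preset` keys."""
    pending = [name for name in os.listdir(directory) if name.endswith(".txt")]
    while pending:
        name = pending.pop()
        path = os.path.join(directory, name)
        prefix = name[: -len(".txt")]
        if count_lines(path) <= preset:
            continue  # already small enough
        new_len = len(prefix) + 1  # N is the file-name length; split on N+1
        buckets = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                key = line.rstrip("\n")
                buckets[key[:new_len]].append(key)
        if len(buckets) == 1 and prefix in buckets:
            continue  # every key equals the prefix; it cannot be split further
        os.remove(path)
        for new_prefix, keys in buckets.items():
            new_name = new_prefix + ".txt"
            with open(os.path.join(directory, new_name), "w", encoding="utf-8") as out:
                out.write("\n".join(keys) + "\n")
            pending.append(new_name)
    return sorted(os.listdir(directory))
```

Starting from `1.txt` (keys 1234, 1361) and `2.txt` (keys 2361, 2363, 2364) with a toy `preset=2`, the loop leaves `1.txt` untouched and replaces `2.txt` with `2361.txt`, `2363.txt` and `2364.txt`.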
Origin of the 100,000-record figure: 100,000 is the data-volume condition that decides whether partitioning continues, and it was obtained through repeated testing. A larger value increases the workload of the data duplicate-checking module; a smaller value increases the number of partitioning passes of the data partitioning module. Taking all factors into account, using 100,000 as the condition for continuing to partition gives the highest overall efficiency for duplicate checking of mass data.
Partitioning technique: when partitioning a "keyword data file" holding more than 10,000,000 records, the split can use the first (N+2) letters of each item, advancing 2 letters at a time. Tests show that this is more efficient.
The main task of the data duplicate-checking module is to check the "keyword data" generated by the data partitioning module for duplicates, obtain the "duplicate keyword data", and then traverse the raw data. If the keyword of a raw data item is present in the "duplicate keyword data", the item is duplicate data; otherwise it is normal data. Finally, a "duplicate-checking report" is produced from the result.
The data duplicate-checking module first checks the many "keyword data files" of fewer than 100,000 records one by one, and finally obtains the overall "duplicate keyword data". Because the data in each file all begins with that file's name, the duplicates found by checking each "keyword data file" together make up the duplicates in the raw data. Because each "keyword data file" holds fewer than 100,000 records, the efficiency of the check is assured.
If there is no "duplicate keyword data", the raw data contains no duplicates. In that case the raw data need not be traversed again, and the "duplicate-checking report" can be produced directly.
Fig. 6 shows an embodiment of the system of the present invention. Referring to Fig. 6, a system for duplicate checking of mass data comprises:
a keyword unit 601, configured to extract data keywords from the mass data, where a data keyword distinguishes the data item it belongs to from other data;
a partitioning unit 602, configured to partition the data keywords extracted by the keyword unit by their first N+M letters, and to place data keywords whose first N+M letters are identical in the same file, to obtain keyword data files; where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are nonnegative integers;
a duplicate-checking unit 603, configured to check the data in each keyword data file obtained by the partitioning unit for duplicates separately, to obtain a duplicate-checking result.
The partitioning unit may comprise:
a preprocessing subunit, configured to place data keywords whose first N+1 letters are identical in the same file to obtain preprocessed data files, each preprocessed data file being named after those first N+1 letters;
a search subunit, configured to find, among the preprocessed data files obtained by the preprocessing subunit, the files whose data volume exceeds a preset value; the files found become secondary-processing data files, and the remaining preprocessed data files become keyword data files;
a reprocessing subunit, configured to continue partitioning the data in each secondary-processing data file found by the search subunit, starting from the first N+1 letters and advancing P letters at a time, until the data volume of every resulting file is not greater than the preset value; the resulting files become keyword data files, where P is a natural number.
The preprocessing subunit and the reprocessing subunit may be provided separately or integrated together.
Further, the reprocessing subunit may comprise:
a judging module, configured to determine whether the data volume of a secondary-processing data file exceeds a threshold;
a partitioning module, configured to, when the judging module determines that the data volume of the secondary-processing data file exceeds the threshold, continue partitioning the data in that file, starting from the first N+1 letters and advancing 2 letters at a time, until the data volume of every resulting file is not greater than the preset value.
Further, the system may also comprise:
a result unit, configured to: when the duplicate-checking unit obtains duplicate keyword data from checking each keyword data file, traverse the mass data, treating a data item as duplicate data if its keyword is present in the duplicate keyword data and as normal data otherwise; and/or,
when checking each keyword data file yields no duplicate keyword data, treat all of the mass data as normal data.
The system of this embodiment may be provided independently or integrated into an ordinary PC.
With the technical solution of the present invention, mass data is partitioned by keyword into keyword data files of small data volume, and duplicate checking is then performed on those files. The demands on the running environment are therefore low: neither a very large memory nor a dedicated database is needed, and the solution can run on an ordinary PC. Implementing the solution costs little, and duplicate checking is fast, efficient and reliable. Because of the low environment requirements, duplicate checking can be done independently in a low-configuration environment, and the solution is easy to implement and to port.
The above is only one embodiment of the present invention. It should be noted that those skilled in the art can make improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also fall within the scope of protection of the present invention.

Claims (10)

1. A method for duplicate checking of mass data, characterized in that the method comprises:
extracting data keywords from the mass data, where a data keyword distinguishes the data item it belongs to from other data;
partitioning the data keywords by their first N+M characters, and placing data keywords whose first N+M characters are identical in the same file, to obtain keyword data files, the data volume of each keyword data file being not greater than a preset value; where the data keywords in different keyword data files have identical first N characters and different first N+M characters, and N and M are positive integers;
checking the data in each keyword data file for duplicates separately, to obtain a duplicate-checking result.
2. The method according to claim 1, characterized in that partitioning the data keywords by their first N+M characters and placing data keywords whose first N+M characters are identical in the same file to obtain keyword data files comprises:
placing data keywords whose first N+1 characters are identical in the same file to obtain preprocessed data files, each preprocessed data file being named after those first N+1 characters;
finding, among the preprocessed data files, the files whose data volume exceeds the preset value; the files found become secondary-processing data files, and the remaining preprocessed data files become keyword data files;
continuing to partition the data in each secondary-processing data file, starting from the first N+1 characters and advancing P characters at a time, until the data volume of every resulting file is not greater than the preset value; the resulting files become keyword data files, where P is a natural number.
3. The method according to claim 2, characterized in that the preset value is 100,000.
4. The method according to claim 2 or 3, characterized in that P is 2, and continuing to partition the data in each secondary-processing data file, starting from the first N+1 characters and advancing P characters at a time, until the data volume of every resulting file is not greater than the preset value comprises:
determining whether the data volume of the secondary-processing data file exceeds a threshold;
when the data volume of the secondary-processing data file exceeds the threshold, continuing to partition the data in that file, starting from the first N+1 characters and advancing 2 characters at a time, until the data volume of every resulting file is not greater than the preset value.
5. The method according to claim 4, characterized in that the threshold is 10,000,000.
6. The method according to claim 1, characterized in that the method further comprises:
when checking each keyword data file yields duplicate keyword data, traversing the mass data; if a data keyword in the mass data is present in the duplicate keyword data, the data item corresponding to that keyword is duplicate data; otherwise the data item is normal data; and/or,
when checking each keyword data file yields no duplicate keyword data, all of the mass data is normal data.
7. A system for duplicate checking of mass data, characterized in that the system comprises:
a keyword unit, configured to extract data keywords from the mass data, where a data keyword distinguishes the data item it belongs to from other data;
a partitioning unit, configured to partition the data keywords extracted by the keyword unit by their first N+M characters, and to place data keywords whose first N+M characters are identical in the same file, to obtain keyword data files, the data volume of each keyword data file being not greater than a preset value; where the data keywords in different keyword data files have identical first N characters and different first N+M characters, and N and M are positive integers;
a duplicate-checking unit, configured to check the data in each keyword data file obtained by the partitioning unit for duplicates separately, to obtain a duplicate-checking result.
8. The system according to claim 7, characterized in that the partitioning unit comprises:
a preprocessing subunit, configured to place data keywords whose first N+1 characters are identical in the same file to obtain preprocessed data files, each preprocessed data file being named after those first N+1 characters;
a search subunit, configured to find, among the preprocessed data files obtained by the preprocessing subunit, the files whose data volume exceeds the preset value; the files found become secondary-processing data files, and the remaining preprocessed data files become keyword data files;
a reprocessing subunit, configured to continue partitioning the data in each secondary-processing data file found by the search subunit, starting from the first N+1 characters and advancing P characters at a time, until the data volume of every resulting file is not greater than the preset value; the resulting files become keyword data files, where P is a natural number.
9. The system according to claim 8, characterized in that the reprocessing subunit comprises:
a judging module, configured to determine whether the data volume of the secondary-processing data file exceeds a threshold;
a partitioning module, configured to, when the judging module determines that the data volume of the secondary-processing data file exceeds the threshold, continue partitioning the data in that file, starting from the first N+1 characters and advancing 2 characters at a time, until the data volume of every resulting file is not greater than the preset value.
10. The system according to claim 7, characterized in that the system further comprises:
a result unit, configured to: when the duplicate-checking unit obtains duplicate keyword data from checking each keyword data file, traverse the mass data; if a data keyword in the mass data is present in the duplicate keyword data, the data item corresponding to that keyword is duplicate data; otherwise the data item is normal data; and/or,
when checking each keyword data file yields no duplicate keyword data, all of the mass data is normal data.
CN2009101085699A 2009-07-01 2009-07-01 Method for duplicate checking of mass data and system thereof Active CN101609466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009101085699A CN101609466B (en) 2009-07-01 2009-07-01 Method for duplicate checking of mass data and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009101085699A CN101609466B (en) 2009-07-01 2009-07-01 Method for duplicate checking of mass data and system thereof

Publications (2)

Publication Number Publication Date
CN101609466A CN101609466A (en) 2009-12-23
CN101609466B true CN101609466B (en) 2012-11-28

Family

ID=41483223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101085699A Active CN101609466B (en) 2009-07-01 2009-07-01 Method for duplicate checking of mass data and system thereof

Country Status (1)

Country Link
CN (1) CN101609466B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156703A (en) * 2011-01-24 2011-08-17 南开大学 Low-power consumption high-performance repeating data deleting system
CN104462527A (en) * 2014-12-22 2015-03-25 龙信数据(北京)有限公司 Data deduplication method and device
CN104778154A (en) * 2015-04-09 2015-07-15 天脉聚源(北京)教育科技有限公司 Partitioning method and device of txt text data
CN107844520A (en) * 2017-10-09 2018-03-27 平安科技(深圳)有限公司 Electronic installation, vehicle data introduction method and storage medium
CN110807119B (en) * 2018-07-19 2022-07-19 浙江宇视科技有限公司 Face duplicate checking method and device
CN115185981B (en) * 2022-09-14 2022-11-25 吉奥时空信息技术股份有限公司 Data duplication checking method and device considering super-large table

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154228A (en) * 2006-09-27 2008-04-02 西门子公司 Partitioned pattern matching method and device thereof
CN101159795A (en) * 2007-10-25 2008-04-09 中兴通讯股份有限公司 Calling list rearrangement method and device
CN101447886A (en) * 2007-11-26 2009-06-03 华为技术有限公司 Method for comparing mass data and device thereof

Also Published As

Publication number Publication date
CN101609466A (en) 2009-12-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160622

Address after: 210000 thousand people building, 7 Cui Lu, Jiangning Development Zone, Nanjing, Jiangsu

Patentee after: Liu Fei

Address before: 518057 Nanshan District science and Technology Park, Guangdong, South Road, ZTE building, science and Technology Park

Patentee before: ZTE Corporation

CB03 Change of inventor or designer information

Inventor after: Liu Fei

Inventor before: Niu Guoyang

COR Change of bibliographic data