CN101414309A - System for processing repeat arrangement of large scale data information - Google Patents

System for processing repeat arrangement of large scale data information

Info

Publication number
CN101414309A
Authority
CN
China
Prior art keywords
data
module
large scale
data information
high speed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008102034399A
Other languages
Chinese (zh)
Inventor
韩定一
周云庆
袁若石
薛贵荣
俞勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CNA2008102034399A
Publication of CN101414309A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a large-scale data information filtering (deduplication) system in the technical field of information processing, comprising an input module, a digital coding module, a multi-point check module, and an output module. The input module receives the original data to be filtered, namely the data instances. The digital coding module re-encodes the data instances obtained by the input module and compresses the data space to a space equal to or slightly larger than the scale of the actual data instances of the problem to be solved. The multi-point check module samples the new code several times, maps the samples to addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding addresses, thereby realizing the filtering function. The output module returns the filtered data to the user. The system is efficient and can process large volumes of data.

Description

System for processing repeat arrangement of large scale data information
Technical field
The present invention relates to a system in the technical field of information processing, and specifically to a system for deduplicating large-scale data information (the "repeat arrangement" processing named in the title).
Background technology
With the continuing development of information processing technology, many applications now need to check large volumes of data for duplicates. For example, a search engine must determine which web pages are already included in the system. Because the number of web pages on the Internet is enormous, a dedicated system is needed to check each newly discovered URL and see whether it has already been crawled and indexed. If it has, follow-up work to update the index may be needed; if it has not, follow-up work to build a new index entry may be needed. As another example, some areas of bioscience research analyze and compare large amounts of gene information, and must likewise determine whether a given piece of gene information has already been processed so that the appropriate follow-up work can be carried out. In telecommunications, the record data of dozens of services must also be checked for duplicate records to avoid overcharging.
The data in these applications share the following three characteristics.
First, the data space is very large. Take URLs as an example. A URL generally consists of digits, letters (case-sensitive), "-" and ".", and is generally no longer than 100 characters (in fact, an extended URL can contain almost any ASCII character and can reach 2000 characters). There may be 64^100 (about 10^180) such strings in total. An ordinary system cannot handle a data space this large.
Second, the actual data may not fill the whole data space; that is, the amount of data that can actually occur in a given task is far smaller than the data space. Taking URLs again, statistics from the China Internet Network Information Center put the number of websites in China on the order of 1 million, and the number of pages indexed by a commercial search engine is on the order of 10 billion. Compared with 10^180 possible URLs, 10 billion is a very small number: the two differ by a factor of 10^170. The data can be said to be very sparse.
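The two magnitudes quoted above can be checked with exact integer arithmetic. The following is a minimal sketch that uses only the figures from the text (64 symbols per character, 100 characters, 10 billion indexed pages); the variable names are illustrative and not part of the patent.

```python
# Size of the URL data space versus the amount of real data,
# using only the figures quoted in the text.
ALPHABET_SIZE = 64         # symbols per character, as assumed in the text
MAX_LENGTH = 100           # characters per URL
INDEXED_PAGES = 10 ** 10   # pages indexed by a commercial search engine

data_space = ALPHABET_SIZE ** MAX_LENGTH   # exact big integer, equal to 2**600

# The power of ten of a positive integer is its decimal digit count minus one.
space_magnitude = len(str(data_space)) - 1
sparsity_magnitude = len(str(data_space // INDEXED_PAGES)) - 1

print(f"data space is about 10^{space_magnitude}")
print(f"data space / real data is about 10^{sparsity_magnitude}")
```

Working with Python's arbitrary-precision integers avoids the rounding that a floating-point logarithm would introduce at this scale.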
Finally, even the space occupied by the actual data is difficult for current computer systems to handle. Taking 10 billion URLs with an assumed average length of 30 characters (4 bytes per character), storing these URLs requires about 1.2PB in total, roughly 1,000 times the capacity of a current mainstream hard disk (500GB-1TB). Answering, at millisecond level, a query on whether a URL already exists among 10 billion records is also almost impossible for a single-machine database system. Existing solutions therefore mostly adopt a distributed architecture that spreads storage, indexing, and computation load across hundreds or thousands of machines, which complete the task jointly by splitting it into smaller problems and processing them in parallel. Because this involves network communication and work synchronization among a large number of machines, the stability and reliability of such systems are not ideal.
A search of the prior-art literature found the Chinese patent application "Memory-based fast cross deduplication of massive call tickets" (publication number CN1897629), which proposes a process that combines a memory-based multi-level storage mechanism, index technology based on balanced binary trees and key trees, compression based on BCD coding and the RLC algorithm, and time-slice-based cross deduplication. On an IBM P650 (16 1.5GHz CPUs, 32GB RAM), processing 45,240,988 records took 4,467 seconds, about 20 times faster than the 86,669 seconds taken by the traditional database-based cross deduplication test. Nevertheless, it still uses a third storage level based on hard disk, and this mechanism still limits further improvement of system speed.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing a deduplication system for large-scale data information. It can efficiently process large-scale data that may contain duplicates and produce output free of duplicates, overcoming the efficiency loss caused by the general-purpose design of traditional systems.
The invention is realized through the following technical solution. The invention comprises four modules: an input module, a digital coding module, a multi-point check module, and an output module, wherein:
The input module receives the raw data to be deduplicated, namely the data instances;
The digital coding module re-encodes the data instances obtained by the input module and compresses the data space to a space equal to or slightly larger than the scale of the actual data instances of the problem to be solved;
The multi-point check module samples the new code several times, establishes a mapping to addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding addresses, thereby realizing the deduplication function;
The output module returns the deduplicated data to the user.
The input module receives the user's data instances, which may arrive as file input, network stream input, and so on; for example, it may receive the URLs discovered in web pages. The data instances need not arrive in any particular order: a data instance may appear multiple times, with other data instances interleaved between its occurrences.
The digital coding module performs a fast code conversion on each data instance, which can be implemented with a hash function such as MD5 or SHA-1, and compresses the data in the original space into a numerical space of k-bit 0/1 sequences. The number of coding bits should be chosen so that the code space is somewhat larger than the number of values in the final output. The value of k is usually an integer multiple of 16 or 32 and should be selected according to the actual problem. Commonly used values of k are 128 and 160.
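As an illustration of the coding step, the sketch below uses Python's standard `hashlib` module to turn a data instance into a k-bit integer, with k = 128 for MD5 and k = 160 for SHA-1. The function names `encode` and `code_bits` and the choice of UTF-8 are assumptions made for this example; the patent does not prescribe them.

```python
import hashlib


def encode(instance: str, algorithm: str = "md5") -> int:
    """Map a data instance to a k-bit integer code (hypothetical helper).

    The instance is encoded as UTF-8 before hashing, which is an
    assumption of this sketch.
    """
    digest = hashlib.new(algorithm, instance.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def code_bits(algorithm: str) -> int:
    """Return k, the number of bits in the code produced by the algorithm."""
    return hashlib.new(algorithm).digest_size * 8


if __name__ == "__main__":
    for name in ("md5", "sha1"):
        print(name, code_bits(name), "bits")
    print(f"{encode('http://www.a.com'):032x}")
```

Equal instances always receive equal codes, which is what allows the later check to recognize a repeated instance without storing the instance itself.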
The multi-point check module establishes a correspondence between the value generated by the digital coding module and addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding addresses. The correspondence may be multi-point or single-point: for example, several parts of the new code may each be mapped to an address in high-speed storage, so that one new code maps to several storage locations; or the new code may be mapped directly to a single address. The module can quickly query whether the flag bits at the particular addresses in the high-speed storage device are all set, and thereby judge whether the data instance has appeared before. In addition, when the problem to be handled is relatively small, the multi-point check can be reduced to a single-point check to further improve system performance.
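The two kinds of correspondence can be sketched as follows. This is an illustrative reading, not the patent's own code: `multi_point_addresses` cuts a code into equal-width parts, and `single_point_address` keeps only the low-order bits, which is one assumed way of mapping a code "directly" to an address.

```python
def multi_point_addresses(code: int, code_bits: int = 128,
                          address_bits: int = 32) -> list[int]:
    """Split a code into consecutive address-sized parts, most significant first."""
    if code_bits % address_bits != 0:
        raise ValueError("code_bits must be a multiple of address_bits")
    mask = (1 << address_bits) - 1
    parts = code_bits // address_bits
    return [(code >> (address_bits * (parts - 1 - i))) & mask
            for i in range(parts)]


def single_point_address(code: int, address_bits: int = 32) -> int:
    """Map a code to one address by keeping its low-order bits (an assumption)."""
    return code & ((1 << address_bits) - 1)


if __name__ == "__main__":
    code = 0x0123456789ABCDEF0123456789ABCDEF
    print([f"{a:08X}" for a in multi_point_addresses(code)])
    print(f"{single_point_address(code):08X}")
```

With the default parameters a 128-bit code yields four 32-bit addresses, while the single-point variant yields one.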
After the multi-point check module has judged duplicates, the output module collects the data instances that have not been repeated and finally returns them to the user.
The number of coding bits of the digital coding module and the number of address bits of the high-speed storage in the multi-point check module are two parameters that can be adjusted to suit the scale of the actual problem. The number of coding bits should usually be chosen so that the code space is somewhat larger than the number of values in the final output, so that the numeric codes are sufficiently distinguishable; the address length of the high-speed storage (for example, the addressing space of computer memory) determines the capacity of high-speed storage that is required.
The invention can handle large-scale data problems efficiently. Through fast code conversion the data space is compressed, and through the multi-point check all data are finally mapped into a contiguous address space in high-speed storage. On the one hand, the large compression of the data space correspondingly reduces the amount of information that must be stored, making it possible to record on a single machine whether every instance has appeared. On the other hand, the compression also greatly reduces the time needed to process each data item, so overall system efficiency improves markedly.
Description of drawings
Fig. 1 is a system architecture diagram of the present invention.
Fig. 2 is a workflow diagram of the present invention.
Embodiment
The embodiment of the invention is described in detail below with reference to the drawings. The embodiment is implemented on the basis of the technical solution of the invention and gives a detailed implementation, but the scope of protection of the invention is not limited to the following embodiment.
Take the problem of judging whether a URL is a duplicate as an example. The web pages collected by a search engine's crawler contain a large amount of URL information, and the same URL may be pointed to many times. It is therefore necessary to deduplicate this URL information so that the search engine does not index the same page repeatedly.
As shown in Fig. 1, this embodiment comprises four modules: an input module, a digital coding module, a multi-point check module, and an output module. The modules communicate with one another through the various data lines or the motherboard of the computer, wherein:
The input module receives the raw data to be deduplicated, namely the data instances;
The digital coding module re-encodes the data instances obtained by the input module and compresses the data space to a space equal to or slightly larger than the scale of the corresponding problem;
The multi-point check module samples the new code several times, establishes a mapping to addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding addresses, thereby realizing the deduplication function;
The output module returns the deduplicated data to the user.
As shown in Fig. 2, when this embodiment starts working, the input module first receives the raw data that the user needs deduplicated. The digital coding module then re-encodes each data instance, compressing the data space to a space equal to or slightly larger than the scale of the corresponding problem. The multi-point check module samples the new code several times, maps the samples to addresses in high-speed storage, and uses the flag bits at those addresses to record whether the data instance has appeared. If it has, the instance is filtered out; if it has not, it is returned to the user by the output module. This completes the deduplication work.
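The workflow above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions rather than the patented implementation: the function name `deduplicate`, the use of a Python `bytearray` as the flag store, and the default `address_bits=20` (a 128 KiB demo array instead of the embodiment's 2^32 bits) are all choices made for this example.

```python
import hashlib
from typing import Iterable, Iterator


def deduplicate(instances: Iterable[str], address_bits: int = 20) -> Iterator[str]:
    """Yield each instance the first time its flag bits are not all set.

    address_bits must be between 3 and 32; 32 reproduces the 2**32-bit
    flag array of the embodiment and allocates 512 MiB.
    """
    flags = bytearray((1 << address_bits) // 8)   # all flag bits start at 0
    mask = (1 << address_bits) - 1
    for instance in instances:                    # input module
        # Digital coding module: 128-bit MD5 code of the instance.
        digest = hashlib.md5(instance.encode("utf-8")).digest()
        # Multi-point check module: four 32-bit samples of the code, each
        # reduced to address_bits so the demo array stays small.
        points = [int.from_bytes(digest[i:i + 4], "big") & mask
                  for i in range(0, 16, 4)]
        seen = all(flags[p >> 3] & (1 << (p & 7)) for p in points)
        for p in points:
            flags[p >> 3] |= 1 << (p & 7)         # set the flag bits to 1
        if not seen:
            yield instance                        # output module


if __name__ == "__main__":
    urls = ["http://www.a.com", "http://www.b.com", "http://www.a.com"]
    print(list(deduplicate(urls)))
```

Because only flag bits are stored, two different instances could in principle map to addresses that are all already set and the later one would then be dropped; the text does not discuss this case, and checking several points rather than one makes it less likely.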
In this embodiment, the input module may be an input device (such as a computer hard disk); it may also consist of other devices such as a network card. It mainly implements batch reading of data and receives the data to be deduplicated. In this embodiment, the data are obtained directly from the input device.
In this embodiment, the digital coding module is a software program that mainly occupies the central processing unit (CPU). It uses the MD5 hash algorithm to convert each URL quickly into a 128-bit 0/1 string, compressing the URL data space from 10^180 to 2^128 (about 10^38). This greatly reduces the data space while leaving some margin, ensuring that for the foreseeable future the number of URLs will not exceed the total amount of data this space can hold.
In this embodiment, the multi-point check module is a software program that mainly occupies computer memory.
The multi-point check module splits the 128-bit MD5 value of each URL into four 32-bit 0/1 strings, which further reduces the data space to 2^32 (about 4.3 billion) bits. 512MB of memory is enough to establish a one-to-one mapping.
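The 512MB figure follows from storing one flag bit per address. A short sketch of that arithmetic is given below; the helper name `flag_memory_bytes` is hypothetical.

```python
def flag_memory_bytes(address_bits: int) -> int:
    """Bytes needed to hold one flag bit for every address of the given width."""
    return (1 << address_bits) // 8   # 8 flag bits fit in one byte


CODE_BITS = 128      # width of an MD5 value
ADDRESS_BITS = 32    # width of each part used as an address

parts_per_code = CODE_BITS // ADDRESS_BITS
flag_bits = 1 << ADDRESS_BITS
memory_mib = flag_memory_bytes(ADDRESS_BITS) // 2 ** 20

print(parts_per_code, "addresses per URL")
print(flag_bits, "flag bits")
print(memory_mib, "MiB of memory")
```

Each additional address bit doubles the memory required, which is why the address length is treated as a parameter to be tuned to the problem scale.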
For example, take the URL http://www.a.com. Its MD5 value is 1101 0001 0001 1011 0111 0000 0111 0001 0101 1010 1110 0110 1111 1111 0110 0101 1001 0100 0110 1000 0000 0110 1110 1111 1100 1100 1110 1110 1000 0011 1000 1001.
After decomposition, this becomes 1101 0001 0001 1011 0111 0000 0111 0001 (D11B7071H), 0101 1010 1110 0110 1111 1111 0110 0101 (5AE6FF65H), 1001 0100 0110 1000 0000 0110 1110 1111 (946806EFH) and 1100 1100 1110 1110 1000 0011 1000 1001 (CCEE8389H).
The flag bits at addresses D11B7071H, 5AE6FF65H, 946806EFH and CCEE8389H in memory can therefore be used to mark whether this URL has appeared. When the URL appears for the first time, the data at these 4 memory addresses are set to 1. When the URL appears again later, finding that the data at all 4 addresses are 1 is enough to judge that the URL has already appeared.
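The worked example can be replayed directly from the digest quoted in the text. In the sketch below the digest is taken as given rather than recomputed, and a Python `set` stands in for the flag bits (an address in the set means its bit is 1); both the stand-in and the helper name `seen_before` are assumptions of this example.

```python
# MD5 value quoted in the text, written as 32 hexadecimal digits.
DIGEST_HEX = "D11B70715AE6FF65946806EFCCEE8389"

# Cut the 128-bit value into four 32-bit addresses (8 hex digits each).
points = [int(DIGEST_HEX[i:i + 8], 16) for i in range(0, 32, 8)]

flags = set()   # stand-in for the flag bits; membership means the bit is 1


def seen_before(addresses):
    """Return True if every flag bit was already 1, then set them all to 1."""
    already = all(a in flags for a in addresses)
    flags.update(addresses)
    return already


first = seen_before(points)    # first occurrence of the URL
second = seen_before(points)   # the same URL arriving again

print([f"{p:08X}H" for p in points])
print(first, second)
```

The first call reports that the URL is new and marks its four addresses; the second call finds all four marked and reports a repeat.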
In this embodiment, the output module is an output device (such as a computer hard disk).
Running on a PC with an AMD 4400+ dual-core CPU, 8GB DDR2 memory, an AMD 780G motherboard, and a 1TB hard disk, this embodiment completes the deduplication of 17 million URLs in roughly a few tens of seconds.

Claims (8)

1. A deduplication system for large-scale data information, characterized in that it comprises four modules: an input module, a digital coding module, a multi-point check module, and an output module, wherein:
the input module receives the raw data to be deduplicated, namely the data instances;
the digital coding module re-encodes the data instances obtained by the input module and compresses the data space to a space equal to or slightly larger than the scale of the actual data instances of the problem to be solved;
the multi-point check module samples the new code several times, establishes a mapping to addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding addresses, thereby realizing deduplication;
the output module returns the deduplicated data to the user.
2. The deduplication system for large-scale data information according to claim 1, characterized in that the input module receives the user's data instances, the data instances being file input or network stream input; the data instances have no specific ordering requirement, and a data instance may appear multiple times when received, with other data instances interleaved between its occurrences.
3. The deduplication system for large-scale data information according to claim 1, characterized in that the digital coding module uses a hash function to perform a fast code conversion of the data instances and compresses the data in the original space into values of k-bit 0/1 sequences; the number of coding bits of the digital coding module must be large enough that the code space exceeds the number of values in the final output, to ensure that the numeric codes are sufficiently distinguishable.
4. The deduplication system for large-scale data information according to claim 3, characterized in that the value of k is an integer multiple of 16 or 32.
5. The deduplication system for large-scale data information according to claim 4, characterized in that the value of k is 128 or 160.
6. The deduplication system for large-scale data information according to claim 1, characterized in that the multi-point check module establishes a multi-point correspondence between the value generated by the digital coding module and addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding addresses; it can quickly query whether the particular address positions in the high-speed storage device are all marked, and thereby judge whether the data has appeared.
7. The deduplication system for large-scale data information according to claim 1 or 6, characterized in that the multi-point check module can be reduced to a single-point check when handling a problem of relatively small scale.
8. The deduplication system for large-scale data information according to claim 1, characterized in that the output module finally returns to the user the data instances that the multi-point check module has judged not to be repeated.
CNA2008102034399A 2008-11-27 2008-11-27 System for processing repeat arrangement of large scale data information Pending CN101414309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008102034399A CN101414309A (en) 2008-11-27 2008-11-27 System for processing repeat arrangement of large scale data information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008102034399A CN101414309A (en) 2008-11-27 2008-11-27 System for processing repeat arrangement of large scale data information

Publications (1)

Publication Number Publication Date
CN101414309A true CN101414309A (en) 2009-04-22

Family

ID=40594843

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008102034399A Pending CN101414309A (en) 2008-11-27 2008-11-27 System for processing repeat arrangement of large scale data information

Country Status (1)

Country Link
CN (1) CN101414309A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101859316A (en) * 2010-04-29 2010-10-13 北京无限立通通讯技术有限责任公司 Method and device for mass file access
CN101859316B (en) * 2010-04-29 2012-07-11 北京无限立通通讯技术有限责任公司 Method and device for mass file access
CN106960052A (en) * 2017-03-31 2017-07-18 深圳微众税银信息服务有限公司 A kind of collage-credit data acquisition method and system
CN106960052B (en) * 2017-03-31 2020-09-15 深圳微众信用科技股份有限公司 Credit investigation data acquisition method and system

Similar Documents

Publication Publication Date Title
CN100485689C (en) Data speedup query method based on file system caching
JP6388655B2 (en) Generation of multi-column index of relational database by data bit interleaving for selectivity
EP2443564B1 (en) Data compression for reducing storage requirements in a database system
EP2946333B1 (en) Efficient query processing using histograms in a columnar database
CN110297879B (en) Method, device and storage medium for data deduplication based on big data
CN103714096A (en) Lucene-based inverted index system construction method and device, and Lucene-based inverted index system data processing method and device
CN112100219B (en) Report generation method, device, equipment and medium based on database query processing
CN103345484A (en) Report form processing system based on dynamic domain and method
CN109471893B (en) Network data query method, equipment and computer readable storage medium
CN107729406A (en) A kind of data classification storage method and device
CN106649368A (en) Data storage method and device and data query method and device
CN109901978A (en) A kind of Hadoop log lossless compression method and system
CN101414309A (en) System for processing repeat arrangement of large scale data information
CN112307004B (en) Data management method, device, equipment and storage medium
CN113157853A (en) Problem mining method and device, electronic equipment and storage medium
EP3097644A1 (en) Optimized data condenser and method
CN110399396B (en) Efficient data processing
CN112328641B (en) Multi-dimensional data aggregation method and device and computer equipment
CN109902851B (en) Method and device for determining production plan
CN113722296A (en) Agricultural information processing method and device, electronic equipment and storage medium
CN106528718A (en) Method and device for processing data from third party
CN112328960B (en) Optimization method and device for data operation, electronic equipment and storage medium
US20220318284A1 (en) Systems and methods for query term analytics
CN114442911B (en) System and method for asynchronous input/output scanning and aggregation for solid state drives
US20240134779A1 (en) System and method for automated test case generation based on queuing curve analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090422