CN101414309A

CN101414309A - System for processing repeat arrangement of large scale data information

Info

Publication number: CN101414309A
Application number: CNA2008102034399A
Authority: CN
Inventors: 韩定一; 周云庆; 袁若石; 薛贵荣; 俞勇
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2008-11-27
Filing date: 2008-11-27
Publication date: 2009-04-22

Abstract

The invention relates to a system for filtering (deduplicating) large-scale data information, in the technical field of information processing. The system comprises an input module, a digital coding module, a multi-point detecting module and an output module. The input module receives the original data to be filtered, namely the data instances. The digital coding module re-encodes the data instances obtained by the input module and compresses the data space to a space equal to or slightly larger than the scale of the actual data instances of the problem to be solved. The multi-point detecting module samples the new code several times, maps the samples to addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding addresses, thereby realizing the filtering function. The output module returns the filtered data to the user. The system is efficient and can process large volumes of data.

Description

System for processing repeat arrangement of large scale data information

Technical field

The present invention relates to a system in the technical field of information processing, and specifically to a system for processing repeat arrangement (that is, deduplication) of large-scale data information.

Background technology

With the continuous development of information processing technology, many applications have emerged that need to check massive amounts of data for duplicates. For example, a search engine system needs to determine which web pages it has already included. Because the number of web pages on the Internet is enormous, a dedicated system is needed to check each newly discovered URL and see whether it has already been crawled and indexed. If it has, follow-up work to update the index may be needed; if it has not, follow-up work to build a new index entry may be carried out. As another example, some fields of bioscience research need to analyze and compare large amounts of genetic information, and also need to determine whether a piece of genetic information has already been processed so that the appropriate follow-up work can be carried out. In telecommunication services, the record data of dozens of kinds of telecommunication services likewise has to be checked for duplicate records in order to avoid overcharging.

The data in these applications share the following three characteristics.

First, the data space is very large. Take URLs as an example. In general, a URL is made up of digits, letters (case-sensitive), "-" and ".", and usually does not exceed 100 characters (in fact, an extended URL can be made up of nearly all ASCII characters and can reach 2,000 characters). There may be 64¹⁰⁰ (about 10¹⁸⁰) such data items in total. An ordinary system cannot handle so huge a data space.

Second, the real data may not fill the whole data space, or the amount of data that can actually occur in a task does not fill the whole data space. Again taking URLs as an example, according to statistics from the China Internet Network Information Center, the number of websites in China is on the order of 1,000,000, and the number of pages indexed by a commercial search engine is roughly on the order of 10,000,000,000. Compared with 10¹⁸⁰ possible URLs, 10,000,000,000 is actually a very small number: the two differ by a factor of 10¹⁷⁰. The data can be said to be very sparse.
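The orders of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, an alphabet of 64 symbols (10 digits, 52 case-sensitive letters, "-" and ".") and strings of 100 characters. The variable names are illustrative only and do not come from the patent.

```python
import math

ALPHABET_SIZE = 10 + 26 + 26 + 2  # digits, lower case, upper case, "-" and "."
URL_LENGTH = 100                  # length bound used in the text
INDEXED_PAGES = 10_000_000_000    # order of magnitude quoted for a commercial index

# Number of distinct 100-character strings over the 64-symbol alphabet.
space_size = ALPHABET_SIZE ** URL_LENGTH

# Decimal exponent of the space size: log10(64^100) = 100 * log10(64).
space_exponent = URL_LENGTH * math.log10(ALPHABET_SIZE)

# Gap, in orders of magnitude, between the possible data and the actual data.
sparsity_exponent = space_exponent - math.log10(INDEXED_PAGES)

print(f"data space   ~ 10^{math.floor(space_exponent)}")
print(f"sparsity gap ~ 10^{math.floor(sparsity_exponent)}")
```

Since 64 = 2⁶, the same space can also be written as 2⁶⁰⁰, which makes the later compression to a 128-bit code easier to compare.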

Finally, the space occupied by the real data is nevertheless hard for present computer systems to handle. Taking 10,000,000,000 URLs and assuming an average URL length of 30 characters (4 bytes per character), storing these URLs needs about 1.2PB of space in total, about 1,000 times the capacity of a current mainstream hard drive (500GB-1TB). Moreover, with 10,000,000,000 records, responding quickly (at the millisecond level) to a query on whether a URL already exists is almost impossible for a single-machine database system. Existing solutions therefore mostly adopt a distributed architecture in which the storage, index and computational load are spread over hundreds or thousands of machines: the task is partitioned into a number of small-scale problems that are processed in parallel and completed jointly. Because this involves network communication and work synchronization among a large number of machines, the stability and reliability of such systems are not ideal.

A literature search of the prior art found that the Chinese patent application "Memory-based fast cross-deduplication of massive call detail records" (publication number CN1897629) proposes a method that combines a memory-based multi-level storage mechanism, index technology based on balanced binary trees and tries, compression based on BCD codes and the RLC algorithm, and time-slice-based cross-deduplication. On an IBM P650 (16 1.5GHz CPUs, 32GB RAM), processing 45,240,988 records took 4,467 seconds, about 20 times faster than the 86,669 seconds taken in a test of traditional database-based cross-deduplication. Nevertheless, it still uses a third-level storage mechanism based on hard disk, and this mechanism still limits any further increase in system speed.

Summary of the invention

The object of the invention is to address the deficiencies of the prior art by providing a system for processing repeat arrangement (deduplication) of large-scale data information. The system can efficiently process large-scale data that may contain duplicates, its output contains no duplicate data, and it overcomes the loss of efficiency caused by the general-purpose design of traditional systems.

The invention is achieved by the following technical solution. The invention comprises four modules: an input module, a digital coding module, a multi-point detecting module and an output module, wherein:

the input module receives the original data to be deduplicated, namely the data instances;

the digital coding module re-encodes the data instances obtained by the input module and compresses the data space to a space equal to or slightly larger than the scale of the actual data instances of the problem to be solved;

the multi-point detecting module samples the new code several times, maps the samples to addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding high-speed storage addresses, thereby realizing the deduplication function;

the output module returns the deduplicated data to the user.

The input module is responsible for receiving the user's data instances, which may come from file input, network stream input and so on, for example the various URLs found in web pages. The data instances have no specific ordering requirement; a data instance may be received several times, with other data instances interleaved between its occurrences.

The digital coding module performs a fast code conversion of each data instance, which can be implemented with hash functions such as MD5 or SHA-1, and compresses the data in the original space into a numerical space of k-bit sequences of 0s and 1s. The number of coding bits should be slightly more than is needed to cover the number of values finally output. The value of k is usually an integer multiple of 16 or 32 and has to be selected to suit the practical problem. Commonly used values of k are 128 and 160.
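As a minimal sketch of this coding step, the following uses Python's standard `hashlib` module to turn a data instance into a k-bit integer, with k = 128 for MD5 and k = 160 for SHA-1. The function names `encode` and `code_bits` are hypothetical, and encoding the instance as UTF-8 before hashing is an assumption; the patent does not say how a data instance is serialized.

```python
import hashlib

def encode(instance: str, algorithm: str = "md5") -> int:
    """Return the hash of a data instance as a k-bit unsigned integer."""
    # Assumption: instances are text and are serialized as UTF-8 before hashing.
    digest = hashlib.new(algorithm, instance.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

def code_bits(algorithm: str = "md5") -> int:
    """Return k, the number of bits in the codes produced by the algorithm."""
    return hashlib.new(algorithm).digest_size * 8

for name in ("md5", "sha1"):
    k = code_bits(name)
    code = encode("http://www.a.com", name)
    # Print the algorithm, its code width k, and the code as zero-padded hex.
    print(name, k, f"{code:0{k // 4}x}")
```

Whatever the length of the input, the code always fits in k bits, which is what compresses the original data space into a fixed numerical space.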

The multi-point detecting module is the module that establishes a correspondence between the values generated by the digital coding module and addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding high-speed storage addresses. The correspondence may be multi-point or single-point: in the multi-point case, several parts of the new code are each mapped to a high-speed storage address, so that one new code is mapped to several storage locations; in the single-point case, the new code is mapped directly to one high-speed storage address. The module can quickly query whether the positions at the particular addresses in the high-speed storage device are all marked, and thereby judge whether the data has appeared before. In addition, when the problem to be handled is relatively small in scale, the multi-point check can be reduced to a single-point check to further improve system performance.
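The difference between the multi-point and single-point correspondences can be sketched as follows. Cutting the code into equal, contiguous parts is only one possible way of sampling it (it happens to match the embodiment described later), and the function name `split_code` and the small 16-bit example values are assumptions made for illustration.

```python
def split_code(code: int, code_bits: int, address_bits: int) -> list[int]:
    """Split a code_bits-wide code into address_bits-wide addresses, most significant part first."""
    if code_bits % address_bits != 0:
        raise ValueError("code_bits must be a multiple of address_bits")
    parts = code_bits // address_bits
    mask = (1 << address_bits) - 1
    return [(code >> (address_bits * (parts - 1 - i))) & mask for i in range(parts)]

# Multi-point correspondence: one 16-bit code gives two 8-bit addresses.
multi_point = split_code(0xABCD, code_bits=16, address_bits=8)

# Single-point correspondence: the whole code is used directly as one address.
single_point = split_code(0xABCD, code_bits=16, address_bits=16)

print([hex(a) for a in multi_point], [hex(a) for a in single_point])
```

With the multi-point form each address is short, so the flag storage stays small; with the single-point form only one flag is touched per instance, but the storage must span the whole code width.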

After the multi-point detecting module has judged duplicates, the output module collates the data instances that are not duplicates and finally returns them to the user.

The number of coding bits of the digital coding module and the number of high-speed storage address bits of the multi-point detecting module are two parameters that can be adjusted to suit the scale of the practical problem. The number of coding bits should usually be slightly more than is needed to cover the number of values finally output, to ensure that the numerical codes are sufficiently distinguishable; the address length of the high-speed storage (such as the addressing space of computer memory) determines the capacity of the high-speed storage that has to be used.

The invention can handle large-scale data problems efficiently, because the data space is compressed by the fast code conversion and, through the multi-point check, all the data are finally mapped into the contiguous address space of the high-speed storage. On the one hand, the large compression of the data space correspondingly reduces the amount of information that has to be stored, which makes it possible to store on a single machine the information on whether each instance has appeared. On the other hand, the compression also greatly reduces the time needed to process each data item, so overall system efficiency is greatly improved.

Description of drawings

Fig. 1 is a system architecture diagram of the present invention.

Fig. 2 is a workflow diagram of the present invention.

Embodiment

The embodiments of the invention are described in detail below with reference to the accompanying drawings. This embodiment is implemented on the premise of the technical solution of the invention and gives a detailed implementation, but the protection scope of the invention is not limited to the following embodiment.

Take the problem of judging whether a URL is a duplicate as an example. The web pages collected by a search engine's crawler contain a large amount of URL information, and the same URL may be pointed to many times. It is therefore necessary to deduplicate this URL information so that the search engine does not index the same web page repeatedly.

As shown in Fig. 1, this embodiment comprises four modules: an input module, a digital coding module, a multi-point detecting module and an output module. The modules communicate with one another through the various data lines or the motherboard of the computer, wherein:

the input module receives the original data to be deduplicated, namely the data instances;

the digital coding module re-encodes the data instances obtained by the input module and compresses the data space to a space equal to or slightly larger than the scale of the corresponding problem;

the output module returns the deduplicated data to the user.

As shown in Fig. 2, when this embodiment starts working, the input module first receives the original data that the user needs deduplicated. The digital coding module then re-encodes each data instance, compressing the data space to a space equal to or slightly larger than the scale of the corresponding problem. The multi-point detecting module samples the new code several times, maps the samples to addresses in high-speed storage, and uses the flag bits at the corresponding addresses to record whether the data instance has appeared. If it has appeared, it is filtered out; if not, it is returned to the user by the output module. This completes the deduplication work.
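The workflow above can be sketched end to end in a few lines. This is an illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, instances are assumed to be UTF-8 text, and a Python `set` of addresses stands in for the flag bits in high-speed storage so that the sketch runs without allocating a large block of memory.

```python
import hashlib
from typing import Iterable, Iterator

def addresses(instance: str) -> list[int]:
    """Digital coding plus sampling: MD5, then four 32-bit addresses."""
    digest = hashlib.md5(instance.encode("utf-8")).digest()
    return [int.from_bytes(digest[i:i + 4], "big") for i in range(0, 16, 4)]

def deduplicate(instances: Iterable[str]) -> Iterator[str]:
    """Yield each instance the first time its flags are not all set."""
    flagged: set[int] = set()  # stand-in for the flag bits in high-speed storage
    for instance in instances:
        points = addresses(instance)
        if all(point in flagged for point in points):
            continue            # every flag already set: treat as a duplicate and filter it out
        flagged.update(points)  # first occurrence: set the flags
        yield instance          # and pass the instance on to the output

urls = ["http://www.a.com", "http://www.b.com", "http://www.a.com"]
unique_urls = list(deduplicate(urls))
print(unique_urls)
```

The input order is preserved and the input may be a stream, since each instance is judged as it arrives and nothing but the flags is kept.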

In this embodiment, the input module can be an input device (such as a computer hard disk); it can also consist of other devices such as a network card. It mainly implements batch reading of data and is used to receive the data to be deduplicated. In this embodiment, the data are obtained directly from the input device.

In this embodiment, the digital coding module is a software program that mainly occupies the central processing unit (CPU). It uses the MD5 hash algorithm to convert each URL rapidly into a 128-bit string of 0s and 1s, compressing the data space of URLs from 10¹⁸⁰ to 2¹²⁸ (about 10³⁸). This both greatly reduces the data space and leaves a margin, ensuring that for the foreseeable future the number of URLs will not exceed the total amount of data the space can hold.

In this embodiment, the multi-point detecting module is a software program that mainly occupies computer memory.

The multi-point detecting module decomposes the 128-bit MD5 value of each URL into four 32-bit strings of 0s and 1s, so that the data space is further reduced to 2³² (about 4,300,000,000) positions. A one-to-one mapping can then be established using just 512MB of memory.
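The 512MB figure follows from storing one flag bit per address: 2³² bits are 2²⁹ bytes. A minimal sketch of such a flag memory is given below, assuming a flat `bytearray` with eight flags per byte. The class name `FlagMemory` and its methods are hypothetical, and the demonstration uses 16-bit addresses so that it allocates only 8 KB.

```python
class FlagMemory:
    """A flat array of one-bit flags addressed by an integer of address_bits bits (at least 3)."""

    def __init__(self, address_bits: int) -> None:
        self.size_bytes = (1 << address_bits) // 8
        self.flags = bytearray(self.size_bytes)  # all flags start at 0

    def set(self, address: int) -> None:
        # Byte index is address // 8; bit index within the byte is address % 8.
        self.flags[address >> 3] |= 1 << (address & 7)

    def is_set(self, address: int) -> bool:
        return bool(self.flags[address >> 3] & (1 << (address & 7)))

def required_bytes(address_bits: int) -> int:
    """Bytes needed for one flag bit per address."""
    return (1 << address_bits) // 8

# 32-bit addresses need 2^32 bits = 2^29 bytes.
megabytes_for_32_bits = required_bytes(32) // (1024 * 1024)

# A 16-bit memory keeps the demonstration small.
memory = FlagMemory(address_bits=16)
memory.set(0x7071)
print(megabytes_for_32_bits, memory.is_set(0x7071), memory.is_set(0x7072))
```

Setting or testing a flag is a single indexed byte access, which is why the check can be answered from memory without any disk lookup.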

For example, take the URL http://www.a.com. Its MD5 value is 1101 0001 0001 1011 0111 0000 0111 0001 0101 1010 1110 0110 1111 1111 0110 0101 1001 0100 0110 1000 0000 0110 1110 1111 1100 1100 1110 1110 1000 0011 1000 1001.

After decomposition, it becomes 1101 0001 0001 1011 0111 0000 0111 0001 (D11B7071H), 0101 1010 1110 0110 1111 1111 0110 0101 (5AE6FF65H), 1001 0100 0110 1000 0000 0110 1110 1111 (946806EFH) and 1100 1100 1110 1110 1000 0011 1000 1001 (CCEE8389H).

The positions D11B7071H, 5AE6FF65H, 946806EFH and CCEE8389H in memory can therefore be used to mark whether this URL has appeared. When the URL appears for the first time, the data at these 4 memory addresses are set to 1. When the URL appears again later, it can be judged to have already appeared as soon as the data at all 4 addresses are found to be 1.
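The worked example can be replayed directly from the quoted value. The sketch below takes the 128-bit value exactly as given in the text (it does not recompute the MD5), cuts it into four 32-bit addresses, and applies the mark-then-check rule twice. The helper name `seen_before` is hypothetical, and a `set` again stands in for the memory positions whose data is 1.

```python
MD5_HEX = "D11B70715AE6FF65946806EFCCEE8389"  # value quoted in the text for http://www.a.com

# Every 8 hexadecimal digits form one 32-bit address.
ADDRESSES = [int(MD5_HEX[i:i + 8], 16) for i in range(0, 32, 8)]

flagged = set()  # stand-in for the memory positions whose data is 1

def seen_before(addresses, flagged):
    """Return True if all flags were already 1; otherwise set them and return False."""
    if all(address in flagged for address in addresses):
        return True
    flagged.update(addresses)
    return False

first = seen_before(ADDRESSES, flagged)   # first occurrence of the URL
second = seen_before(ADDRESSES, flagged)  # the same URL appearing again

print([f"{a:08X}H" for a in ADDRESSES], first, second)
```

The first call reports the URL as new and sets its four flags; the second call finds all four flags set and reports a repeat.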

In the present embodiment, output module is an output device (such as hard disc of computer).

In this embodiment, deduplicating 17,000,000 URLs on a PC with an AMD 4400+ dual-core CPU, 8GB of DDR2 memory, an AMD 780G motherboard and a 1TB hard disk takes approximately tens of seconds.

Claims

1. A system for processing repeat arrangement (deduplication) of large-scale data information, characterized in that it comprises four modules: an input module, a digital coding module, a multi-point detecting module and an output module, wherein:

the input module receives the original data to be deduplicated, namely the data instances;

the multi-point detecting module samples the new code several times, maps the samples to addresses in high-speed storage, and records whether a given data instance has appeared by controlling the flag bits at the corresponding high-speed storage addresses, thereby realizing deduplication;

the output module returns the deduplicated data to the user.

2. The system for processing repeat arrangement of large-scale data information according to claim 1, characterized in that the input module is responsible for receiving the user's data instances; the data instances come from file input or network stream input and have no specific ordering requirement; a data instance may be received several times, with other data instances interleaved between its occurrences.

3. The system for processing repeat arrangement of large-scale data information according to claim 1, characterized in that the digital coding module uses a hash function to perform the fast code conversion of the data instances and compresses the data in the original space into values of k-bit sequences of 0s and 1s, and the number of coding bits of the digital coding module must be more than is needed to cover the number of values finally output, to ensure that the numerical codes are sufficiently distinguishable.

4. The system for processing repeat arrangement of large-scale data information according to claim 3, characterized in that the value of k is an integer multiple of 16 or 32.

5. The system for processing repeat arrangement of large-scale data information according to claim 4, characterized in that the value of k is 128 or 160.

6. The system for processing repeat arrangement of large-scale data information according to claim 1, characterized in that the multi-point detecting module establishes a multi-point correspondence between the values generated by the digital coding module and addresses in high-speed storage, records whether a given data instance has appeared by controlling the flag bits at the corresponding high-speed storage addresses, and can quickly query whether the positions at the particular addresses in the high-speed storage device are all marked in order to judge whether the data has appeared.

7. The system for processing repeat arrangement of large-scale data information according to claim 1 or 6, characterized in that the multi-point detecting module can be reduced to a single-point check when handling problems of relatively small scale.

8. The system for processing repeat arrangement of large-scale data information according to claim 1, characterized in that the output module finally returns to the user the data instances that the multi-point detecting module has judged not to be duplicates.