CN110147357A - Multi-source data aggregation sampling method and system in a big data environment - Google Patents
- Publication number
- CN110147357A (application CN201910373940.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- source
- entity
- module
- sampling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Abstract
The invention belongs to the technical field of big data and discloses a multi-source data aggregation sampling method and system in a big data environment. Multiple original data sources are collected, each comprising a data source name and at least one associated domain; the collected data sources are cleaned, identified, and stripped of redundancy; an original policy list is obtained from the original data sources using a construction program, and the original policies in the list are sorted to form an inter-source policy list; data sets from different sources are fused using a fusion program; the fused files are segmented to form a two-dimensional word-frequency matrix of the files; a balance verification value is set, and snowball sampling is performed iteratively on each word; the collected multi-source data are shown on a display. In the present invention, the preprocessing module schedules compute nodes through Spark to complete distributed computing, achieving more efficient data preprocessing; the method is practical and widely applicable.
Description
Technical field
The invention belongs to the technical field of big data, and more particularly relates to a multi-source data aggregation sampling method and system in a big data environment.
Background art
Multi-source data fusion technology refers to integrating all of the information obtained through investigation and analysis by appropriate means, evaluating the information in a unified way, and finally obtaining unified information. The aim of this research is to integrate diverse data, draw on the characteristics of the different data sources, and extract from them information that is unified, better, and richer than any single source. However, in existing multi-source data aggregation sampling under big data environments, preprocessing research on structured, semi-structured, and unstructured data is insufficient; systems usually contain only a data acquisition module and a data cleaning module, and the cleaning methods are simplistic and cannot satisfy user needs well. Meanwhile, during data fusion no open linked data sets are used as prior knowledge, so large-scale heterogeneous data sources cannot be fused efficiently and accurately while reducing complexity.
In summary, the problems with the prior art are:
In existing multi-source data aggregation sampling under big data environments, preprocessing of structured, semi-structured, and unstructured data is insufficiently studied; systems usually include only a data acquisition module and a data cleaning module, and the cleaning methods are simplistic and cannot satisfy user needs well. Meanwhile, during data fusion no open linked data sets are used as prior knowledge, so large-scale heterogeneous data sources cannot be fused efficiently and accurately while reducing complexity.
Summary of the invention
In view of the problems in the prior art, the present invention provides a multi-source data aggregation sampling method and system in a big data environment.
The invention is realized as follows: a multi-source data aggregation sampling method in a big data environment, the method comprising:
fusing data sets from different sources through a data fusion module using a fusion program; when fusing entity data from multiple sources, canonically representing the attributes of each data source, which includes mapping synonymous attributes and unifying the numerical units of attribute values; performing block aggregation of entities based on entity names and entity attributes; treating entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm, matching entity pairs that describe the same real-world object in different sources, establishing equivalence links for the same entity between different data sources, merging entity attributes, and appending entities exclusive to one data source directly into the knowledge base;
segmenting the fused files through a word segmentation module to form a two-dimensional word-frequency matrix of the files;
s.t. X_i = X_i A_i + E_i, i = 1, ..., K
where α is a coefficient greater than 0, used to measure the error introduced by segmenting normal and abnormal words;
which is equivalent to the model:
s.t. X_i = X_i S_i + E_i,
A_i = J_i,
A_i = S_i, i = 1, ..., K
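The two-dimensional word-frequency matrix named above is simply a file-by-word count table built from the segmented files. A minimal sketch of that construction (the file names and word lists are hypothetical, and plain list-of-words input stands in for the output of the word segmentation module):

```python
def build_frequency_matrix(files):
    """Build a 2-D file-by-word frequency matrix from segmented files.

    files: dict mapping file name -> list of words produced by segmentation.
    Returns the sorted vocabulary and one row of counts per file.
    """
    vocab = sorted({w for words in files.values() for w in words})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = {}
    for name, words in files.items():
        row = [0] * len(vocab)
        for w in words:
            row[index[w]] += 1   # count each occurrence of the word in this file
        matrix[name] = row
    return vocab, matrix

docs = {
    "f1.txt": ["data", "fusion", "data"],
    "f2.txt": ["fusion", "sampling"],
}
vocab, matrix = build_frequency_matrix(docs)
print(vocab)             # ['data', 'fusion', 'sampling']
print(matrix["f1.txt"])  # [2, 1, 0]
```

Each matrix row is a word-frequency vector for one file, which is the form consumed by the model above.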
Further, the multi-source data aggregation sampling method in a big data environment further comprises:
Step 1: collecting multiple original data sources through a data source acquisition module, each original data source comprising a data source name and at least one associated domain;
Step 2: the central control module cleaning, identifying, and removing redundancy from the collected data sources through a preprocessing module using a data processing program;
Step 3: obtaining an original policy list from the original data sources through a policy list construction module using a construction program, and sorting the original policies in the list to form an inter-source policy list;
Step 4: fusing data sets from different sources through a data fusion module using a fusion program;
Step 5: segmenting the fused files through a word segmentation module to form a two-dimensional word-frequency matrix of the files;
Step 6: selecting seed root-node keywords guided by the data target through a sampling module using a sampling program, inputting a snowball sampling depth, setting a balance verification value on the basis of the seed root-node data, and performing snowball sampling on each word iteratively;
Step 7: showing the collected multi-source data on a display through a display module.
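Step 6 can be read as a breadth-first expansion: starting from the seed root-node keywords, related words are repeatedly pulled in up to the given sampling depth, with the balance verification value capping how many new words each word may contribute. The patent does not specify the word relation structure, so the sketch below assumes a hypothetical co-occurrence graph:

```python
from collections import deque

def snowball_sample(graph, seeds, depth, balance_cap):
    """Breadth-first snowball sampling from seed root-node keywords.

    graph: dict mapping word -> list of related (co-occurring) words
    depth: maximum expansion depth from the seeds (the sampling depth)
    balance_cap: at most this many unseen neighbors are taken per word,
                 standing in for the balance verification value.
    """
    sampled = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        word, d = frontier.popleft()
        if d >= depth:
            continue                       # depth limit reached for this branch
        taken = 0
        for nbr in graph.get(word, []):
            if nbr not in sampled and taken < balance_cap:
                sampled.add(nbr)           # snowball: pull the neighbor in
                frontier.append((nbr, d + 1))
                taken += 1
    return sampled

word_graph = {
    "data":   ["fusion", "source", "sampling"],
    "fusion": ["entity", "attribute"],
    "source": ["heterogeneous"],
}
print(sorted(snowball_sample(word_graph, ["data"], depth=2, balance_cap=2)))
# ['attribute', 'data', 'entity', 'fusion', 'heterogeneous', 'source']
```

With `balance_cap=2`, the word "sampling" is never drawn even though it neighbors the seed, illustrating how the cap keeps each expansion step balanced.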
Further, the processing method of the preprocessing module comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading it to the distributed file system HDFS for storage;
(2) loading the data in HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion;
(3) for the cleaned data, identifying the different representations of the same entity, correctly distinguishing all distinct entities, and merging the data of the same entity;
(4) removing redundant data using a hash-based data deduplication technique.
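The hash-based deduplication of step (4) keeps only the first record seen for each hash of the record's content. The patent applies this inside Spark; the following is a minimal single-machine sketch in Python, with hypothetical record data:

```python
import hashlib

def dedup_by_hash(records):
    """Keep the first occurrence of each record, keyed by a hash of its content.

    Hashing gives a fixed-size key, so large records can be compared cheaply;
    a collision check against the stored record could be added if exact
    guarantees are required.
    """
    seen = set()
    unique = []
    for rec in records:
        # Serialize the record deterministically (sorted fields) before hashing,
        # so field order does not affect the hash.
        key = hashlib.sha256(repr(sorted(rec.items())).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"name": "Zhang San", "city": "Beijing"},
    {"city": "Beijing", "name": "Zhang San"},   # same content, different field order
    {"name": "Li Si", "city": "Shanghai"},
]
print(len(dedup_by_hash(rows)))  # 2: the reordered duplicate hashes identically
```

In a distributed setting the same idea maps naturally onto a reduce-by-key over the hash values.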
Further, in step (1), structured, semi-structured, and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage.
The formats of the heterogeneous data sources include Txt, Csv, Xsl, database data, jpg, and mp4, and an interface standard is provided for extending to new data sources.
For text files, including Txt and Csv, a text storage function is designed to read the text data from the files and store it in the distributed file system HDFS.
For Xsl files, an Xsl storage function is designed to read the Excel data from the Excel files and store it in the distributed file system HDFS.
For database data, including MySQL and Oracle, the data is read from the database through the ODBC or JDBC database access interface and stored in the distributed file system HDFS.
For other types of files, including jpg and mp4, corresponding file storage functions are designed to read the data from the corresponding data sources and store it in the distributed file system HDFS.
Further, in step (2), data cleaning means loading the data in the distributed file system HDFS into memory based on the Spark big data processing framework and performing denoising, deduplication, and format conversion. The detailed process comprises:
Reading data: establishing a data model based on Spark RDD/DataFrame and reading the data from the HDFS files, converting it into RDD/DataFrame;
Removing duplicate data: removing duplicates from the data produced by the reading step, through custom functions or built-in functions;
Removing noise data: using a rule engine to allow free configuration of combined conditional judgment rules, reducing or removing noise data while avoiding the loss of valid information;
Performing format conversion: converting data of different formats into a unified format.
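The cleaning process above (read, deduplicate, apply configurable noise rules, convert format) can be sketched on plain Python lists; in the patent the same steps run over Spark RDD/DataFrame partitions. The noise rules below are illustrative stand-ins for the freely configurable rule engine:

```python
def clean(records, noise_rules):
    """Deduplicate, drop records matching any noise rule, and unify format."""
    # Removing duplicate data: exact-duplicate removal via a set of keys.
    seen, deduped = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(rec)
    # Removing noise data: each rule is a predicate; a record is treated as
    # noise if any configured rule fires (combined conditional judgment).
    kept = [r for r in deduped if not any(rule(r) for rule in noise_rules)]
    # Format conversion: unify field naming (here: strip and lowercase keys).
    return [{k.strip().lower(): v for k, v in r.items()} for r in kept]

rules = [
    lambda r: not r.get("Name"),       # hypothetical rule: empty name is noise
    lambda r: r.get("Age", 0) < 0,     # hypothetical rule: negative age is noise
]
data = [
    {"Name": "Wang", "Age": 30},
    {"Name": "Wang", "Age": 30},       # duplicate
    {"Name": "", "Age": 12},           # noise
]
print(clean(data, rules))  # [{'name': 'Wang', 'age': 30}]
```

Because the rules are plain predicates, they can be added, removed, and combined freely, which mirrors the rule engine's free configuration of judgment conditions.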
Further, the canonical representation comprises normalization methods for numeric attributes and date attributes. The attribute values of date attributes are uniformly expressed as year XX, month XX, day XX. Normalization of the attribute values of numeric attributes mainly comprises one or both of value conversion and unit unification: value conversion refers to converting thousand separators, Chinese capital numerals, and similar forms in the original value entirely into Arabic numerals, and unit unification performs value conversion between different units of the same category.
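A minimal sketch of the two normalization steps just described: thousand separators are stripped, Chinese capital numerals are mapped digit by digit to Arabic numerals, and a small length-unit table illustrates unit unification. The digit table and unit list are illustrative only; real Chinese numeral parsing (with positional characters such as 拾, 佰, 仟) needs more logic than this per-character mapping:

```python
# Digit-by-digit mapping for Chinese capital numerals; positional characters
# (拾, 佰, 仟, 万, ...) are NOT handled by this simple table.
CAPITAL_DIGITS = {"零": "0", "壹": "1", "贰": "2", "叁": "3", "肆": "4",
                  "伍": "5", "陆": "6", "柒": "7", "捌": "8", "玖": "9"}

UNIT_FACTORS = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}  # illustrative

def canonicalize_number(text):
    """Strip thousand separators and map capital digits to Arabic numerals."""
    text = text.replace(",", "")
    return "".join(CAPITAL_DIGITS.get(ch, ch) for ch in text)

def unify_unit(value, unit, target="m"):
    """Convert a numeric value between units of the same category (length here)."""
    return value * UNIT_FACTORS[unit] / UNIT_FACTORS[target]

print(canonicalize_number("1,234,567"))  # 1234567
print(canonicalize_number("壹玖捌肆"))    # 1984
print(unify_unit(250.0, "cm"))           # 2.5
```

After this pass, attribute values from different sources compare on equal terms, which is what the block aggregation below relies on.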
The block aggregation of entities based on entity names and entity attributes first requires blocking the entities: entities that may refer to the same object are placed in the same block, and entities from different sources within the same block are then treated as candidate matching pairs, comparing pairwise whether entities from different data sources refer to the same object.
The blocking groups entities using a partition strategy based on entity names and entity attributes. The specific process of group aggregation is as follows: first, according to the entity name, each entity name is decomposed into a sequence of bigrams; second, each item in a bigram sequence serves as a key of an inverted index, and the entity is inserted into the inverted list of that item; then, the entities under each key of the inverted index are divided again according to entity attributes; finally, if entities from two different sources have two or more identical attributes and attribute values, they are subdivided into the same block.
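The blocking procedure above can be sketched compactly: decompose each entity name into character bigrams, build an inverted index from bigram to entities, and refine each bucket by shared attribute/value pairs. The entities and the two-shared-attribute threshold below follow the text; the data itself is made up:

```python
from collections import defaultdict
from itertools import combinations

def name_bigrams(name):
    """Decompose an entity name into its sequence of character bigrams."""
    return [name[i:i + 2] for i in range(len(name) - 1)]

def block_entities(entities, min_shared=2):
    """Form candidate entity pairs: same bigram bucket, then attribute check.

    entities: list of (source, name, attrs) tuples, attrs being a dict.
    A pair from different sources lands in the same block when the names
    share at least one bigram and at least min_shared attribute/value
    pairs agree.
    """
    inverted = defaultdict(list)                  # bigram -> entity indices
    for idx, (_, name, _) in enumerate(entities):
        for bg in set(name_bigrams(name)):
            inverted[bg].append(idx)
    pairs = set()
    for bucket in inverted.values():
        for i, j in combinations(bucket, 2):
            src_i, _, attrs_i = entities[i]
            src_j, _, attrs_j = entities[j]
            shared = sum(1 for k, v in attrs_i.items() if attrs_j.get(k) == v)
            if src_i != src_j and shared >= min_shared:
                pairs.add((min(i, j), max(i, j)))
    return pairs

ents = [
    ("src_a", "apple inc", {"country": "US", "industry": "tech"}),
    ("src_b", "apple incorporated", {"country": "US", "industry": "tech"}),
    ("src_b", "snapple", {"country": "US", "industry": "beverage"}),
]
print(block_entities(ents))  # {(0, 1)}
```

"snapple" shares name bigrams with the first entity but only one attribute value, so the attribute refinement keeps it out of the block, exactly the role the refinement step plays in the text.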
Another object of the present invention is to provide an information data processing terminal implementing the multi-source data aggregation sampling method in a big data environment.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to execute the multi-source data aggregation sampling method in a big data environment.
Another object of the present invention is to provide a multi-source data aggregation sampling system in a big data environment implementing the multi-source data aggregation sampling method in a big data environment, the system comprising:
a data source acquisition module, connected to the central control module, for collecting multiple original data sources, each original data source comprising a data source name and at least one associated domain;
a central control module, connected to the data source acquisition module, the preprocessing module, the policy list construction module, the data fusion module, the word segmentation module, the sampling module, and the display module, for controlling the normal operation of the modules through a central processing unit;
a preprocessing module, connected to the central control module, for cleaning, identifying, and removing redundancy from the collected data sources through a data processing program;
a policy list construction module, connected to the central control module, for obtaining an original policy list from the original data sources through a construction program and sorting the original policies in the list to form an inter-source policy list;
a data fusion module, connected to the central control module, for fusing data sets from different sources through a fusion program;
a word segmentation module, connected to the central control module, for segmenting the fused files to form a two-dimensional word-frequency matrix of the files;
a sampling module, connected to the central control module, for selecting seed root-node keywords guided by the data target through a sampling program, inputting a snowball sampling depth, setting a balance verification value on the basis of the seed root-node data, and performing snowball sampling on each word iteratively;
a display module, connected to the central control module, for showing the collected multi-source data on a display.
Another object of the present invention is to provide a multi-source data aggregation sampling cloud server implementing the multi-source data aggregation sampling method in a big data environment.
The advantages and positive effects of the present invention are as follows:
Through the preprocessing module, the present invention preprocesses big data using the Spark big data processing framework, which not only reduces storage resources and network bandwidth and improves data storage efficiency, but also improves the quality of subsequent data analysis. The Spark framework keeps data resident in memory and improves read/write speed by building resilient distributed dataset (RDD) structures. Compute nodes are scheduled by Spark to complete distributed computing, achieving more efficient data preprocessing; the method is practical and widely applicable. Meanwhile, through the data fusion module, the required data can be extracted from knowledge bases of multiple different fields and scales and fused into a complete data source supporting applications; the data of multiple data sources is fused, redundancy is merged, and useful information is expanded.
The present invention fuses data sets from different sources through the data fusion module using a fusion program. When fusing entity data from multiple sources, the attributes of each data source are canonically represented, which includes mapping synonymous attributes and unifying the numerical units of attribute values; block aggregation of entities is performed based on entity names and entity attributes; entities from different sources within the same block are treated as candidate entity pairs, the similarity between entities is computed with an entity alignment algorithm, entity pairs describing the same real-world object in different sources are matched, equivalence links for the same entity are established between different data sources, entity attributes are merged, and entities exclusive to one data source are appended directly into the knowledge base.
The fused files are segmented through the word segmentation module to form a two-dimensional word-frequency matrix of the files:
s.t. X_i = X_i A_i + E_i, i = 1, ..., K
where α is a coefficient greater than 0, used to measure the error introduced by segmenting normal and abnormal words;
which is equivalent to the model:
s.t. X_i = X_i S_i + E_i,
A_i = J_i,
A_i = S_i, i = 1, ..., K
This solves the prior-art problem that large-scale heterogeneous data sources cannot be fused efficiently and accurately while reducing complexity.
Brief description of the drawings
Fig. 1 is a flow chart of the multi-source data aggregation sampling method in a big data environment provided by an embodiment of the present invention.
Fig. 2 is a structural block diagram of the multi-source data aggregation sampling system in a big data environment provided by an embodiment of the present invention.
In the figures: 1, data source acquisition module; 2, central control module; 3, preprocessing module; 4, policy list construction module; 5, data fusion module; 6, word segmentation module; 7, sampling module; 8, display module.
Detailed description of the embodiments
To further explain the content, features, and effects of the present invention, the following embodiments are given in detail with reference to the accompanying drawings.
The structure of the invention is explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the multi-source data aggregation sampling method in a big data environment provided by the present invention comprises the following steps:
S101: collecting multiple original data sources through a data source acquisition module, each original data source comprising a data source name and at least one associated domain;
S102: the central control module cleaning, identifying, and removing redundancy from the collected data sources through a preprocessing module using a data processing program;
S103: obtaining an original policy list from the original data sources through a policy list construction module using a construction program, and sorting the original policies in the list to form an inter-source policy list;
S104: fusing data sets from different sources through a data fusion module using a fusion program;
S105: segmenting the fused files through a word segmentation module to form a two-dimensional word-frequency matrix of the files;
S106: selecting seed root-node keywords guided by the data target through a sampling module using a sampling program, inputting a snowball sampling depth, setting a balance verification value on the basis of the seed root-node data, and performing snowball sampling on each word iteratively;
S107: showing the collected multi-source data on a display through a display module.
In step S104, data sets from different sources are fused through the data fusion module using the fusion program. When fusing entity data from multiple sources, the attributes of each data source are canonically represented, which includes mapping synonymous attributes and unifying the numerical units of attribute values; block aggregation of entities is performed based on entity names and entity attributes; entities from different sources within the same block are treated as candidate entity pairs, the similarity between entities is computed with an entity alignment algorithm, entity pairs describing the same real-world object in different sources are matched, equivalence links for the same entity are established between different data sources, entity attributes are merged, and entities exclusive to one data source are appended directly into the knowledge base.
The fused files are segmented through the word segmentation module to form a two-dimensional word-frequency matrix of the files:
s.t. X_i = X_i A_i + E_i, i = 1, ..., K
where α is a coefficient greater than 0, used to measure the error introduced by segmenting normal and abnormal words;
which is equivalent to the model:
s.t. X_i = X_i S_i + E_i,
A_i = J_i,
A_i = S_i, i = 1, ..., K
As shown in Fig. 2, the multi-source data aggregation sampling system in a big data environment provided by an embodiment of the present invention comprises: a data source acquisition module 1, a central control module 2, a preprocessing module 3, a policy list construction module 4, a data fusion module 5, a word segmentation module 6, a sampling module 7, and a display module 8.
The data source acquisition module 1 is connected to the central control module 2 and collects multiple original data sources, each original data source comprising a data source name and at least one associated domain.
The central control module 2 is connected to the data source acquisition module 1, the preprocessing module 3, the policy list construction module 4, the data fusion module 5, the word segmentation module 6, the sampling module 7, and the display module 8, and controls the normal operation of the modules through a central processing unit.
The preprocessing module 3 is connected to the central control module 2 and cleans, identifies, and removes redundancy from the collected data sources through a data processing program.
The policy list construction module 4 is connected to the central control module 2; it obtains an original policy list from the original data sources through a construction program and sorts the original policies in the list to form an inter-source policy list.
The data fusion module 5 is connected to the central control module 2 and fuses data sets from different sources through a fusion program.
The word segmentation module 6 is connected to the central control module 2 and segments the fused files to form a two-dimensional word-frequency matrix of the files.
The sampling module 7 is connected to the central control module 2; through a sampling program it selects seed root-node keywords guided by the data target, inputs a snowball sampling depth, sets a balance verification value on the basis of the seed root-node data, and performs snowball sampling on each word iteratively.
The display module 8 is connected to the central control module 2 and shows the collected multi-source data on a display.
The processing method of the preprocessing module 3 provided by the invention comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading it to the distributed file system HDFS for storage;
(2) loading the data in HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion;
(3) for the cleaned data, identifying the different representations of the same entity, correctly distinguishing all distinct entities, and merging the data of the same entity;
(4) removing redundant data using a hash-based data deduplication technique.
In step (1) provided by the invention, structured, semi-structured, and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage.
The formats of the heterogeneous data sources include Txt, Csv, Xsl, database data, jpg, and mp4, and an interface standard is provided for extending to new data sources.
For text files, including Txt and Csv, a text storage function is designed to read the text data from the files and store it in the distributed file system HDFS.
For Xsl files, an Xsl storage function is designed to read the Excel data from the Excel files and store it in the distributed file system HDFS.
For database data, including MySQL and Oracle, the data is read from the database through the ODBC or JDBC database access interface and stored in the distributed file system HDFS.
For other types of files, including jpg and mp4, corresponding file storage functions are designed to read the data from the corresponding data sources and store it in the distributed file system HDFS.
In step (2) provided by the invention, data cleaning means loading the data in the distributed file system HDFS into memory based on the Spark big data processing framework and performing denoising, deduplication, and format conversion. The detailed process comprises:
Reading data: establishing a data model based on Spark RDD/DataFrame and reading the data from the HDFS files, converting it into RDD/DataFrame;
Removing duplicate data: removing duplicates from the data produced by the reading step, through custom functions or built-in functions;
Removing noise data: using a rule engine to allow free configuration of combined conditional judgment rules, reducing or removing noise data while avoiding the loss of valid information;
Performing format conversion: converting data of different formats into a unified format.
The fusion method of the data fusion module 5 provided by the invention comprises:
1) when fusing entity data from multiple sources, canonically representing the attributes of each data source, which includes mapping synonymous attributes and unifying the numerical units of attribute values;
2) performing block aggregation of entities based on entity names and entity attributes;
3) treating entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm, matching entity pairs describing the same real-world object in different sources, establishing equivalence links for the same entity between different data sources, merging entity attributes, and appending entities exclusive to one data source directly into the knowledge base.
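In step 3), candidate entity pairs within a block are scored by an entity alignment algorithm; the patent does not name the similarity measure, so the sketch below uses Jaccard similarity over attribute/value pairs as an illustrative stand-in, and merges the attributes of a matched pair as the text describes. The entities and threshold are hypothetical:

```python
def jaccard_similarity(attrs_a, attrs_b):
    """Jaccard similarity over attribute/value pairs (an illustrative measure)."""
    a, b = set(attrs_a.items()), set(attrs_b.items())
    return len(a & b) / len(a | b) if a | b else 0.0

def align_and_merge(entity_a, entity_b, threshold=0.5):
    """If two entities are similar enough, merge their attributes.

    Returns the merged attribute dict (entity_a's values win on conflict),
    or None when the pair is judged not to describe the same object.
    """
    if jaccard_similarity(entity_a, entity_b) < threshold:
        return None
    merged = dict(entity_b)
    merged.update(entity_a)   # union of attributes from both sources
    return merged

e1 = {"name": "beijing", "country": "CN", "type": "city"}
e2 = {"name": "beijing", "country": "CN", "population": "21M"}
print(align_and_merge(e1, e2))
# {'name': 'beijing', 'country': 'CN', 'population': '21M', 'type': 'city'}
```

The merged record carries attributes from both sources, which is the attribute-merging step that follows a successful match; an unmatched entity would instead be appended to the knowledge base on its own.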
The canonical representation provided by the invention comprises normalization methods for numeric attributes and date attributes. The attribute values of date attributes are uniformly expressed as year XX, month XX, day XX. Normalization of the attribute values of numeric attributes mainly comprises one or both of value conversion and unit unification: value conversion refers to converting thousand separators, Chinese capital numerals, and similar forms in the original value entirely into Arabic numerals, and unit unification performs value conversion between different units of the same category.
The block aggregation of entities based on entity names and entity attributes provided by the invention first requires blocking the entities: entities that may refer to the same object are placed in the same block, and entities from different sources within the same block are then treated as candidate matching pairs, comparing pairwise whether entities from different data sources refer to the same object.
The blocking provided by the invention groups entities using a partition strategy based on entity names and entity attributes. The specific process of group aggregation is as follows: first, according to the entity name, each entity name is decomposed into a sequence of bigrams; second, each item in a bigram sequence serves as a key of an inverted index, and the entity is inserted into the inverted list of that item; then, the entities under each key of the inverted index are divided again according to entity attributes; finally, if entities from two different sources have two or more identical attributes and attribute values, they are subdivided into the same block.
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly in the form of a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The above are only preferred embodiments of the present invention and are not intended to limit the invention in any form. Any simple modification, equivalent variation, or alteration of the above embodiments made according to the technical essence of the invention falls within the scope of the technical solution of the present invention.
Claims (10)
1. A multi-source data aggregation sampling method in a big data environment, characterized in that the multi-source data aggregation sampling method in a big data environment comprises:
performing fusion processing on data sets from different sources by a data fusion module using a fusion program; when merging entity data from multiple sources, performing canonical representation on the attributes of each data source respectively, the canonical representation comprising the mapping of synonymous attributes and the unified conversion of the numerical units of attribute values; performing block aggregation on entities based on entity names and entity attributes; taking entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm to obtain matched entity pairs that describe the same objective-world object in different sources, establishing equivalence links for the same entity across different data sources, and merging entity attributes, while entities exclusive to a single data source are appended directly to the knowledge base;
segmenting the fused files by a word segmentation module to form a two-dimensional file-word frequency matrix:
s.t. X_i = X_i·A_i + E_i,  i = 1, …, K
where α is a coefficient greater than 0 that weights the error introduced by segmenting normal words and abnormal words; this is equivalent to the following model:
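The two-dimensional file-word frequency matrix formed by the word segmentation module in the claim above can be sketched as follows. This is a minimal illustration with hypothetical file names; pre-segmented word lists stand in for the output of a real Chinese word segmenter:

```python
from collections import Counter

def build_frequency_matrix(files):
    """Build a two-dimensional file-word frequency matrix.

    `files` maps a file name to its list of segmented words.
    Returns (vocabulary, matrix) where matrix[i][j] is the count of
    vocabulary[j] in the i-th file (files taken in sorted name order).
    """
    vocab = sorted({w for words in files.values() for w in words})
    matrix = []
    for name in sorted(files):
        counts = Counter(files[name])
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

vocab, m = build_frequency_matrix({
    "doc1": ["data", "fusion", "data"],
    "doc2": ["fusion", "sampling"],
})
```

Each row of `m` is one file's word-frequency vector, which is the form later modules consume.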
2. The multi-source data aggregation sampling method in a big data environment according to claim 1, characterized in that the multi-source data aggregation sampling method in a big data environment further comprises:
Step 1: acquiring a plurality of original data sources by a data source acquisition module, each original data source comprising a data source name and at least one associated domain;
Step 2: cleaning, identifying, and removing redundancy from the acquired data sources by the central control module through a preprocessing module using a data processing program;
Step 3: obtaining an original policy list by a policy list construction module using a construction program according to the original data sources, and sorting the original policies in the original policy list to form an inter-data-source policy list;
Step 4: performing fusion processing on data sets from different sources by the data fusion module using the fusion program;
Step 5: segmenting the fused files by the word segmentation module to form the two-dimensional file-word frequency matrix;
Step 6: selecting seed root-node keywords guided by the data target by a sampling module using a sampling program, inputting the snowball-sampling depth, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling by iterating over each word;
Step 7: displaying the acquired multi-source data on a display by a display module.
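The snowball sampling of Step 6 can be sketched as follows. This is a minimal illustration over a toy keyword graph; `balance` is a stand-in for the balance check value (here read as a per-word expansion cap), and all names are hypothetical:

```python
def snowball_sample(graph, seeds, depth, balance=2):
    """Snowball sampling from seed root-node keywords.

    `graph` maps a word to its related words, `depth` is the
    snowball-sampling depth, and `balance` caps how many neighbours
    are followed per word in each round.
    """
    sampled = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for word in frontier:               # iterate over each word in this round
            for neighbour in graph.get(word, [])[:balance]:
                if neighbour not in sampled:
                    sampled.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return sampled

graph = {
    "big data": ["sampling", "fusion", "hdfs"],
    "sampling": ["snowball"],
    "fusion": ["entity"],
}
result = snowball_sample(graph, ["big data"], depth=2)
```

With `balance=2`, the third neighbour of "big data" is never followed, which is the kind of growth control the balance check value suggests.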
3. The multi-source data aggregation sampling method in a big data environment according to claim 2, characterized in that the processing method of the preprocessing module comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading the data to the distributed file system HDFS for storage;
(2) loading the data in the distributed file system HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion operations;
(3) for the cleaned data, identifying the different representations of the same entity, and merging the data of all representations correctly identified as the same entity;
(4) removing redundant data using a hash-value-based data deduplication technique.
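The hash-value-based deduplication of step (4) can be sketched as follows. This is a minimal illustration; the choice of SHA-256 over a canonical string form of each record is an assumption, since the claim does not fix the hash function:

```python
import hashlib

def deduplicate(records):
    """Remove redundant records by comparing hash values.

    Each record (a dict) is reduced to a SHA-256 digest of its
    key-sorted string form, so key order does not affect the hash;
    records whose digest has already been seen are dropped.
    """
    seen = set()
    unique = []
    for record in records:
        canonical = repr(sorted(record.items())).encode("utf-8")
        digest = hashlib.sha256(canonical).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

rows = [
    {"name": "Zhang San", "city": "Tianjin"},
    {"city": "Tianjin", "name": "Zhang San"},  # redundant copy, different key order
    {"name": "Li Si", "city": "Beijing"},
]
unique = deduplicate(rows)
```

Comparing fixed-length digests instead of whole records keeps the seen-set small when records are large.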
4. The multi-source data aggregation sampling method in a big data environment according to claim 3, characterized in that in step (1), structured, semi-structured, and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage;
the formats of the heterogeneous data sources include Txt, Csv, Xsl, database data, jpg, and mp4, and a standard interface is provided so that new data sources can be added;
for text files, including Txt and Csv, text data are read from the text file by designing a text storage function and stored in the distributed file system HDFS;
for Xsl files, excel data are read from the Excel file by designing an Xsl storage function and stored in the distributed file system HDFS;
for database data, including MySQL and Oracle, data are read from the database through the database access interface ODBC or JDBC and stored in the distributed file system HDFS;
for other types of files, including jpg and mp4, the data in the corresponding data sources are read by designing corresponding file storage functions and stored in the distributed file system HDFS.
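The per-format storage functions of this claim can be sketched as a dispatch table. This is a hypothetical pure-Python stand-in: real readers would go through Spark, ODBC/JDBC, or binary loaders and write to HDFS, so each reader here just returns a tag to demonstrate the routing and the standard extension interface:

```python
# Hypothetical registry mapping a file extension to its reader function.
READERS = {
    "txt": lambda path: ("text", path),
    "csv": lambda path: ("text", path),
    "xsl": lambda path: ("excel", path),
    "jpg": lambda path: ("binary", path),
    "mp4": lambda path: ("binary", path),
}

def register_reader(extension, reader):
    """The 'standard interface' for adding a new data source format."""
    READERS[extension.lower()] = reader

def route(path):
    """Dispatch a source file to the reader for its format."""
    extension = path.rsplit(".", 1)[-1].lower()
    try:
        return READERS[extension](path)
    except KeyError:
        raise ValueError(f"unsupported source format: {extension}")

kind, _ = route("report.csv")
```

A new format is supported by registering one reader, without touching the dispatch logic, e.g. `register_reader("parquet", my_parquet_reader)`.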
5. The multi-source data aggregation sampling method in a big data environment according to claim 3, characterized in that in step (2), the data cleaning means loading the data in the distributed file system HDFS into memory based on the Spark big data processing framework and performing denoising, deduplication, and format conversion operations, the detailed process comprising:
reading data: establishing a data model based on Spark RDD/DataFrame and reading the data in the HDFS files, converting them into RDD/DataFrame;
removing duplicate data: removing duplicates from the data generated in the reading step by means of designed functions or built-in functions;
removing noise data: using a rule engine to allow free configuration of combined condition judgment rules, reducing or removing noise data while avoiding the loss of effective information;
performing format conversion: converting data of different formats into a unified format.
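The configurable noise rules and the format-unification step can be sketched together as follows. This is a minimal stand-in for the rule engine: each rule is a regular expression, rules are combined as "any match drops the row", and the unified format is assumed (for illustration) to be trimmed, lower-case text:

```python
import re

# Freely configurable noise rules; a real rule engine would load these
# from configuration rather than hard-code them.
NOISE_RULES = [
    re.compile(r"^\s*$"),           # blank rows
    re.compile(r"N/A|NULL", re.I),  # placeholder values
]

def clean(rows, rules=NOISE_RULES):
    """Drop noise rows by rule, then convert the rest to one format."""
    kept = [r for r in rows if not any(rule.search(r) for rule in rules)]
    return [r.strip().lower() for r in kept]  # unified format

cleaned = clean(["  Sensor-01  ", "NULL", "", "Sensor-02"])
```

Because the rules are data rather than code, noise handling can be tightened or relaxed without redeployment, which is the point of using a rule engine here.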
6. The multi-source data aggregation sampling method in a big data environment according to claim 5, characterized in that the performing of canonical representation includes normalization methods for numeric attributes and date attributes; the attribute values of date attributes are uniformly expressed as year XX, month XX, day XX, and the normalization of the attribute values of numeric attributes mainly includes two steps, numerical conversion and unit unification: numerical conversion means completely converting original values containing thousands separators, Chinese numerals, and similar cases into Arabic numerals, and unit unification means performing numerical conversion between different units of the same category;
the performing of block aggregation on entities based on entity names and entity attributes requires first partitioning the entities into blocks, putting entities that may refer to the same object into the same block, and then taking the entities from different sources in the same block as candidate matching entity pairs, comparing the entities from different data sources pairwise to determine whether they refer to the same object;
the partitioning groups and aggregates entities using a partitioning strategy based on entity names and entity attributes, and the specific process of the grouped aggregation is: first, each entity name is decomposed into a bigram sequence; second, each item of the bigram sequence is used as a key of an inverted index, and the entity is inserted into the inverted index entry corresponding to that item; then, the entities corresponding to each key of the inverted index are further partitioned according to entity attributes; finally, if two entities from different sources have more than two identical attributes and attribute values, they are subdivided into the same block.
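The bigram inverted-index blocking described above can be sketched as follows. This is a minimal Python illustration with hypothetical entities; "more than two identical attributes and attribute values" is read literally as at least three shared attribute/value pairs:

```python
from collections import defaultdict

def bigrams(name):
    """Decompose an entity name into its bigram sequence."""
    return [name[i:i + 2] for i in range(len(name) - 1)]

def block(entities):
    """Group candidates via an inverted index keyed by name bigrams,
    then refine by shared attribute/value pairs."""
    index = defaultdict(list)
    for entity in entities:
        for key in set(bigrams(entity["name"])):
            index[key].append(entity)       # insert entity under this bigram key
    blocks = []
    for group in index.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a, b = group[i], group[j]
                shared = [k for k in a["attrs"] if b["attrs"].get(k) == a["attrs"][k]]
                pair = {a["name"], b["name"]}
                if len(shared) > 2 and pair not in blocks:
                    blocks.append(pair)
    return blocks

entities = [
    {"name": "apple inc", "attrs": {"country": "US", "sector": "tech", "founded": "1976"}},
    {"name": "apple incorporated", "attrs": {"country": "US", "sector": "tech", "founded": "1976"}},
    {"name": "banana co", "attrs": {"country": "US"}},
]
pairs = block(entities)
```

Only pairs that both share a name bigram and agree on more than two attribute/value pairs end up in the same block, which keeps pairwise entity-alignment comparisons far below the full quadratic candidate set.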
7. An information data processing terminal implementing the multi-source data aggregation sampling method in a big data environment according to any one of claims 1 to 6.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the multi-source data aggregation sampling method in a big data environment according to any one of claims 1 to 6.
9. A multi-source data aggregation sampling system in a big data environment implementing the multi-source data aggregation sampling method in a big data environment according to claim 1, characterized in that the multi-source data aggregation sampling system in a big data environment comprises:
a data source acquisition module, connected to the central control module, for acquiring a plurality of original data sources, each original data source comprising a data source name and at least one associated domain;
a central control module, connected to the data source acquisition module, the preprocessing module, the policy list construction module, the data fusion module, the word segmentation module, the sampling module, and the display module, for controlling the normal operation of each module through a central processing unit;
a preprocessing module, connected to the central control module, for cleaning, identifying, and removing redundancy from the acquired data sources through a data processing program;
a policy list construction module, connected to the central control module, for obtaining an original policy list according to the original data sources through a construction program, and sorting the original policies in the original policy list to form an inter-data-source policy list;
a data fusion module, connected to the central control module, for performing fusion processing on data sets from different sources through a fusion program;
a word segmentation module, connected to the central control module, for segmenting the fused files to form a two-dimensional file-word frequency matrix;
a sampling module, connected to the central control module, for selecting seed root-node keywords guided by the data target through a sampling program, inputting the snowball-sampling depth, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling by iterating over each word;
a display module, connected to the central control module, for displaying the acquired multi-source data on a display.
10. A cloud server for multi-source data aggregation sampling in a big data environment implementing the multi-source data aggregation sampling method in a big data environment according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910373940.8A CN110147357A (en) | 2019-05-07 | 2019-05-07 | The multi-source data polymerization methods of sampling and system under a kind of environment based on big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110147357A true CN110147357A (en) | 2019-08-20 |
Family
ID=67594665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910373940.8A Pending CN110147357A (en) | 2019-05-07 | 2019-05-07 | The multi-source data polymerization methods of sampling and system under a kind of environment based on big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110147357A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066534A (en) * | 2017-03-02 | 2017-08-18 | 人谷科技(北京)有限责任公司 | Multi-source data polymerization and system |
CN107451282A (en) * | 2017-08-09 | 2017-12-08 | 南京审计大学 | A kind of multi-source data polymerization Sampling Strategies under the environment based on big data |
CN107633075A (en) * | 2017-09-22 | 2018-01-26 | 吉林大学 | A kind of multi-source heterogeneous data fusion platform and fusion method |
CN108470074A (en) * | 2018-04-04 | 2018-08-31 | 河北北方学院 | A kind of multi-source data under the environment based on big data polymerize sampling system |
CN108647318A (en) * | 2018-05-10 | 2018-10-12 | 北京航空航天大学 | A kind of knowledge fusion method based on multi-source data |
CN109165202A (en) * | 2018-07-04 | 2019-01-08 | 华南理工大学 | A kind of preprocess method of multi-source heterogeneous big data |
2019-05-07 CN CN201910373940.8A patent/CN110147357A/en active Pending
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110515926A (en) * | 2019-08-28 | 2019-11-29 | 国网天津市电力公司 | Heterogeneous data source mass data carding method based on participle and semantic dependency analysis |
CN110597879A (en) * | 2019-09-17 | 2019-12-20 | 第四范式(北京)技术有限公司 | Method and device for processing time series data |
CN110597879B (en) * | 2019-09-17 | 2022-01-14 | 第四范式(北京)技术有限公司 | Method and device for processing time series data |
CN112579770A (en) * | 2019-09-30 | 2021-03-30 | 北京国双科技有限公司 | Knowledge graph generation method, device, storage medium and equipment |
US11449514B2 (en) | 2019-12-27 | 2022-09-20 | Interset Software LLC | Approximate aggregation queries |
CN111431967A (en) * | 2020-02-25 | 2020-07-17 | 天宇经纬(北京)科技有限公司 | Multi-source heterogeneous data representation and distribution method and device based on business rules |
CN111400569A (en) * | 2020-03-13 | 2020-07-10 | 重庆特斯联智慧科技股份有限公司 | Big data analysis method and system of multi-source aggregation structure |
CN111581281A (en) * | 2020-04-24 | 2020-08-25 | 贵州力创科技发展有限公司 | Data fusion method and device |
CN111639054A (en) * | 2020-05-29 | 2020-09-08 | 中国人民解放军国防科技大学 | Data coupling method, system and medium for ocean mode and data assimilation |
CN111639054B (en) * | 2020-05-29 | 2023-11-07 | 中国人民解放军国防科技大学 | Data coupling method, system and medium for ocean mode and data assimilation |
WO2021135323A1 (en) * | 2020-07-31 | 2021-07-08 | 平安科技(深圳)有限公司 | Method and apparatus for fusion processing of municipal multi-source heterogeneous data, and computer device |
CN111966571A (en) * | 2020-08-12 | 2020-11-20 | 重庆邮电大学 | Time estimation cooperative processing method based on ARM-FPGA coprocessor heterogeneous platform |
CN111966571B (en) * | 2020-08-12 | 2023-05-12 | 重庆邮电大学 | Time estimation cooperative processing method based on ARM-FPGA coprocessor heterogeneous platform |
CN111708773A (en) * | 2020-08-13 | 2020-09-25 | 江苏宝和数据股份有限公司 | Multi-source scientific and creative resource data fusion method |
CN111985578A (en) * | 2020-09-02 | 2020-11-24 | 深圳壹账通智能科技有限公司 | Multi-source data fusion method and device, computer equipment and storage medium |
CN112214573A (en) * | 2020-10-30 | 2021-01-12 | 数贸科技(北京)有限公司 | Information search system, method, computing device, and computer storage medium |
CN112486989A (en) * | 2020-11-28 | 2021-03-12 | 河北省科学技术情报研究院(河北省科技创新战略研究院) | Multi-source data granulation fusion and index classification and layering processing method |
CN113315813A (en) * | 2021-05-08 | 2021-08-27 | 重庆第二师范学院 | Information exchange method and system for big data internet information chain system |
CN113609715A (en) * | 2021-10-11 | 2021-11-05 | 深圳奥雅设计股份有限公司 | Multivariate model data fusion method and system under digital twin background |
CN114896963A (en) * | 2022-07-08 | 2022-08-12 | 北京百炼智能科技有限公司 | Data processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110147357A (en) | The multi-source data polymerization methods of sampling and system under a kind of environment based on big data | |
JP7170779B2 (en) | Methods and systems for automatic intent mining, classification, and placement | |
CA2953969C (en) | Interactive interfaces for machine learning model evaluations | |
CN103336790B (en) | Hadoop-based fast neighborhood rough set attribute reduction method | |
CN109739939A (en) | The data fusion method and device of knowledge mapping | |
US20150379430A1 (en) | Efficient duplicate detection for machine learning data sets | |
US20150379429A1 (en) | Interactive interfaces for machine learning model evaluations | |
CN109165202A (en) | A kind of preprocess method of multi-source heterogeneous big data | |
CN113032579B (en) | Metadata blood relationship analysis method and device, electronic equipment and medium | |
US20230139783A1 (en) | Schema-adaptable data enrichment and retrieval | |
KR102219955B1 (en) | Behavior-based platform system using the bigdata | |
CN110990467B (en) | BIM model format conversion method and conversion system | |
CN111627552B (en) | Medical streaming data blood-edge relationship analysis and storage method and device | |
CN114462623B (en) | Data analysis method, system and platform based on edge calculation | |
CN110928981A (en) | Method, system and storage medium for establishing and perfecting iteration of text label system | |
Cong | Personalized recommendation of film and television culture based on an intelligent classification algorithm | |
JP7347179B2 (en) | Methods, devices and computer programs for extracting web page content | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN116522912B (en) | Training method, device, medium and equipment for package design language model | |
Ravichandran | Big Data processing with Hadoop: a review | |
JP2022168859A (en) | Computer implementation method, computer program, and system (prediction query processing) | |
CN110543467B (en) | Mode conversion method and device for time series database | |
US11514321B1 (en) | Artificial intelligence system using unsupervised transfer learning for intra-cluster analysis | |
CN109086373B (en) | Method for constructing fair link prediction evaluation system | |
Shouaib et al. | Survey on iot-based big data analytics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||