CN110147357A - Multi-source data aggregation sampling method and system in a big data environment - Google Patents

Multi-source data aggregation sampling method and system in a big data environment Download PDF

Info

Publication number
CN110147357A
CN110147357A (application CN201910373940.8A)
Authority
CN
China
Prior art keywords
data
source
entity
module
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910373940.8A
Other languages
Chinese (zh)
Inventor
云本胜
钱亚冠
胡月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Zhejiang University of Science and Technology ZUST
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201910373940.8A priority Critical patent/CN110147357A/en
Publication of CN110147357A publication Critical patent/CN110147357A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/11 - File system administration, e.g. details of archiving or snapshots
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/17 - Details of further file system functions
    • G06F16/174 - Redundancy elimination performed by the file system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/18 - File system types
    • G06F16/182 - Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of big data and discloses a multi-source data aggregation sampling method and system in a big data environment. Multiple original data sources are acquired, each comprising a data source name and at least one associated domain; the acquired data sources are cleaned, identified, and stripped of redundancy; a construction program derives an original policy list from the original data sources, and the original policies in the list are sorted to form an inter-source policy list; data sets from different sources are fused by a fusion program; the fused files are segmented to form a two-dimensional document-term frequency matrix; a balance check value is set and snowball sampling is performed iteratively on each term; the acquired multi-source data are shown on a display. Through the preprocessing module, the invention schedules compute nodes via Spark to complete distributed computing, achieving more efficient data preprocessing; it is practical and widely applicable.

Description

Multi-source data aggregation sampling method and system in a big data environment
Technical field
The invention belongs to the technical field of big data, and more particularly relates to a multi-source data aggregation sampling method and system in a big data environment.
Background art
Multi-source data fusion refers to integrating, by appropriate means, all of the information obtained through investigation and analysis, evaluating it in a unified way, and finally producing unified information. The purpose of this line of research is to integrate various heterogeneous data, draw on the characteristics of different data sources, and extract from them unified information that is better and richer than any single source. However, in existing multi-source data aggregation sampling under a big data environment, the preprocessing of structured, semi-structured and unstructured data has been insufficiently studied: such pipelines usually contain only two modules, data acquisition and data cleansing, and the cleansing methods are rather simple, so user requirements cannot be met well. Moreover, during data fusion no open linked data sets are used as prior knowledge, so large-scale heterogeneous data sources cannot be fused efficiently and accurately while keeping complexity low.
In conclusion problem of the existing technology is:
In multi-source data polymerization sampling process under existing big data environment, to structural data, for semi-structured, non- The data prediction of structuring studies deficiency, and usual includes data acquisition and two modules of data cleansing, and data The method of cleaning is also fairly simple, cannot meet user demand well;Meanwhile when the fusion of data, there is no open link number It is used as priori knowledge according to collection, melting for large scale scale heterogeneous data source can not be carried out by efficiently and accurately in the case where reducing more complicated degree It closes.
Summary of the invention
In view of the problems of the prior art, the present invention provides a multi-source data aggregation sampling method and system in a big data environment.
The invention is realized as follows: a multi-source data aggregation sampling method in a big data environment, the method comprising:
performing fusion processing on data sets from different sources by a data fusion module using a fusion program; when fusing entity data from multiple sources, performing canonical representation on the attributes of each data source, including the mapping of synonymous attributes and the unified conversion of the numeric units of attribute values; performing block-based aggregation of entities based on entity names and entity attributes; taking entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm, matching the entity pairs that describe the same real-world object in different sources, establishing equivalence links for the same entity across data sources, and merging entity attributes; entities exclusive to a single data source are appended directly to the knowledge base;
segmenting the fused files by a word segmentation module to form a two-dimensional document-term frequency matrix;
s.t. X_i = X_i A_i + E_i,  i = 1, ..., K
where α is a coefficient greater than 0 and weights the error introduced by segmenting normal words and abnormal words; this is equivalent to the following model:
s.t. X_i = X_i S_i + E_i,  A_i = J_i,  A_i = S_i,  i = 1, ..., K
Further, the multi-source data aggregation sampling method in a big data environment further comprises:
step 1: acquiring multiple original data sources through a data source acquisition module, each original data source comprising a data source name and at least one associated domain;
step 2: the central control module cleaning, identifying and removing redundancy from the acquired data sources through the preprocessing module using a data processing program;
step 3: a policy list construction module obtaining an original policy list from the original data sources using a construction program, and sorting the original policies in the original policy list to form an inter-source policy list;
step 4: performing fusion processing on data sets from different sources through the data fusion module using a fusion program;
step 5: segmenting the fused files through the word segmentation module to form a two-dimensional document-term frequency matrix;
step 6: the extraction module choosing target-oriented seed root-node keywords using a sampling program, taking a snowball sampling depth as input, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling on each term in a loop (a minimal sketch follows the step list);
step 7: displaying the acquired multi-source data on a display through the display module.
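As an illustration of step 6, the following is a minimal Python sketch of snowball sampling over a term graph, assuming the graph is derived from term co-occurrences in the two-dimensional document-term frequency matrix. The patent does not give the sampling program in code; the function name snowball_sample, the co-occurrence structure, and the reading of the balance check value as a cap on how many new neighbours any single term may contribute per round are illustrative assumptions.

```python
from collections import defaultdict

def snowball_sample(cooccurrence, seed_terms, depth, check_value):
    """Expand iteratively from target-oriented seed root-node keywords.

    cooccurrence : dict mapping a term to the terms it co-occurs with,
                   derived from the document-term frequency matrix
    seed_terms   : seed root-node keywords chosen for the data target
    depth        : snowball sampling depth (number of expansion rounds)
    check_value  : balance check value limiting how many unseen neighbours
                   any single term may contribute in one round
    """
    sampled = set(seed_terms)
    frontier = list(seed_terms)
    for _ in range(depth):
        next_frontier = []
        for term in frontier:
            unseen = [t for t in cooccurrence.get(term, []) if t not in sampled]
            for t in unseen[:check_value]:   # balance check: cap per-term growth
                sampled.add(t)
                next_frontier.append(t)
        frontier = next_frontier
    return sampled

# usage: a toy co-occurrence graph over terms from the fused corpus
graph = defaultdict(list, {
    "flood": ["rainfall", "river", "warning"],
    "rainfall": ["flood", "humidity"],
    "river": ["flood", "basin"],
})
print(snowball_sample(graph, ["flood"], depth=2, check_value=2))
```

Under this reading, the depth bounds how far the sample grows outward from the seed root nodes, while the check value keeps any single branch from dominating the sample.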
Further, the processing method of the preprocessing module comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading the data to the distributed file system HDFS for storage;
(2) loading the data in the distributed file system HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion;
(3) for the cleaned data, identifying the different representations of the same entity, correctly recognizing all distinct entities, and merging the data of the same entity;
(4) removing redundant data using hash-based data de-duplication (a minimal sketch is given below).
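Item (4) can be sketched as follows. This is a minimal illustration only: it assumes each record is available as a string, and MD5 is used merely because the patent does not name a particular hash function.

```python
import hashlib

def remove_redundant(records):
    """Keep only the first occurrence of each record, keyed by its content hash."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.md5(rec.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

print(remove_redundant(["a,1", "b,2", "a,1"]))  # ['a,1', 'b,2']
```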
Further, in step (1), structured, semi-structured and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage;
the formats of the heterogeneous data sources include Txt, Csv, Xls, database data, jpg and mp4, and an interface standard is provided so that new data sources can be added;
for text files, including Txt and Csv, a text storage function is designed to read the text data from the files and store it into the distributed file system HDFS;
for Xls files, an Xls storage function is designed to read the Excel data from the file and store it into the distributed file system HDFS;
for database data, including MySQL and Oracle, the data are read from the database through the database access interface ODBC or JDBC and stored into the distributed file system HDFS;
for other types of file, including jpg and mp4, corresponding file storage functions are designed to read the data from the corresponding data sources and store them into the distributed file system HDFS (see the ingestion sketch following this paragraph).
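The ingestion into HDFS described above can be sketched with PySpark as follows. This is an illustrative sketch rather than the patent's implementation: the paths, table name and credentials are placeholders, Parquet is an arbitrary choice of storage format, and the Excel route through pandas assumes an installed Excel reader such as openpyxl.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-source-ingestion").getOrCreate()

# text / CSV sources: read the raw text data and store it into HDFS
csv_df = spark.read.option("header", True).csv("file:///data/source.csv")
csv_df.write.mode("overwrite").parquet("hdfs:///raw/csv_source")

# Xls (Excel) sources: read via pandas, then parallelise as a Spark DataFrame
xls_df = spark.createDataFrame(pd.read_excel("/data/source.xlsx"))
xls_df.write.mode("overwrite").parquet("hdfs:///raw/excel_source")

# relational sources via JDBC (MySQL shown; Oracle is analogous)
db_df = (spark.read.format("jdbc")
         .option("url", "jdbc:mysql://db-host:3306/demo")
         .option("dbtable", "records")
         .option("user", "reader")
         .option("password", "***")
         .load())
db_df.write.mode("overwrite").parquet("hdfs:///raw/db_source")
```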
Further, in step (2), data cleansing is based on the Spark big data processing framework: the data in the distributed file system HDFS are loaded into memory and denoised, de-duplicated and format-converted. The detailed process, sketched in code after this list, comprises:
reading data: a data model is established based on Spark RDD/DataFrame, and the data in the HDFS files are read and converted into an RDD/DataFrame;
removing duplicate data: the data produced by the reading step are de-duplicated with purpose-designed functions or built-in functions;
removing noise data: a rule engine allows combined conditional judgment rules to be configured freely, reducing or removing noise data while avoiding the loss of valid information;
performing format conversion: data in different formats are converted into a unified format.
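A minimal PySpark sketch of the cleansing flow above follows. The rule engine is approximated here by a single combined filter condition; the column names (value, name, date), the source date pattern and the HDFS paths are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-cleansing").getOrCreate()

# read data: load the raw records from HDFS into a DataFrame (RDD/DataFrame model)
df = spark.read.parquet("hdfs:///raw/csv_source")

# remove duplicate data
df = df.dropDuplicates()

# remove noise data: a freely configurable rule-engine condition is stood in for
# by one combined filter over assumed columns
df = df.filter(F.col("value").isNotNull() & (F.length(F.col("name")) > 0))

# format conversion: unify the date representation across sources
df = df.withColumn("date", F.to_date(F.col("date"), "yyyy/MM/dd"))

df.write.mode("overwrite").parquet("hdfs:///clean/csv_source")
```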
Further, canonical representation includes normalization methods for numeric attributes and date attributes (a sketch follows this paragraph). The attribute values of date attributes are uniformly expressed as "XX year XX month XX day". Normalization of numeric attribute values mainly comprises two steps, numeric conversion and unit unification: numeric conversion converts thousands separators, Chinese numerals and similar forms in the original value entirely into Arabic numerals, and unit unification then performs the numeric conversion between different units of the same category;
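A minimal sketch of the canonical representation of numeric and date attribute values follows. It is an assumption-laden illustration rather than the patent's implementation: the Chinese-numeral handling is deliberately simplified, and the unified date form "XX year XX month XX day" is rendered as "YYYY年MM月DD日".

```python
import re

CN_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CN_UNITS = {"十": 10, "百": 100, "千": 1000, "万": 10000, "亿": 100000000}

def normalize_number(text):
    """Strip thousands separators and convert simple Chinese numerals to Arabic digits."""
    text = text.replace(",", "")
    if text.replace(".", "", 1).isdigit():
        return float(text)
    value = section = digit = 0
    for ch in text:
        if ch in CN_DIGITS:
            digit = CN_DIGITS[ch]
        elif ch in CN_UNITS:
            unit = CN_UNITS[ch]
            if unit >= 10000:                      # 万 / 亿 close a section
                value += (section + digit) * unit
                section = digit = 0
            else:
                section += (digit if digit else 1) * unit
                digit = 0
    return value + section + digit

def normalize_date(text):
    """Rewrite common date spellings into the unified year-month-day form."""
    m = re.match(r"(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})", text)
    if not m:
        return text
    return f"{m.group(1)}年{int(m.group(2)):02d}月{int(m.group(3)):02d}日"

print(normalize_number("1,250"))    # 1250.0
print(normalize_number("三千五百"))  # 3500
print(normalize_date("2019/5/7"))   # 2019年05月07日
```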
The block-based aggregation of entities based on entity names and entity attributes first partitions the entities into blocks, placing entities that may refer to the same real-world object into the same block; the entities from different sources within a block are then taken as candidate matching entity pairs and compared pairwise to decide whether entities from different data sources refer to the same object;
The blocking groups entities with a partition strategy based on entity names and entity attributes, and the grouping proceeds as follows (sketched in code below): first, the entity name is decomposed into a bigram sequence; second, each item of the bigram sequence serves as a key of an inverted index, and the entity is inserted into the inverted list of that key; then the entities under each key of the inverted index are partitioned again according to their attributes; finally, if entities from two different sources have two or more identical attributes and attribute values, they are placed into the same block.
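The blocking strategy just described can be sketched as follows, assuming each entity is represented as a dictionary with a source label, a name and an attribute map (these field names are illustrative). Character bigrams stand in for the bigram sequence over the entity name, and the attribute refinement keeps cross-source pairs sharing at least two identical attribute/value pairs; the pairs emitted here would then go to the entity alignment algorithm for similarity computation.

```python
from collections import defaultdict
from itertools import combinations

def bigrams(name):
    """Decompose an entity name into its character-bigram sequence."""
    return {name[i:i + 2] for i in range(len(name) - 1)} or {name}

def block_entities(entities):
    """Group entities into blocks and emit candidate cross-source pairs.

    Step 1: insert every entity into an inverted index keyed by its name bigrams.
    Step 2: within each key, keep pairs from different sources that share at
            least two identical attribute/value pairs (a pair may be emitted
            under several keys; downstream alignment can de-duplicate).
    """
    index = defaultdict(list)
    for ent in entities:
        for key in bigrams(ent["name"]):
            index[key].append(ent)

    candidate_pairs = []
    for key, bucket in index.items():
        for a, b in combinations(bucket, 2):
            if a["source"] == b["source"]:
                continue
            shared = set(a["attrs"].items()) & set(b["attrs"].items())
            if len(shared) >= 2:
                candidate_pairs.append((a["name"], b["name"], key))
    return candidate_pairs

entities = [
    {"source": "A", "name": "West Lake", "attrs": {"city": "Hangzhou", "type": "lake"}},
    {"source": "B", "name": "West Lake Scenic Area", "attrs": {"city": "Hangzhou", "type": "lake"}},
]
print(block_entities(entities))
```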
Another object of the present invention is to provide an information data processing terminal that implements the multi-source data aggregation sampling method in a big data environment.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the multi-source data aggregation sampling method in a big data environment.
Another object of the present invention is to provide a multi-source data aggregation sampling system in a big data environment that implements the multi-source data aggregation sampling method. The multi-source data aggregation sampling system in a big data environment comprises:
a data source acquisition module, connected to the central control module, for acquiring multiple original data sources, each original data source comprising a data source name and at least one associated domain;
a central control module, connected to the data source acquisition module, the preprocessing module, the policy list construction module, the data fusion module, the word segmentation module, the extraction module and the display module, for controlling the normal operation of all modules through a central processing unit;
a preprocessing module, connected to the central control module, for cleaning, identifying and removing redundancy from the acquired data sources through a data processing program;
a policy list construction module, connected to the central control module, for obtaining an original policy list from the original data sources through a construction program and sorting the original policies in the list to form an inter-source policy list;
a data fusion module, connected to the central control module, for performing fusion processing on data sets from different sources through a fusion program;
a word segmentation module, connected to the central control module, for segmenting the fused files to form a two-dimensional document-term frequency matrix;
an extraction module, connected to the central control module, for choosing target-oriented seed root-node keywords through a sampling program, taking a snowball sampling depth as input, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling on each term in a loop;
a display module, connected to the central control module, for displaying the acquired multi-source data on a display.
Another object of the present invention is to provide a cloud server for multi-source data aggregation sampling in a big data environment that implements the multi-source data aggregation sampling method.
The advantages and positive effects of the present invention are as follows:
Through the preprocessing module, the present invention preprocesses big data with the Spark big data processing framework, which not only reduces storage resources and network bandwidth and improves data storage efficiency, but also improves the quality of subsequent data analysis. The Spark framework keeps data resident in memory, and read/write speed is improved by building resilient distributed dataset (RDD) structures. Compute nodes are scheduled through Spark to complete distributed computing, achieving more efficient data preprocessing; the method is practical and widely applicable. Meanwhile, the data fusion module can extract the required data from knowledge bases of multiple fields and scales and fuse them into the complete data sources needed to support applications: the data of multiple sources are fused together, redundancy is merged, and useful information is expanded.
The present invention performs fusion processing on data sets from different sources through the data fusion module using a fusion program; when fusing entity data from multiple sources, canonical representation is performed on the attributes of each data source, including the mapping of synonymous attributes and the unified conversion of the numeric units of attribute values; block-based aggregation of entities is performed based on entity names and entity attributes; entities from different sources within the same block are taken as candidate entity pairs, the similarity between entities is computed with an entity alignment algorithm, the entity pairs describing the same real-world object in different sources are matched, equivalence links for the same entity are established across data sources, and entity attributes are merged; entities exclusive to a single data source are appended directly to the knowledge base;
the fused files are segmented by the word segmentation module to form a two-dimensional document-term frequency matrix;
s.t. X_i = X_i A_i + E_i,  i = 1, ..., K
where α is a coefficient greater than 0 and weights the error introduced by segmenting normal words and abnormal words; this is equivalent to the following model:
s.t. X_i = X_i S_i + E_i,  A_i = J_i,  A_i = S_i,  i = 1, ..., K
This solves the prior-art problem that large-scale heterogeneous data sources cannot be fused efficiently and accurately while keeping complexity low.
Description of the drawings
Fig. 1 is a flowchart of the multi-source data aggregation sampling method in a big data environment provided by an embodiment of the present invention.
Fig. 2 is a structural block diagram of the multi-source data aggregation sampling system in a big data environment provided by an embodiment of the present invention.
In the figures: 1, data source acquisition module; 2, central control module; 3, preprocessing module; 4, policy list construction module; 5, data fusion module; 6, word segmentation module; 7, extraction module; 8, display module.
Specific embodiment
In order to further understand the contents, features and effects of the present invention, the following embodiments are given and described in detail with reference to the accompanying drawings.
The structure of the present invention is explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the multi-source data aggregation sampling method in a big data environment provided by the present invention comprises the following steps:
S101: acquiring multiple original data sources through the data source acquisition module, each original data source comprising a data source name and at least one associated domain;
S102: the central control module cleaning, identifying and removing redundancy from the acquired data sources through the preprocessing module using a data processing program;
S103: the policy list construction module obtaining an original policy list from the original data sources using a construction program, and sorting the original policies in the original policy list to form an inter-source policy list;
S104: performing fusion processing on data sets from different sources through the data fusion module using a fusion program;
S105: segmenting the fused files through the word segmentation module to form a two-dimensional document-term frequency matrix;
S106: the extraction module choosing target-oriented seed root-node keywords using a sampling program, taking a snowball sampling depth as input, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling on each term in a loop;
S107: displaying the acquired multi-source data on a display through the display module.
In step S104, fusion processing is performed on data sets from different sources by the data fusion module using a fusion program; when fusing entity data from multiple sources, canonical representation is performed on the attributes of each data source, including the mapping of synonymous attributes and the unified conversion of the numeric units of attribute values; block-based aggregation of entities is performed based on entity names and entity attributes; entities from different sources within the same block are taken as candidate entity pairs, the similarity between entities is computed with an entity alignment algorithm, the entity pairs describing the same real-world object in different sources are matched, equivalence links for the same entity are established across data sources, and entity attributes are merged; entities exclusive to a single data source are appended directly to the knowledge base;
the fused files are segmented by the word segmentation module to form a two-dimensional document-term frequency matrix (a segmentation sketch follows the model below);
s.t. X_i = X_i A_i + E_i,  i = 1, ..., K
where α is a coefficient greater than 0 and weights the error introduced by segmenting normal words and abnormal words; this is equivalent to the following model:
s.t. X_i = X_i S_i + E_i,  A_i = J_i,  A_i = S_i,  i = 1, ..., K
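The segmentation step referenced above can be sketched as follows; jieba is assumed as the Chinese word segmenter purely for illustration, since the patent does not name one.

```python
from collections import Counter

import jieba  # assumed segmenter; any tokenizer producing terms would do

def term_frequency_matrix(documents):
    """Segment each fused document and build the document x term frequency matrix."""
    tokenized = [[t for t in jieba.cut(doc) if t.strip()] for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocab, tf = term_frequency_matrix(["洪水预警数据来自多个数据源", "气象数据与水文数据融合"])
print(vocab)
print(tf)
```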
As shown in Fig. 2, the multi-source data aggregation sampling system in a big data environment provided by an embodiment of the present invention comprises: a data source acquisition module 1, a central control module 2, a preprocessing module 3, a policy list construction module 4, a data fusion module 5, a word segmentation module 6, an extraction module 7, and a display module 8.
The data source acquisition module 1 is connected to the central control module 2 and is used for acquiring multiple original data sources, each original data source comprising a data source name and at least one associated domain;
the central control module 2 is connected to the data source acquisition module 1, the preprocessing module 3, the policy list construction module 4, the data fusion module 5, the word segmentation module 6, the extraction module 7 and the display module 8, and is used for controlling the normal operation of all modules through a central processing unit;
the preprocessing module 3 is connected to the central control module 2 and is used for cleaning, identifying and removing redundancy from the acquired data sources through a data processing program;
the policy list construction module 4 is connected to the central control module 2 and is used for obtaining an original policy list from the original data sources through a construction program and sorting the original policies in the list to form an inter-source policy list;
the data fusion module 5 is connected to the central control module 2 and is used for performing fusion processing on data sets from different sources through a fusion program;
the word segmentation module 6 is connected to the central control module 2 and is used for segmenting the fused files to form a two-dimensional document-term frequency matrix;
the extraction module 7 is connected to the central control module 2 and is used for choosing target-oriented seed root-node keywords through a sampling program, taking a snowball sampling depth as input, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling on each term in a loop;
the display module 8 is connected to the central control module 2 and is used for displaying the acquired multi-source data on a display.
The processing method of the preprocessing module 3 provided by the invention comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading the data to the distributed file system HDFS for storage;
(2) loading the data in the distributed file system HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion;
(3) for the cleaned data, identifying the different representations of the same entity, correctly recognizing all distinct entities, and merging the data of the same entity;
(4) removing redundant data using hash-based data de-duplication.
In step (1) provided by the invention, structured, semi-structured and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage;
the formats of the heterogeneous data sources include Txt, Csv, Xls, database data, jpg and mp4, and an interface standard is provided so that new data sources can be added;
for text files, including Txt and Csv, a text storage function is designed to read the text data from the files and store it into the distributed file system HDFS;
for Xls files, an Xls storage function is designed to read the Excel data from the file and store it into the distributed file system HDFS;
for database data, including MySQL and Oracle, the data are read from the database through the database access interface ODBC or JDBC and stored into the distributed file system HDFS;
for other types of file, including jpg and mp4, corresponding file storage functions are designed to read the data from the corresponding data sources and store them into the distributed file system HDFS.
In step (2) provided by the invention, data cleansing is based on the Spark big data processing framework: the data in the distributed file system HDFS are loaded into memory and denoised, de-duplicated and format-converted. The detailed process comprises:
reading data: a data model is established based on Spark RDD/DataFrame, and the data in the HDFS files are read and converted into an RDD/DataFrame;
removing duplicate data: the data produced by the reading step are de-duplicated with purpose-designed functions or built-in functions;
removing noise data: a rule engine allows combined conditional judgment rules to be configured freely, reducing or removing noise data while avoiding the loss of valid information;
performing format conversion: data in different formats are converted into a unified format.
The fusion method of the data fusion module 5 provided by the invention comprises:
1) when fusing entity data from multiple sources, performing canonical representation on the attributes of each data source, including the mapping of synonymous attributes and the unified conversion of the numeric units of attribute values;
2) performing block-based aggregation of entities based on entity names and entity attributes;
3) taking entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm, matching the entity pairs that describe the same real-world object in different sources, establishing equivalence links for the same entity across data sources, merging entity attributes, and appending entities exclusive to a single data source directly to the knowledge base.
Canonical representation provided by the invention includes normalization methods for numeric attributes and date attributes. The attribute values of date attributes are uniformly expressed as "XX year XX month XX day". Normalization of numeric attribute values mainly comprises two steps, numeric conversion and unit unification: numeric conversion converts thousands separators, Chinese numerals and similar forms in the original value entirely into Arabic numerals, and unit unification then performs the numeric conversion between different units of the same category.
The block-based aggregation of entities based on entity names and entity attributes provided by the invention first partitions the entities into blocks, placing entities that may refer to the same real-world object into the same block; the entities from different sources within a block are then taken as candidate matching entity pairs and compared pairwise to decide whether entities from different data sources refer to the same object.
The blocking provided by the invention groups entities with a partition strategy based on entity names and entity attributes, and the grouping proceeds as follows: first, the entity name is decomposed into a bigram sequence; second, each item of the bigram sequence serves as a key of an inverted index, and the entity is inserted into the inverted list of that key; then the entities under each key of the inverted index are partitioned again according to their attributes; finally, if entities from two different sources have two or more identical attributes and attribute values, they are placed into the same block.
In the above embodiments, implementation may be entirely or partly in software, hardware, firmware or any combination thereof. When implemented entirely or partly in the form of a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another by wire (such as coaxial cable, optical fibre or digital subscriber line (DSL)) or wirelessly (such as infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium (for example, a solid state disk (SSD)).
The above are only preferred embodiments of the present invention and are not intended to limit the present invention in any form. Any simple modification, equivalent change or variation made to the above embodiments in accordance with the technical essence of the present invention falls within the scope of the technical solution of the present invention.

Claims (10)

1. A multi-source data aggregation sampling method in a big data environment, characterized in that the multi-source data aggregation sampling method in a big data environment comprises:
performing fusion processing on data sets from different sources by a data fusion module using a fusion program; when fusing entity data from multiple sources, performing canonical representation on the attributes of each data source, including the mapping of synonymous attributes and the unified conversion of the numeric units of attribute values; performing block-based aggregation of entities based on entity names and entity attributes; taking entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm, matching the entity pairs that describe the same real-world object in different sources, establishing equivalence links for the same entity across data sources, and merging entity attributes; entities exclusive to a single data source are appended directly to the knowledge base;
segmenting the fused files by a word segmentation module to form a two-dimensional document-term frequency matrix;
s.t. X_i = X_i A_i + E_i,  i = 1, ..., K
where α is a coefficient greater than 0 and weights the error introduced by segmenting normal words and abnormal words; this is equivalent to the following model:
s.t. X_i = X_i S_i + E_i,  A_i = J_i,  A_i = S_i,  i = 1, ..., K
2. The multi-source data aggregation sampling method in a big data environment according to claim 1, characterized in that the multi-source data aggregation sampling method in a big data environment further comprises:
step 1: acquiring multiple original data sources through a data source acquisition module, each original data source comprising a data source name and at least one associated domain;
step 2: the central control module cleaning, identifying and removing redundancy from the acquired data sources through the preprocessing module using a data processing program;
step 3: a policy list construction module obtaining an original policy list from the original data sources using a construction program, and sorting the original policies in the original policy list to form an inter-source policy list;
step 4: performing fusion processing on data sets from different sources through the data fusion module using a fusion program;
step 5: segmenting the fused files through the word segmentation module to form a two-dimensional document-term frequency matrix;
step 6: the extraction module choosing target-oriented seed root-node keywords using a sampling program, taking a snowball sampling depth as input, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling on each term in a loop;
step 7: displaying the acquired multi-source data on a display through the display module.
3. The multi-source data aggregation sampling method in a big data environment according to claim 2, characterized in that the processing method of the preprocessing module comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading the data to the distributed file system HDFS for storage;
(2) loading the data in the distributed file system HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion;
(3) for the cleaned data, identifying the different representations of the same entity, correctly recognizing all distinct entities, and merging the data of the same entity;
(4) removing redundant data using hash-based data de-duplication.
4. The multi-source data aggregation sampling method in a big data environment according to claim 3, characterized in that, in step (1), structured, semi-structured and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage;
the formats of the heterogeneous data sources include Txt, Csv, Xls, database data, jpg and mp4, and an interface standard is provided so that new data sources can be added;
for text files, including Txt and Csv, a text storage function is designed to read the text data from the files and store it into the distributed file system HDFS;
for Xls files, an Xls storage function is designed to read the Excel data from the file and store it into the distributed file system HDFS;
for database data, including MySQL and Oracle, the data are read from the database through the database access interface ODBC or JDBC and stored into the distributed file system HDFS;
for other types of file, including jpg and mp4, corresponding file storage functions are designed to read the data from the corresponding data sources and store them into the distributed file system HDFS.
5. The multi-source data aggregation sampling method in a big data environment according to claim 3, characterized in that, in step (2), data cleansing is based on the Spark big data processing framework: the data in the distributed file system HDFS are loaded into memory and denoised, de-duplicated and format-converted; the detailed process comprises:
reading data: a data model is established based on Spark RDD/DataFrame, and the data in the HDFS files are read and converted into an RDD/DataFrame;
removing duplicate data: the data produced by the reading step are de-duplicated with purpose-designed functions or built-in functions;
removing noise data: a rule engine allows combined conditional judgment rules to be configured freely, reducing or removing noise data while avoiding the loss of valid information;
performing format conversion: data in different formats are converted into a unified format.
6. The multi-source data aggregation sampling method in a big data environment according to claim 5, characterized in that canonical representation includes normalization methods for numeric attributes and date attributes; the attribute values of date attributes are uniformly expressed as "XX year XX month XX day"; normalization of numeric attribute values mainly comprises two steps, numeric conversion and unit unification: numeric conversion converts thousands separators, Chinese numerals and similar forms in the original value entirely into Arabic numerals, and unit unification then performs the numeric conversion between different units of the same category;
the block-based aggregation of entities based on entity names and entity attributes first partitions the entities into blocks, placing entities that may refer to the same real-world object into the same block; the entities from different sources within a block are then taken as candidate matching entity pairs and compared pairwise to decide whether entities from different data sources refer to the same object;
the blocking groups entities with a partition strategy based on entity names and entity attributes, and the grouping proceeds as follows: first, the entity name is decomposed into a bigram sequence; second, each item of the bigram sequence serves as a key of an inverted index, and the entity is inserted into the inverted list of that key; then the entities under each key of the inverted index are partitioned again according to their attributes; finally, if entities from two different sources have two or more identical attributes and attribute values, they are placed into the same block.
7. An information data processing terminal implementing the multi-source data aggregation sampling method in a big data environment according to any one of claims 1 to 6.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the multi-source data aggregation sampling method in a big data environment according to any one of claims 1 to 6.
9. A multi-source data aggregation sampling system in a big data environment implementing the multi-source data aggregation sampling method in a big data environment according to claim 1, characterized in that the multi-source data aggregation sampling system in a big data environment comprises:
a data source acquisition module, connected to the central control module, for acquiring multiple original data sources, each original data source comprising a data source name and at least one associated domain;
a central control module, connected to the data source acquisition module, the preprocessing module, the policy list construction module, the data fusion module, the word segmentation module, the extraction module and the display module, for controlling the normal operation of all modules through a central processing unit;
a preprocessing module, connected to the central control module, for cleaning, identifying and removing redundancy from the acquired data sources through a data processing program;
a policy list construction module, connected to the central control module, for obtaining an original policy list from the original data sources through a construction program and sorting the original policies in the list to form an inter-source policy list;
a data fusion module, connected to the central control module, for performing fusion processing on data sets from different sources through a fusion program;
a word segmentation module, connected to the central control module, for segmenting the fused files to form a two-dimensional document-term frequency matrix;
an extraction module, connected to the central control module, for choosing target-oriented seed root-node keywords through a sampling program, taking a snowball sampling depth as input, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling on each term in a loop;
a display module, connected to the central control module, for displaying the acquired multi-source data on a display.
10. A cloud server for multi-source data aggregation sampling in a big data environment implementing the multi-source data aggregation sampling method in a big data environment according to claim 1.
CN201910373940.8A 2019-05-07 2019-05-07 Multi-source data aggregation sampling method and system in a big data environment Pending CN110147357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910373940.8A CN110147357A (en) 2019-05-07 2019-05-07 Multi-source data aggregation sampling method and system in a big data environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910373940.8A CN110147357A (en) 2019-05-07 2019-05-07 Multi-source data aggregation sampling method and system in a big data environment

Publications (1)

Publication Number Publication Date
CN110147357A true CN110147357A (en) 2019-08-20

Family

ID=67594665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910373940.8A Pending CN110147357A (en) 2019-05-07 2019-05-07 Multi-source data aggregation sampling method and system in a big data environment

Country Status (1)

Country Link
CN (1) CN110147357A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066534A (en) * 2017-03-02 2017-08-18 人谷科技(北京)有限责任公司 Multi-source data polymerization and system
CN107451282A (en) * 2017-08-09 2017-12-08 南京审计大学 A kind of multi-source data polymerization Sampling Strategies under the environment based on big data
CN107633075A (en) * 2017-09-22 2018-01-26 吉林大学 A kind of multi-source heterogeneous data fusion platform and fusion method
CN108470074A (en) * 2018-04-04 2018-08-31 河北北方学院 A kind of multi-source data under the environment based on big data polymerize sampling system
CN108647318A (en) * 2018-05-10 2018-10-12 北京航空航天大学 A kind of knowledge fusion method based on multi-source data
CN109165202A (en) * 2018-07-04 2019-01-08 华南理工大学 A kind of preprocess method of multi-source heterogeneous big data

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110515926A (en) * 2019-08-28 2019-11-29 国网天津市电力公司 Heterogeneous data source mass data carding method based on participle and semantic dependency analysis
CN110597879A (en) * 2019-09-17 2019-12-20 第四范式(北京)技术有限公司 Method and device for processing time series data
CN110597879B (en) * 2019-09-17 2022-01-14 第四范式(北京)技术有限公司 Method and device for processing time series data
CN112579770A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Knowledge graph generation method, device, storage medium and equipment
US11449514B2 (en) 2019-12-27 2022-09-20 Interset Software LLC Approximate aggregation queries
CN111431967A (en) * 2020-02-25 2020-07-17 天宇经纬(北京)科技有限公司 Multi-source heterogeneous data representation and distribution method and device based on business rules
CN111400569A (en) * 2020-03-13 2020-07-10 重庆特斯联智慧科技股份有限公司 Big data analysis method and system of multi-source aggregation structure
CN111581281A (en) * 2020-04-24 2020-08-25 贵州力创科技发展有限公司 Data fusion method and device
CN111639054A (en) * 2020-05-29 2020-09-08 中国人民解放军国防科技大学 Data coupling method, system and medium for ocean mode and data assimilation
CN111639054B (en) * 2020-05-29 2023-11-07 中国人民解放军国防科技大学 Data coupling method, system and medium for ocean mode and data assimilation
WO2021135323A1 (en) * 2020-07-31 2021-07-08 平安科技(深圳)有限公司 Method and apparatus for fusion processing of municipal multi-source heterogeneous data, and computer device
CN111966571A (en) * 2020-08-12 2020-11-20 重庆邮电大学 Time estimation cooperative processing method based on ARM-FPGA coprocessor heterogeneous platform
CN111966571B (en) * 2020-08-12 2023-05-12 重庆邮电大学 Time estimation cooperative processing method based on ARM-FPGA coprocessor heterogeneous platform
CN111708773A (en) * 2020-08-13 2020-09-25 江苏宝和数据股份有限公司 Multi-source scientific and creative resource data fusion method
CN111985578A (en) * 2020-09-02 2020-11-24 深圳壹账通智能科技有限公司 Multi-source data fusion method and device, computer equipment and storage medium
CN112214573A (en) * 2020-10-30 2021-01-12 数贸科技(北京)有限公司 Information search system, method, computing device, and computer storage medium
CN112486989A (en) * 2020-11-28 2021-03-12 河北省科学技术情报研究院(河北省科技创新战略研究院) Multi-source data granulation fusion and index classification and layering processing method
CN113315813A (en) * 2021-05-08 2021-08-27 重庆第二师范学院 Information exchange method and system for big data internet information chain system
CN113609715A (en) * 2021-10-11 2021-11-05 深圳奥雅设计股份有限公司 Multivariate model data fusion method and system under digital twin background
CN114896963A (en) * 2022-07-08 2022-08-12 北京百炼智能科技有限公司 Data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110147357A (en) Multi-source data aggregation sampling method and system in a big data environment
JP7170779B2 (en) Methods and systems for automatic intent mining, classification, and placement
CA2953969C (en) Interactive interfaces for machine learning model evaluations
CN103336790B (en) Hadoop-based fast neighborhood rough set attribute reduction method
CN109739939A (en) The data fusion method and device of knowledge mapping
US20150379430A1 (en) Efficient duplicate detection for machine learning data sets
US20150379429A1 (en) Interactive interfaces for machine learning model evaluations
CN109165202A (en) A kind of preprocess method of multi-source heterogeneous big data
CN113032579B (en) Metadata blood relationship analysis method and device, electronic equipment and medium
US20230139783A1 (en) Schema-adaptable data enrichment and retrieval
KR102219955B1 (en) Behavior-based platform system using the bigdata
CN110990467B (en) BIM model format conversion method and conversion system
CN111627552B (en) Medical streaming data blood-edge relationship analysis and storage method and device
CN114462623B (en) Data analysis method, system and platform based on edge calculation
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
Cong Personalized recommendation of film and television culture based on an intelligent classification algorithm
JP7347179B2 (en) Methods, devices and computer programs for extracting web page content
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN116522912B (en) Training method, device, medium and equipment for package design language model
Ravichandran Big Data processing with Hadoop: a review
JP2022168859A (en) Computer implementation method, computer program, and system (prediction query processing)
CN110543467B (en) Mode conversion method and device for time series database
US11514321B1 (en) Artificial intelligence system using unsupervised transfer learning for intra-cluster analysis
CN109086373B (en) Method for constructing fair link prediction evaluation system
Shouaib et al. Survey on iot-based big data analytics

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination