CN110147357A - Multi-source data aggregation sampling method and system in a big data environment - Google Patents
- Publication number
- CN110147357A (application CN201910373940.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- source
- entity
- module
- sampling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Abstract
The invention belongs to the technical field of big data and discloses a multi-source data aggregation sampling method and system in a big data environment. Multiple original data sources are collected, each comprising a data source name and at least one associated domain; the collected data sources are cleaned, identified, and stripped of redundancy; an original policy list is obtained from the original data sources using a construction program, and the original policies in the list are sorted to form an inter-source policy list; data sets from different sources are fused using a fusion program; the fused files are segmented to form a two-dimensional word-frequency matrix of the files; a balance verification value is set, and snowball sampling is performed iteratively on each word; the collected multi-source data are shown on a display. In the present invention, the preprocessing module schedules compute nodes through Spark to complete distributed computing, achieving more efficient data preprocessing; the method is practical and widely applicable.
Description
Technical field
The invention belongs to the technical field of big data, and more particularly relates to a multi-source data aggregation sampling method and system in a big data environment.
Background art
Multi-source data fusion technology refers to integrating all of the information obtained through investigation and analysis by appropriate means, evaluating the information in a unified way, and finally obtaining unified information. The aim of this research is to integrate diverse data, draw on the characteristics of the different data sources, and extract from them information that is unified, better, and richer than any single source. However, in existing multi-source data aggregation sampling under big data environments, preprocessing research on structured, semi-structured, and unstructured data is insufficient; systems usually contain only a data acquisition module and a data cleaning module, and the cleaning methods are simplistic and cannot satisfy user needs well. Meanwhile, during data fusion no open linked data sets are used as prior knowledge, so large-scale heterogeneous data sources cannot be fused efficiently and accurately while reducing complexity.
In summary, the problems with the prior art are:
In existing multi-source data aggregation sampling under big data environments, preprocessing of structured, semi-structured, and unstructured data is insufficiently studied; systems usually include only a data acquisition module and a data cleaning module, and the cleaning methods are simplistic and cannot satisfy user needs well. Meanwhile, during data fusion no open linked data sets are used as prior knowledge, so large-scale heterogeneous data sources cannot be fused efficiently and accurately while reducing complexity.
Summary of the invention
In view of the problems in the prior art, the present invention provides a multi-source data aggregation sampling method and system in a big data environment.
The invention is realized as follows: a multi-source data aggregation sampling method in a big data environment, the method comprising:
fusing data sets from different sources through a data fusion module using a fusion program; when fusing entity data from multiple sources, canonically representing the attributes of each data source, which includes mapping synonymous attributes and unifying the numerical units of attribute values; performing block aggregation of entities based on entity names and entity attributes; treating entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm, matching entity pairs that describe the same real-world object in different sources, establishing equivalence links for the same entity between different data sources, merging entity attributes, and appending entities exclusive to one data source directly into the knowledge base;
segmenting the fused files through a word segmentation module to form a two-dimensional word-frequency matrix of the files;
s.t. X_i = X_i A_i + E_i, i = 1, ..., K
where α is a coefficient greater than 0, used to measure the error introduced by segmenting normal and abnormal words;
which is equivalent to the model:
s.t. X_i = X_i S_i + E_i,
A_i = J_i,
A_i = S_i, i = 1, ..., K
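The two-dimensional word-frequency matrix named above is simply a file-by-word count table built from the segmented files. A minimal sketch of that construction (the file names and word lists are hypothetical, and plain list-of-words input stands in for the output of the word segmentation module):

```python
def build_frequency_matrix(files):
    """Build a 2-D file-by-word frequency matrix from segmented files.

    files: dict mapping file name -> list of words produced by segmentation.
    Returns the sorted vocabulary and one row of counts per file.
    """
    vocab = sorted({w for words in files.values() for w in words})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = {}
    for name, words in files.items():
        row = [0] * len(vocab)
        for w in words:
            row[index[w]] += 1   # count each occurrence of the word in this file
        matrix[name] = row
    return vocab, matrix

docs = {
    "f1.txt": ["data", "fusion", "data"],
    "f2.txt": ["fusion", "sampling"],
}
vocab, matrix = build_frequency_matrix(docs)
print(vocab)             # ['data', 'fusion', 'sampling']
print(matrix["f1.txt"])  # [2, 1, 0]
```

Each matrix row is a word-frequency vector for one file, which is the form consumed by the model above.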
Further, the multi-source data aggregation sampling method in a big data environment further comprises:
Step 1: collecting multiple original data sources through a data source acquisition module, each original data source comprising a data source name and at least one associated domain;
Step 2: the central control module cleaning, identifying, and removing redundancy from the collected data sources through a preprocessing module using a data processing program;
Step 3: obtaining an original policy list from the original data sources through a policy list construction module using a construction program, and sorting the original policies in the list to form an inter-source policy list;
Step 4: fusing data sets from different sources through a data fusion module using a fusion program;
Step 5: segmenting the fused files through a word segmentation module to form a two-dimensional word-frequency matrix of the files;
Step 6: selecting seed root-node keywords guided by the data target through a sampling module using a sampling program, inputting a snowball sampling depth, setting a balance verification value on the basis of the seed root-node data, and performing snowball sampling on each word iteratively;
Step 7: showing the collected multi-source data on a display through a display module.
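Step 6 can be read as a breadth-first expansion: starting from the seed root-node keywords, related words are repeatedly pulled in up to the given sampling depth, with the balance verification value capping how many new words each word may contribute. The patent does not specify the word relation structure, so the sketch below assumes a hypothetical co-occurrence graph:

```python
from collections import deque

def snowball_sample(graph, seeds, depth, balance_cap):
    """Breadth-first snowball sampling from seed root-node keywords.

    graph: dict mapping word -> list of related (co-occurring) words
    depth: maximum expansion depth from the seeds (the sampling depth)
    balance_cap: at most this many unseen neighbors are taken per word,
                 standing in for the balance verification value.
    """
    sampled = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        word, d = frontier.popleft()
        if d >= depth:
            continue                       # depth limit reached for this branch
        taken = 0
        for nbr in graph.get(word, []):
            if nbr not in sampled and taken < balance_cap:
                sampled.add(nbr)           # snowball: pull the neighbor in
                frontier.append((nbr, d + 1))
                taken += 1
    return sampled

word_graph = {
    "data":   ["fusion", "source", "sampling"],
    "fusion": ["entity", "attribute"],
    "source": ["heterogeneous"],
}
print(sorted(snowball_sample(word_graph, ["data"], depth=2, balance_cap=2)))
# ['attribute', 'data', 'entity', 'fusion', 'heterogeneous', 'source']
```

With `balance_cap=2`, the word "sampling" is never drawn even though it neighbors the seed, illustrating how the cap keeps each expansion step balanced.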
Further, the processing method of the preprocessing module comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading it to the distributed file system HDFS for storage;
(2) loading the data in HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion;
(3) for the cleaned data, identifying the different representations of the same entity, correctly distinguishing all distinct entities, and merging the data of the same entity;
(4) removing redundant data using a hash-based data deduplication technique.
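The hash-based deduplication of step (4) keeps only the first record seen for each hash of the record's content. The patent applies this inside Spark; the following is a minimal single-machine sketch in Python, with hypothetical record data:

```python
import hashlib

def dedup_by_hash(records):
    """Keep the first occurrence of each record, keyed by a hash of its content.

    Hashing gives a fixed-size key, so large records can be compared cheaply;
    a collision check against the stored record could be added if exact
    guarantees are required.
    """
    seen = set()
    unique = []
    for rec in records:
        # Serialize the record deterministically (sorted fields) before hashing,
        # so field order does not affect the hash.
        key = hashlib.sha256(repr(sorted(rec.items())).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"name": "Zhang San", "city": "Beijing"},
    {"city": "Beijing", "name": "Zhang San"},   # same content, different field order
    {"name": "Li Si", "city": "Shanghai"},
]
print(len(dedup_by_hash(rows)))  # 2: the reordered duplicate hashes identically
```

In a distributed setting the same idea maps naturally onto a reduce-by-key over the hash values.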
Further, in step (1), structured, semi-structured, and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage.
The formats of the heterogeneous data sources include Txt, Csv, Xsl, database data, jpg, and mp4, and an interface standard is provided for extending to new data sources.
For text files, including Txt and Csv, a text storage function is designed to read the text data from the files and store it in the distributed file system HDFS.
For Xsl files, an Xsl storage function is designed to read the Excel data from the Excel files and store it in the distributed file system HDFS.
For database data, including MySQL and Oracle, the data is read from the database through the ODBC or JDBC database access interface and stored in the distributed file system HDFS.
For other types of files, including jpg and mp4, corresponding file storage functions are designed to read the data from the corresponding data sources and store it in the distributed file system HDFS.
Further, in step (2), data cleaning means loading the data in the distributed file system HDFS into memory based on the Spark big data processing framework and performing denoising, deduplication, and format conversion. The detailed process comprises:
Reading data: establishing a data model based on Spark RDD/DataFrame and reading the data from the HDFS files, converting it into RDD/DataFrame;
Removing duplicate data: removing duplicates from the data produced by the reading step, through custom functions or built-in functions;
Removing noise data: using a rule engine to allow free configuration of combined conditional judgment rules, reducing or removing noise data while avoiding the loss of valid information;
Performing format conversion: converting data of different formats into a unified format.
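The cleaning process above (read, deduplicate, apply configurable noise rules, convert format) can be sketched on plain Python lists; in the patent the same steps run over Spark RDD/DataFrame partitions. The noise rules below are illustrative stand-ins for the freely configurable rule engine:

```python
def clean(records, noise_rules):
    """Deduplicate, drop records matching any noise rule, and unify format."""
    # Removing duplicate data: exact-duplicate removal via a set of keys.
    seen, deduped = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(rec)
    # Removing noise data: each rule is a predicate; a record is treated as
    # noise if any configured rule fires (combined conditional judgment).
    kept = [r for r in deduped if not any(rule(r) for rule in noise_rules)]
    # Format conversion: unify field naming (here: strip and lowercase keys).
    return [{k.strip().lower(): v for k, v in r.items()} for r in kept]

rules = [
    lambda r: not r.get("Name"),       # hypothetical rule: empty name is noise
    lambda r: r.get("Age", 0) < 0,     # hypothetical rule: negative age is noise
]
data = [
    {"Name": "Wang", "Age": 30},
    {"Name": "Wang", "Age": 30},       # duplicate
    {"Name": "", "Age": 12},           # noise
]
print(clean(data, rules))  # [{'name': 'Wang', 'age': 30}]
```

Because the rules are plain predicates, they can be added, removed, and combined freely, which mirrors the rule engine's free configuration of judgment conditions.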
Further, the canonical representation comprises normalization methods for numeric attributes and date attributes. The attribute values of date attributes are uniformly expressed as year XX, month XX, day XX. Normalization of the attribute values of numeric attributes mainly comprises one or both of value conversion and unit unification: value conversion refers to converting thousand separators, Chinese capital numerals, and similar forms in the original value entirely into Arabic numerals, and unit unification performs value conversion between different units of the same category.
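A minimal sketch of the two normalization steps just described: thousand separators are stripped, Chinese capital numerals are mapped digit by digit to Arabic numerals, and a small length-unit table illustrates unit unification. The digit table and unit list are illustrative only; real Chinese numeral parsing (with positional characters such as 拾, 佰, 仟) needs more logic than this per-character mapping:

```python
# Digit-by-digit mapping for Chinese capital numerals; positional characters
# (拾, 佰, 仟, 万, ...) are NOT handled by this simple table.
CAPITAL_DIGITS = {"零": "0", "壹": "1", "贰": "2", "叁": "3", "肆": "4",
                  "伍": "5", "陆": "6", "柒": "7", "捌": "8", "玖": "9"}

UNIT_FACTORS = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}  # illustrative

def canonicalize_number(text):
    """Strip thousand separators and map capital digits to Arabic numerals."""
    text = text.replace(",", "")
    return "".join(CAPITAL_DIGITS.get(ch, ch) for ch in text)

def unify_unit(value, unit, target="m"):
    """Convert a numeric value between units of the same category (length here)."""
    return value * UNIT_FACTORS[unit] / UNIT_FACTORS[target]

print(canonicalize_number("1,234,567"))  # 1234567
print(canonicalize_number("壹玖捌肆"))    # 1984
print(unify_unit(250.0, "cm"))           # 2.5
```

After this pass, attribute values from different sources compare on equal terms, which is what the block aggregation below relies on.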
The block aggregation of entities based on entity names and entity attributes first requires blocking the entities: entities that may refer to the same object are placed in the same block, and entities from different sources within the same block are then treated as candidate matching pairs, comparing pairwise whether entities from different data sources refer to the same object.
The blocking groups entities using a partition strategy based on entity names and entity attributes. The specific process of group aggregation is as follows: first, according to the entity name, each entity name is decomposed into a sequence of bigrams; second, each item in a bigram sequence serves as a key of an inverted index, and the entity is inserted into the inverted list of that item; then, the entities under each key of the inverted index are divided again according to entity attributes; finally, if entities from two different sources have two or more identical attributes and attribute values, they are subdivided into the same block.
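The blocking procedure above can be sketched compactly: decompose each entity name into character bigrams, build an inverted index from bigram to entities, and refine each bucket by shared attribute/value pairs. The entities and the two-shared-attribute threshold below follow the text; the data itself is made up:

```python
from collections import defaultdict
from itertools import combinations

def name_bigrams(name):
    """Decompose an entity name into its sequence of character bigrams."""
    return [name[i:i + 2] for i in range(len(name) - 1)]

def block_entities(entities, min_shared=2):
    """Form candidate entity pairs: same bigram bucket, then attribute check.

    entities: list of (source, name, attrs) tuples, attrs being a dict.
    A pair from different sources lands in the same block when the names
    share at least one bigram and at least min_shared attribute/value
    pairs agree.
    """
    inverted = defaultdict(list)                  # bigram -> entity indices
    for idx, (_, name, _) in enumerate(entities):
        for bg in set(name_bigrams(name)):
            inverted[bg].append(idx)
    pairs = set()
    for bucket in inverted.values():
        for i, j in combinations(bucket, 2):
            src_i, _, attrs_i = entities[i]
            src_j, _, attrs_j = entities[j]
            shared = sum(1 for k, v in attrs_i.items() if attrs_j.get(k) == v)
            if src_i != src_j and shared >= min_shared:
                pairs.add((min(i, j), max(i, j)))
    return pairs

ents = [
    ("src_a", "apple inc", {"country": "US", "industry": "tech"}),
    ("src_b", "apple incorporated", {"country": "US", "industry": "tech"}),
    ("src_b", "snapple", {"country": "US", "industry": "beverage"}),
]
print(block_entities(ents))  # {(0, 1)}
```

"snapple" shares name bigrams with the first entity but only one attribute value, so the attribute refinement keeps it out of the block, exactly the role the refinement step plays in the text.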
Another object of the present invention is to provide an information data processing terminal implementing the multi-source data aggregation sampling method in a big data environment.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to execute the multi-source data aggregation sampling method in a big data environment.
Another object of the present invention is to provide a multi-source data aggregation sampling system in a big data environment implementing the multi-source data aggregation sampling method in a big data environment, the system comprising:
a data source acquisition module, connected to the central control module, for collecting multiple original data sources, each original data source comprising a data source name and at least one associated domain;
a central control module, connected to the data source acquisition module, the preprocessing module, the policy list construction module, the data fusion module, the word segmentation module, the sampling module, and the display module, for controlling the normal operation of the modules through a central processing unit;
a preprocessing module, connected to the central control module, for cleaning, identifying, and removing redundancy from the collected data sources through a data processing program;
a policy list construction module, connected to the central control module, for obtaining an original policy list from the original data sources through a construction program and sorting the original policies in the list to form an inter-source policy list;
a data fusion module, connected to the central control module, for fusing data sets from different sources through a fusion program;
a word segmentation module, connected to the central control module, for segmenting the fused files to form a two-dimensional word-frequency matrix of the files;
a sampling module, connected to the central control module, for selecting seed root-node keywords guided by the data target through a sampling program, inputting a snowball sampling depth, setting a balance verification value on the basis of the seed root-node data, and performing snowball sampling on each word iteratively;
a display module, connected to the central control module, for showing the collected multi-source data on a display.
Another object of the present invention is to provide a multi-source data aggregation sampling cloud server implementing the multi-source data aggregation sampling method in a big data environment.
The advantages and positive effects of the present invention are as follows:
Through the preprocessing module, the present invention preprocesses big data using the Spark big data processing framework, which not only reduces storage resources and network bandwidth and improves data storage efficiency, but also improves the quality of subsequent data analysis. The Spark framework keeps data resident in memory and improves read/write speed by building resilient distributed dataset (RDD) structures. Compute nodes are scheduled by Spark to complete distributed computing, achieving more efficient data preprocessing; the method is practical and widely applicable. Meanwhile, through the data fusion module, the required data can be extracted from knowledge bases of multiple different fields and scales and fused into a complete data source supporting applications; the data of multiple data sources is fused, redundancy is merged, and useful information is expanded.
The present invention fuses data sets from different sources through the data fusion module using a fusion program. When fusing entity data from multiple sources, the attributes of each data source are canonically represented, which includes mapping synonymous attributes and unifying the numerical units of attribute values; block aggregation of entities is performed based on entity names and entity attributes; entities from different sources within the same block are treated as candidate entity pairs, the similarity between entities is computed with an entity alignment algorithm, entity pairs describing the same real-world object in different sources are matched, equivalence links for the same entity are established between different data sources, entity attributes are merged, and entities exclusive to one data source are appended directly into the knowledge base.
The fused files are segmented through the word segmentation module to form a two-dimensional word-frequency matrix of the files:
s.t. X_i = X_i A_i + E_i, i = 1, ..., K
where α is a coefficient greater than 0, used to measure the error introduced by segmenting normal and abnormal words;
which is equivalent to the model:
s.t. X_i = X_i S_i + E_i,
A_i = J_i,
A_i = S_i, i = 1, ..., K
This solves the prior-art problem that large-scale heterogeneous data sources cannot be fused efficiently and accurately while reducing complexity.
Brief description of the drawings
Fig. 1 is a flow chart of the multi-source data aggregation sampling method in a big data environment provided by an embodiment of the present invention.
Fig. 2 is a structural block diagram of the multi-source data aggregation sampling system in a big data environment provided by an embodiment of the present invention.
In the figures: 1, data source acquisition module; 2, central control module; 3, preprocessing module; 4, policy list construction module; 5, data fusion module; 6, word segmentation module; 7, sampling module; 8, display module.
Detailed description of the embodiments
To further explain the content, features, and effects of the present invention, the following embodiments are given in detail with reference to the accompanying drawings.
The structure of the invention is explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the multi-source data aggregation sampling method in a big data environment provided by the present invention comprises the following steps:
S101: collecting multiple original data sources through a data source acquisition module, each original data source comprising a data source name and at least one associated domain;
S102: the central control module cleaning, identifying, and removing redundancy from the collected data sources through a preprocessing module using a data processing program;
S103: obtaining an original policy list from the original data sources through a policy list construction module using a construction program, and sorting the original policies in the list to form an inter-source policy list;
S104: fusing data sets from different sources through a data fusion module using a fusion program;
S105: segmenting the fused files through a word segmentation module to form a two-dimensional word-frequency matrix of the files;
S106: selecting seed root-node keywords guided by the data target through a sampling module using a sampling program, inputting a snowball sampling depth, setting a balance verification value on the basis of the seed root-node data, and performing snowball sampling on each word iteratively;
S107: showing the collected multi-source data on a display through a display module.
In step S104, data sets from different sources are fused through the data fusion module using the fusion program. When fusing entity data from multiple sources, the attributes of each data source are canonically represented, which includes mapping synonymous attributes and unifying the numerical units of attribute values; block aggregation of entities is performed based on entity names and entity attributes; entities from different sources within the same block are treated as candidate entity pairs, the similarity between entities is computed with an entity alignment algorithm, entity pairs describing the same real-world object in different sources are matched, equivalence links for the same entity are established between different data sources, entity attributes are merged, and entities exclusive to one data source are appended directly into the knowledge base.
The fused files are segmented through the word segmentation module to form a two-dimensional word-frequency matrix of the files:
s.t. X_i = X_i A_i + E_i, i = 1, ..., K
where α is a coefficient greater than 0, used to measure the error introduced by segmenting normal and abnormal words;
which is equivalent to the model:
s.t. X_i = X_i S_i + E_i,
A_i = J_i,
A_i = S_i, i = 1, ..., K
As shown in Fig. 2, the multi-source data aggregation sampling system in a big data environment provided by an embodiment of the present invention comprises: a data source acquisition module 1, a central control module 2, a preprocessing module 3, a policy list construction module 4, a data fusion module 5, a word segmentation module 6, a sampling module 7, and a display module 8.
The data source acquisition module 1 is connected to the central control module 2 and collects multiple original data sources, each original data source comprising a data source name and at least one associated domain.
The central control module 2 is connected to the data source acquisition module 1, the preprocessing module 3, the policy list construction module 4, the data fusion module 5, the word segmentation module 6, the sampling module 7, and the display module 8, and controls the normal operation of the modules through a central processing unit.
The preprocessing module 3 is connected to the central control module 2 and cleans, identifies, and removes redundancy from the collected data sources through a data processing program.
The policy list construction module 4 is connected to the central control module 2; it obtains an original policy list from the original data sources through a construction program and sorts the original policies in the list to form an inter-source policy list.
The data fusion module 5 is connected to the central control module 2 and fuses data sets from different sources through a fusion program.
The word segmentation module 6 is connected to the central control module 2 and segments the fused files to form a two-dimensional word-frequency matrix of the files.
The sampling module 7 is connected to the central control module 2; through a sampling program it selects seed root-node keywords guided by the data target, inputs a snowball sampling depth, sets a balance verification value on the basis of the seed root-node data, and performs snowball sampling on each word iteratively.
The display module 8 is connected to the central control module 2 and shows the collected multi-source data on a display.
The processing method of the preprocessing module 3 provided by the invention comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading it to the distributed file system HDFS for storage;
(2) loading the data in HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion;
(3) for the cleaned data, identifying the different representations of the same entity, correctly distinguishing all distinct entities, and merging the data of the same entity;
(4) removing redundant data using a hash-based data deduplication technique.
In step (1) provided by the invention, structured, semi-structured, and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage.
The formats of the heterogeneous data sources include Txt, Csv, Xsl, database data, jpg, and mp4, and an interface standard is provided for extending to new data sources.
For text files, including Txt and Csv, a text storage function is designed to read the text data from the files and store it in the distributed file system HDFS.
For Xsl files, an Xsl storage function is designed to read the Excel data from the Excel files and store it in the distributed file system HDFS.
For database data, including MySQL and Oracle, the data is read from the database through the ODBC or JDBC database access interface and stored in the distributed file system HDFS.
For other types of files, including jpg and mp4, corresponding file storage functions are designed to read the data from the corresponding data sources and store it in the distributed file system HDFS.
In step (2) provided by the invention, data cleaning means loading the data in the distributed file system HDFS into memory based on the Spark big data processing framework and performing denoising, deduplication, and format conversion. The detailed process comprises:
Reading data: establishing a data model based on Spark RDD/DataFrame and reading the data from the HDFS files, converting it into RDD/DataFrame;
Removing duplicate data: removing duplicates from the data produced by the reading step, through custom functions or built-in functions;
Removing noise data: using a rule engine to allow free configuration of combined conditional judgment rules, reducing or removing noise data while avoiding the loss of valid information;
Performing format conversion: converting data of different formats into a unified format.
The fusion method of the data fusion module 5 provided by the invention comprises:
1) when fusing entity data from multiple sources, canonically representing the attributes of each data source, which includes mapping synonymous attributes and unifying the numerical units of attribute values;
2) performing block aggregation of entities based on entity names and entity attributes;
3) treating entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm, matching entity pairs describing the same real-world object in different sources, establishing equivalence links for the same entity between different data sources, merging entity attributes, and appending entities exclusive to one data source directly into the knowledge base.
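In step 3), candidate entity pairs within a block are scored by an entity alignment algorithm; the patent does not name the similarity measure, so the sketch below uses Jaccard similarity over attribute/value pairs as an illustrative stand-in, and merges the attributes of a matched pair as the text describes. The entities and threshold are hypothetical:

```python
def jaccard_similarity(attrs_a, attrs_b):
    """Jaccard similarity over attribute/value pairs (an illustrative measure)."""
    a, b = set(attrs_a.items()), set(attrs_b.items())
    return len(a & b) / len(a | b) if a | b else 0.0

def align_and_merge(entity_a, entity_b, threshold=0.5):
    """If two entities are similar enough, merge their attributes.

    Returns the merged attribute dict (entity_a's values win on conflict),
    or None when the pair is judged not to describe the same object.
    """
    if jaccard_similarity(entity_a, entity_b) < threshold:
        return None
    merged = dict(entity_b)
    merged.update(entity_a)   # union of attributes from both sources
    return merged

e1 = {"name": "beijing", "country": "CN", "type": "city"}
e2 = {"name": "beijing", "country": "CN", "population": "21M"}
print(align_and_merge(e1, e2))
# {'name': 'beijing', 'country': 'CN', 'population': '21M', 'type': 'city'}
```

The merged record carries attributes from both sources, which is the attribute-merging step that follows a successful match; an unmatched entity would instead be appended to the knowledge base on its own.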
The canonical representation provided by the invention comprises normalization methods for numeric attributes and date attributes. The attribute values of date attributes are uniformly expressed as year XX, month XX, day XX. Normalization of the attribute values of numeric attributes mainly comprises one or both of value conversion and unit unification: value conversion refers to converting thousand separators, Chinese capital numerals, and similar forms in the original value entirely into Arabic numerals, and unit unification performs value conversion between different units of the same category.
The block aggregation of entities based on entity names and entity attributes provided by the invention first requires blocking the entities: entities that may refer to the same object are placed in the same block, and entities from different sources within the same block are then treated as candidate matching pairs, comparing pairwise whether entities from different data sources refer to the same object.
The blocking provided by the invention groups entities using a partition strategy based on entity names and entity attributes. The specific process of group aggregation is as follows: first, according to the entity name, each entity name is decomposed into a sequence of bigrams; second, each item in a bigram sequence serves as a key of an inverted index, and the entity is inserted into the inverted list of that item; then, the entities under each key of the inverted index are divided again according to entity attributes; finally, if entities from two different sources have two or more identical attributes and attribute values, they are subdivided into the same block.
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly in the form of a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The above are only preferred embodiments of the present invention and are not intended to limit the invention in any form. Any simple modification, equivalent variation, or alteration of the above embodiments made according to the technical essence of the invention falls within the scope of the technical solution of the present invention.
Claims (10)
1. A multi-source data aggregation sampling method in a big data environment, characterized in that the multi-source data aggregation sampling method in a big data environment comprises:
performing fusion processing on data sets from different sources by a data fusion module using a fusion program; when merging entity data from multiple sources, performing canonical representation on the attributes of each data source respectively, the canonical representation comprising the mapping of synonymous attributes and the unified conversion of the numerical units of attribute values; performing block aggregation on entities based on entity names and entity attributes; taking entities from different sources within the same block as candidate entity pairs, computing the similarity between entities with an entity alignment algorithm to obtain matched entity pairs that describe the same objective-world object in different sources, establishing equivalence links for the same entity across different data sources, and merging entity attributes, while entities exclusive to a single data source are appended directly to the knowledge base;
segmenting the fused files by a word segmentation module to form a two-dimensional file-word frequency matrix:
s.t. X_i = X_i·A_i + E_i,  i = 1, …, K
where α is a coefficient greater than 0 that weights the error introduced by segmenting normal words and abnormal words; this is equivalent to the following model:
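The two-dimensional file-word frequency matrix formed by the word segmentation module in the claim above can be sketched as follows. This is a minimal illustration with hypothetical file names; pre-segmented word lists stand in for the output of a real Chinese word segmenter:

```python
from collections import Counter

def build_frequency_matrix(files):
    """Build a two-dimensional file-word frequency matrix.

    `files` maps a file name to its list of segmented words.
    Returns (vocabulary, matrix) where matrix[i][j] is the count of
    vocabulary[j] in the i-th file (files taken in sorted name order).
    """
    vocab = sorted({w for words in files.values() for w in words})
    matrix = []
    for name in sorted(files):
        counts = Counter(files[name])
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

vocab, m = build_frequency_matrix({
    "doc1": ["data", "fusion", "data"],
    "doc2": ["fusion", "sampling"],
})
```

Each row of `m` is one file's word-frequency vector, which is the form later modules consume.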
2. The multi-source data aggregation sampling method in a big data environment according to claim 1, characterized in that the multi-source data aggregation sampling method in a big data environment further comprises:
Step 1: acquiring a plurality of original data sources by a data source acquisition module, each original data source comprising a data source name and at least one associated domain;
Step 2: cleaning, identifying, and removing redundancy from the acquired data sources by the central control module through a preprocessing module using a data processing program;
Step 3: obtaining an original policy list by a policy list construction module using a construction program according to the original data sources, and sorting the original policies in the original policy list to form an inter-data-source policy list;
Step 4: performing fusion processing on data sets from different sources by the data fusion module using the fusion program;
Step 5: segmenting the fused files by the word segmentation module to form the two-dimensional file-word frequency matrix;
Step 6: selecting seed root-node keywords guided by the data target by a sampling module using a sampling program, inputting the snowball-sampling depth, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling by iterating over each word;
Step 7: displaying the acquired multi-source data on a display by a display module.
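The snowball sampling of Step 6 can be sketched as follows. This is a minimal illustration over a toy keyword graph; `balance` is a stand-in for the balance check value (here read as a per-word expansion cap), and all names are hypothetical:

```python
def snowball_sample(graph, seeds, depth, balance=2):
    """Snowball sampling from seed root-node keywords.

    `graph` maps a word to its related words, `depth` is the
    snowball-sampling depth, and `balance` caps how many neighbours
    are followed per word in each round.
    """
    sampled = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for word in frontier:               # iterate over each word in this round
            for neighbour in graph.get(word, [])[:balance]:
                if neighbour not in sampled:
                    sampled.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return sampled

graph = {
    "big data": ["sampling", "fusion", "hdfs"],
    "sampling": ["snowball"],
    "fusion": ["entity"],
}
result = snowball_sample(graph, ["big data"], depth=2)
```

With `balance=2`, the third neighbour of "big data" is never followed, which is the kind of growth control the balance check value suggests.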
3. The multi-source data aggregation sampling method in a big data environment according to claim 2, characterized in that the processing method of the preprocessing module comprises:
(1) extracting data from heterogeneous data sources according to preset conditions and uploading the data to the distributed file system HDFS for storage;
(2) loading the data in the distributed file system HDFS into memory using the Spark framework, removing duplicate data and noise data, and performing format conversion operations;
(3) for the cleaned data, identifying the different representations of the same entity, and merging the data of all representations correctly identified as the same entity;
(4) removing redundant data using a hash-value-based data deduplication technique.
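The hash-value-based deduplication of step (4) can be sketched as follows. This is a minimal illustration; the choice of SHA-256 over a canonical string form of each record is an assumption, since the claim does not fix the hash function:

```python
import hashlib

def deduplicate(records):
    """Remove redundant records by comparing hash values.

    Each record (a dict) is reduced to a SHA-256 digest of its
    key-sorted string form, so key order does not affect the hash;
    records whose digest has already been seen are dropped.
    """
    seen = set()
    unique = []
    for record in records:
        canonical = repr(sorted(record.items())).encode("utf-8")
        digest = hashlib.sha256(canonical).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

rows = [
    {"name": "Zhang San", "city": "Tianjin"},
    {"city": "Tianjin", "name": "Zhang San"},  # redundant copy, different key order
    {"name": "Li Si", "city": "Beijing"},
]
unique = deduplicate(rows)
```

Comparing fixed-length digests instead of whole records keeps the seen-set small when records are large.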
4. The multi-source data aggregation sampling method in a big data environment according to claim 3, characterized in that in step (1), structured, semi-structured, and unstructured big data are read from the heterogeneous data sources and uploaded to the distributed file system HDFS for storage;
the formats of the heterogeneous data sources include Txt, Csv, Xsl, database data, jpg, and mp4, and a standard interface is provided so that new data sources can be added;
for text files, including Txt and Csv, text data are read from the text file by designing a text storage function and stored in the distributed file system HDFS;
for Xsl files, excel data are read from the Excel file by designing an Xsl storage function and stored in the distributed file system HDFS;
for database data, including MySQL and Oracle, data are read from the database through the database access interface ODBC or JDBC and stored in the distributed file system HDFS;
for other types of files, including jpg and mp4, the data in the corresponding data sources are read by designing corresponding file storage functions and stored in the distributed file system HDFS.
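The per-format storage functions of this claim can be sketched as a dispatch table. This is a hypothetical pure-Python stand-in: real readers would go through Spark, ODBC/JDBC, or binary loaders and write to HDFS, so each reader here just returns a tag to demonstrate the routing and the standard extension interface:

```python
# Hypothetical registry mapping a file extension to its reader function.
READERS = {
    "txt": lambda path: ("text", path),
    "csv": lambda path: ("text", path),
    "xsl": lambda path: ("excel", path),
    "jpg": lambda path: ("binary", path),
    "mp4": lambda path: ("binary", path),
}

def register_reader(extension, reader):
    """The 'standard interface' for adding a new data source format."""
    READERS[extension.lower()] = reader

def route(path):
    """Dispatch a source file to the reader for its format."""
    extension = path.rsplit(".", 1)[-1].lower()
    try:
        return READERS[extension](path)
    except KeyError:
        raise ValueError(f"unsupported source format: {extension}")

kind, _ = route("report.csv")
```

A new format is supported by registering one reader, without touching the dispatch logic, e.g. `register_reader("parquet", my_parquet_reader)`.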
5. The multi-source data aggregation sampling method in a big data environment according to claim 3, characterized in that in step (2), the data cleaning means loading the data in the distributed file system HDFS into memory based on the Spark big data processing framework and performing denoising, deduplication, and format conversion operations, the detailed process comprising:
reading data: establishing a data model based on Spark RDD/DataFrame and reading the data in the HDFS files, converting them into RDD/DataFrame;
removing duplicate data: removing duplicates from the data generated in the reading step by means of designed functions or built-in functions;
removing noise data: using a rule engine to allow free configuration of combined condition judgment rules, reducing or removing noise data while avoiding the loss of effective information;
performing format conversion: converting data of different formats into a unified format.
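The configurable noise rules and the format-unification step can be sketched together as follows. This is a minimal stand-in for the rule engine: each rule is a regular expression, rules are combined as "any match drops the row", and the unified format is assumed (for illustration) to be trimmed, lower-case text:

```python
import re

# Freely configurable noise rules; a real rule engine would load these
# from configuration rather than hard-code them.
NOISE_RULES = [
    re.compile(r"^\s*$"),           # blank rows
    re.compile(r"N/A|NULL", re.I),  # placeholder values
]

def clean(rows, rules=NOISE_RULES):
    """Drop noise rows by rule, then convert the rest to one format."""
    kept = [r for r in rows if not any(rule.search(r) for rule in rules)]
    return [r.strip().lower() for r in kept]  # unified format

cleaned = clean(["  Sensor-01  ", "NULL", "", "Sensor-02"])
```

Because the rules are data rather than code, noise handling can be tightened or relaxed without redeployment, which is the point of using a rule engine here.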
6. The multi-source data aggregation sampling method in a big data environment according to claim 5, characterized in that the performing of canonical representation includes normalization methods for numeric attributes and date attributes; the attribute values of date attributes are uniformly expressed as year XX, month XX, day XX, and the normalization of the attribute values of numeric attributes mainly includes two steps, numerical conversion and unit unification: numerical conversion means completely converting original values containing thousands separators, Chinese numerals, and similar cases into Arabic numerals, and unit unification means performing numerical conversion between different units of the same category;
the performing of block aggregation on entities based on entity names and entity attributes requires first partitioning the entities into blocks, putting entities that may refer to the same object into the same block, and then taking the entities from different sources in the same block as candidate matching entity pairs, comparing the entities from different data sources pairwise to determine whether they refer to the same object;
the partitioning groups and aggregates entities using a partitioning strategy based on entity names and entity attributes, and the specific process of the grouped aggregation is: first, each entity name is decomposed into a bigram sequence; second, each item of the bigram sequence is used as a key of an inverted index, and the entity is inserted into the inverted index entry corresponding to that item; then, the entities corresponding to each key of the inverted index are further partitioned according to entity attributes; finally, if two entities from different sources have more than two identical attributes and attribute values, they are subdivided into the same block.
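The bigram inverted-index blocking described above can be sketched as follows. This is a minimal Python illustration with hypothetical entities; "more than two identical attributes and attribute values" is read literally as at least three shared attribute/value pairs:

```python
from collections import defaultdict

def bigrams(name):
    """Decompose an entity name into its bigram sequence."""
    return [name[i:i + 2] for i in range(len(name) - 1)]

def block(entities):
    """Group candidates via an inverted index keyed by name bigrams,
    then refine by shared attribute/value pairs."""
    index = defaultdict(list)
    for entity in entities:
        for key in set(bigrams(entity["name"])):
            index[key].append(entity)       # insert entity under this bigram key
    blocks = []
    for group in index.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a, b = group[i], group[j]
                shared = [k for k in a["attrs"] if b["attrs"].get(k) == a["attrs"][k]]
                pair = {a["name"], b["name"]}
                if len(shared) > 2 and pair not in blocks:
                    blocks.append(pair)
    return blocks

entities = [
    {"name": "apple inc", "attrs": {"country": "US", "sector": "tech", "founded": "1976"}},
    {"name": "apple incorporated", "attrs": {"country": "US", "sector": "tech", "founded": "1976"}},
    {"name": "banana co", "attrs": {"country": "US"}},
]
pairs = block(entities)
```

Only pairs that both share a name bigram and agree on more than two attribute/value pairs end up in the same block, which keeps pairwise entity-alignment comparisons far below the full quadratic candidate set.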
7. An information data processing terminal implementing the multi-source data aggregation sampling method in a big data environment according to any one of claims 1 to 6.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the multi-source data aggregation sampling method in a big data environment according to any one of claims 1 to 6.
9. A multi-source data aggregation sampling system in a big data environment implementing the multi-source data aggregation sampling method in a big data environment according to claim 1, characterized in that the multi-source data aggregation sampling system in a big data environment comprises:
a data source acquisition module, connected to the central control module, for acquiring a plurality of original data sources, each original data source comprising a data source name and at least one associated domain;
a central control module, connected to the data source acquisition module, the preprocessing module, the policy list construction module, the data fusion module, the word segmentation module, the sampling module, and the display module, for controlling the normal operation of each module through a central processing unit;
a preprocessing module, connected to the central control module, for cleaning, identifying, and removing redundancy from the acquired data sources through a data processing program;
a policy list construction module, connected to the central control module, for obtaining an original policy list according to the original data sources through a construction program, and sorting the original policies in the original policy list to form an inter-data-source policy list;
a data fusion module, connected to the central control module, for performing fusion processing on data sets from different sources through a fusion program;
a word segmentation module, connected to the central control module, for segmenting the fused files to form a two-dimensional file-word frequency matrix;
a sampling module, connected to the central control module, for selecting seed root-node keywords guided by the data target through a sampling program, inputting the snowball-sampling depth, setting a balance check value on the basis of the seed root-node data, and performing snowball sampling by iterating over each word;
a display module, connected to the central control module, for displaying the acquired multi-source data on a display.
10. A cloud server for multi-source data aggregation sampling in a big data environment implementing the multi-source data aggregation sampling method in a big data environment according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910373940.8A CN110147357A (en) | 2019-05-07 | 2019-05-07 | The multi-source data polymerization methods of sampling and system under a kind of environment based on big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110147357A true CN110147357A (en) | 2019-08-20 |
Family
ID=67594665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910373940.8A Pending CN110147357A (en) | 2019-05-07 | 2019-05-07 | The multi-source data polymerization methods of sampling and system under a kind of environment based on big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110147357A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066534A (en) * | 2017-03-02 | 2017-08-18 | 人谷科技(北京)有限责任公司 | Multi-source data polymerization and system |
CN107451282A (en) * | 2017-08-09 | 2017-12-08 | 南京审计大学 | A kind of multi-source data polymerization Sampling Strategies under the environment based on big data |
CN107633075A (en) * | 2017-09-22 | 2018-01-26 | 吉林大学 | A kind of multi-source heterogeneous data fusion platform and fusion method |
CN108470074A (en) * | 2018-04-04 | 2018-08-31 | 河北北方学院 | A kind of multi-source data under the environment based on big data polymerize sampling system |
CN108647318A (en) * | 2018-05-10 | 2018-10-12 | 北京航空航天大学 | A kind of knowledge fusion method based on multi-source data |
CN109165202A (en) * | 2018-07-04 | 2019-01-08 | 华南理工大学 | A kind of preprocess method of multi-source heterogeneous big data |
2019-05-07 CN CN201910373940.8A patent/CN110147357A/en active Pending
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110515926A (en) * | 2019-08-28 | 2019-11-29 | 国网天津市电力公司 | Heterogeneous data source mass data carding method based on participle and semantic dependency analysis |
CN110597879A (en) * | 2019-09-17 | 2019-12-20 | 第四范式(北京)技术有限公司 | Method and device for processing time series data |
CN110597879B (en) * | 2019-09-17 | 2022-01-14 | 第四范式(北京)技术有限公司 | Method and device for processing time series data |
CN112579770A (en) * | 2019-09-30 | 2021-03-30 | 北京国双科技有限公司 | Knowledge graph generation method, device, storage medium and equipment |
US11449514B2 (en) | 2019-12-27 | 2022-09-20 | Interset Software LLC | Approximate aggregation queries |
CN111431967A (en) * | 2020-02-25 | 2020-07-17 | 天宇经纬(北京)科技有限公司 | Multi-source heterogeneous data representation and distribution method and device based on business rules |
CN111400569A (en) * | 2020-03-13 | 2020-07-10 | 重庆特斯联智慧科技股份有限公司 | Big data analysis method and system of multi-source aggregation structure |
CN111581281A (en) * | 2020-04-24 | 2020-08-25 | 贵州力创科技发展有限公司 | Data fusion method and device |
CN111639054A (en) * | 2020-05-29 | 2020-09-08 | 中国人民解放军国防科技大学 | Data coupling method, system and medium for ocean mode and data assimilation |
CN111639054B (en) * | 2020-05-29 | 2023-11-07 | 中国人民解放军国防科技大学 | Data coupling method, system and medium for ocean mode and data assimilation |
WO2021135323A1 (en) * | 2020-07-31 | 2021-07-08 | 平安科技(深圳)有限公司 | Method and apparatus for fusion processing of municipal multi-source heterogeneous data, and computer device |
CN111966571A (en) * | 2020-08-12 | 2020-11-20 | 重庆邮电大学 | Time estimation cooperative processing method based on ARM-FPGA coprocessor heterogeneous platform |
CN111966571B (en) * | 2020-08-12 | 2023-05-12 | 重庆邮电大学 | Time estimation cooperative processing method based on ARM-FPGA coprocessor heterogeneous platform |
CN111708773A (en) * | 2020-08-13 | 2020-09-25 | 江苏宝和数据股份有限公司 | Multi-source scientific and creative resource data fusion method |
CN111985578A (en) * | 2020-09-02 | 2020-11-24 | 深圳壹账通智能科技有限公司 | Multi-source data fusion method and device, computer equipment and storage medium |
CN112214573A (en) * | 2020-10-30 | 2021-01-12 | 数贸科技(北京)有限公司 | Information search system, method, computing device, and computer storage medium |
CN112486989A (en) * | 2020-11-28 | 2021-03-12 | 河北省科学技术情报研究院(河北省科技创新战略研究院) | Multi-source data granulation fusion and index classification and layering processing method |
CN113315813A (en) * | 2021-05-08 | 2021-08-27 | 重庆第二师范学院 | Information exchange method and system for big data internet information chain system |
CN113609715A (en) * | 2021-10-11 | 2021-11-05 | 深圳奥雅设计股份有限公司 | Multivariate model data fusion method and system under digital twin background |
CN114896963A (en) * | 2022-07-08 | 2022-08-12 | 北京百炼智能科技有限公司 | Data processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110147357A (en) | The multi-source data polymerization methods of sampling and system under a kind of environment based on big data | |
JP7170779B2 (en) | Methods and systems for automatic intent mining, classification, and placement | |
CA2953969C (en) | Interactive interfaces for machine learning model evaluations | |
CN103336790B (en) | Hadoop-based fast neighborhood rough set attribute reduction method | |
CN109739939A (en) | The data fusion method and device of knowledge mapping | |
US20150379430A1 (en) | Efficient duplicate detection for machine learning data sets | |
US20150379429A1 (en) | Interactive interfaces for machine learning model evaluations | |
CN109165202A (en) | A kind of preprocess method of multi-source heterogeneous big data | |
CN113032579B (en) | Metadata blood relationship analysis method and device, electronic equipment and medium | |
US20230139783A1 (en) | Schema-adaptable data enrichment and retrieval | |
KR102219955B1 (en) | Behavior-based platform system using the bigdata | |
CN110990467B (en) | BIM model format conversion method and conversion system | |
CN111627552B (en) | Medical streaming data blood-edge relationship analysis and storage method and device | |
CN114462623B (en) | Data analysis method, system and platform based on edge calculation | |
CN110928981A (en) | Method, system and storage medium for establishing and perfecting iteration of text label system | |
Cong | Personalized recommendation of film and television culture based on an intelligent classification algorithm | |
JP7347179B2 (en) | Methods, devices and computer programs for extracting web page content | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN116522912B (en) | Training method, device, medium and equipment for package design language model | |
Ravichandran | Big Data processing with Hadoop: a review | |
JP2022168859A (en) | Computer implementation method, computer program, and system (prediction query processing) | |
CN110543467B (en) | Mode conversion method and device for time series database | |
US11514321B1 (en) | Artificial intelligence system using unsupervised transfer learning for intra-cluster analysis | |
CN109086373B (en) | Method for constructing fair link prediction evaluation system | |
Shouaib et al. | Survey on iot-based big data analytics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||