CN106250556A - Data mining method for big data analysis
- Publication number: CN106250556A
- Application number: CN201610675596.4A
- Authority: CN (China)
- Prior art keywords: data, processing terminal, network, decoding, cleansing
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
The invention discloses a data mining method for big data analysis. First, a platform for anomaly detection and elimination in big data cleaning is built; the platform includes a data source. The data source is connected to a processing terminal over a network, and the processing terminal includes a data acquisition module and a data cleaning module. Then, after the data acquisition module sends a data request to the data source, the data source transmits data to the processing terminal, and the processing terminal starts the data cleaning module to clean the data. Combined with the other methods described, this achieves the purpose of extracting valid data by filtering ("denoising") the data.
Description
Technical field
The present invention relates to the technical field of data cleaning, and in particular to a data mining method for big data analysis.
Background technology
A big data trading platform called Data Treasure is currently on the market. It follows an e-commerce model, trades in big data commodities such as DaaS, PaaS and SaaS offerings, and operates as a self-operated store plus third-party merchants.
The DaaS data service provides nearly 200 data indicators covering 6 themes and 19 industry categories. The PaaS applications provide hosted applications and customized secondary development for government agencies, enterprises and public institutions, and individual teams, reducing the customer's construction cost. The SaaS products are general-purpose products built from years of accumulated technical capability, industry experience and solutions, by extracting the related technical frameworks and the common features of each vertical industry. The platform gives product trading a feasible, convenient, secure, worry-free and efficient business model.
The Data Treasure big data trading platform is committed to building a leading domestic e-commerce trading platform for the big data industry.
Data Treasure is more than a website; it is an e-commerce platform.
It can host the big data trade of other merchants, whose products may include database service interfaces, data applications and big data products. For customers, the platform selects high-quality vendors and services and takes full responsibility for the quality of third-party products. For merchants, the platform offers a distinctive cooperation model and marketing channels across all media.
Data sources are the basis of big data analysis and data trading. The data sources on the platform include: industry data of all kinds accumulated over many years; shared data obtained through agreements with partners (such as Baidu, Aggregated Data, institute of China, etc.); open data from governments and institutions at home and abroad; and data from cooperating local governments and operators.
However, not all of the information that the data acquisition module collects from a data source is valuable for big data purposes. Some data is not content of interest, and other data consists of completely erroneous interference items. The data therefore has to be filtered ("denoised") to extract the valid data.
Summary of the invention
The technical problem to be solved by the present invention is to provide a data mining method for big data analysis that extracts valid data by filtering ("denoising") the data.
To solve the above technical problem, the technical solution of the present invention is:
A data mining method for big data analysis, specifically as follows:
First, a platform for anomaly detection and elimination in big data cleaning is built; the platform includes a data source.
The data source is connected to a processing terminal over a network, and the processing terminal includes a data acquisition module and a data cleaning module.
Then, after the data acquisition module sends a data request to the data source, the data source transmits data to the processing terminal, and the processing terminal then starts the data cleaning module to clean the data.
Data cleaning thus mainly performs identification, extraction and cleaning of the received data. 1) Extraction: because the acquired data may have many structures and types, the extraction process converts this complex data into a single structure, or one that is easy to process, so that it can be analyzed and processed quickly. 2) Cleaning: not all big data is valuable; some is not content of interest, and some consists of completely erroneous interference items, so the data must be filtered ("denoised") to extract the valid data.
Detailed description of the invention
To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The data mining method for big data analysis is specifically as follows:
First, a platform for anomaly detection and elimination in big data cleaning is built; the platform includes a data source.
The data source is connected to a processing terminal over a network, and the processing terminal includes a data acquisition module and a data cleaning module.
Then, after the data acquisition module sends a data request to the data source, the data source transmits data to the processing terminal, and the processing terminal then starts the data cleaning module to clean the data.
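The request, transfer and cleaning sequence described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names `DataSource` and `ProcessingTerminal`, the in-memory record list that stands in for the network, and the `is_valid` filter are all hypothetical and do not appear in the patent text.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict  # a record is modeled as a plain dictionary (assumption)


@dataclass
class DataSource:
    """Hypothetical in-memory data source that answers data requests."""
    records: list

    def handle_request(self) -> list:
        # Return copies so the terminal cannot mutate the source's records.
        return [dict(r) for r in self.records]


@dataclass
class ProcessingTerminal:
    """Hypothetical terminal holding an acquisition and a cleaning module."""
    source: DataSource
    is_valid: Callable[[Record], bool]  # stand-in for the cleaning rule

    def acquire(self) -> list:
        # Data acquisition module: send a request, receive the data.
        return self.source.handle_request()

    def clean(self, records: list) -> list:
        # Data cleaning module: keep only records that pass the filter.
        return [r for r in records if self.is_valid(r)]

    def run(self) -> list:
        # Acquisition first, then cleaning, as in the described method.
        return self.clean(self.acquire())
```

For example, a terminal built with `is_valid=lambda r: r.get("value") is not None` would drop every record whose `value` field is missing.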
The data cleaning works as follows: statistical methods are used to examine the numeric attributes of the data, computing the mean and standard deviation of each field's values and using each field's confidence interval to identify abnormal fields and records. Data mining methods are also applied to data cleaning, specifically: clustering methods to detect abnormal records; model-based methods to find abnormal records that do not conform to existing patterns; or association rule methods to find abnormal data that does not conform to rules with high confidence and support in the data set.
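The statistical check on numeric attributes might look like the sketch below. The patent does not say how the confidence interval is constructed, so the interval here (mean plus or minus `z` sample standard deviations per field) is an assumption, as are the function name `find_abnormal_records` and the default `z=3.0`.

```python
from statistics import mean, stdev


def find_abnormal_records(records, fields, z=3.0):
    """Return sorted indexes of records with a numeric field value
    outside mean +/- z * standard deviation for that field."""
    abnormal = set()
    for name in fields:
        # Collect the numeric values present for this field.
        values = [r[name] for r in records
                  if isinstance(r.get(name), (int, float))]
        if len(values) < 2:
            continue  # a standard deviation needs at least two values
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue  # all values identical, nothing can be flagged
        low, high = mu - z * sigma, mu + z * sigma
        for i, r in enumerate(records):
            v = r.get(name)
            if isinstance(v, (int, float)) and not (low <= v <= high):
                abnormal.add(i)
    return sorted(abnormal)
```

Because the mean and standard deviation are computed from data that still contains the outliers, extreme values widen the interval; a robust variant could use the median and the median absolute deviation instead.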
Duplicate records are also cleaned. Eliminating approximately duplicate records from a data set is currently the most studied topic in the data cleaning field. To eliminate duplicates from a data set, the key question is how to determine whether two records are approximate duplicates.
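One simple way to decide whether two records are approximate duplicates is to normalize selected fields and compare the resulting strings. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the patent does not name a similarity measure, so this choice, the `0.9` threshold and the helper names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def record_key(record, fields):
    # Lowercase each field and collapse runs of whitespace before joining.
    parts = (" ".join(str(record.get(f, "")).lower().split()) for f in fields)
    return " ".join(parts)


def is_approx_duplicate(a, b, fields, threshold=0.9):
    # Similarity ratio in [0, 1]; 1.0 means the normalized keys are equal.
    ratio = SequenceMatcher(None, record_key(a, fields),
                            record_key(b, fields)).ratio()
    return ratio >= threshold


def drop_approx_duplicates(records, fields, threshold=0.9):
    """Keep the first record of every group of approximate duplicates."""
    kept = []
    for r in records:
        if not any(is_approx_duplicate(r, k, fields, threshold) for k in kept):
            kept.append(r)
    return kept
```

This pairwise comparison is quadratic in the number of records, so it only suits small data sets; larger ones would typically be sorted or blocked on a key first so that only nearby records are compared.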
In data warehouse applications, data cleaning must first consider data integration, which mainly maps the structures and data in the data sources onto the target structure and domains. Many researchers have done a great deal of work in this area.
Many data cleaning schemes and algorithms target specific application problems and apply only within a narrow scope. General-purpose, application-independent algorithms and schemes are rare.
Data cleaning mainly performs identification, extraction and cleaning of the received data. 1) Extraction: because the acquired data may have many structures and types, the extraction process converts this complex data into a single structure, or one that is easy to process, so that it can be analyzed and processed quickly. 2) Cleaning: not all big data is valuable; some is not content of interest, and some consists of completely erroneous interference items, so the data must be filtered ("denoised") to extract the valid data.
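The extraction step, converting data of several structures into one, can be illustrated with two input formats. The target structure (`id`, `name`, `amount`) and the function names below are hypothetical; the patent does not specify any formats or fields.

```python
import csv
import io
import json


def to_target(raw: dict) -> dict:
    """Map one raw record onto the hypothetical target structure."""
    return {
        "id": str(raw["id"]).strip(),      # identifiers are kept as text
        "name": str(raw["name"]).strip(),  # surrounding spaces removed
        "amount": float(raw["amount"]),    # amounts become floats
    }


def extract_json(text: str) -> list:
    # Input: a JSON array of objects.
    return [to_target(item) for item in json.loads(text)]


def extract_csv(text: str) -> list:
    # Input: CSV text whose first line is a header row.
    return [to_target(row) for row in csv.DictReader(io.StringIO(text))]
```

Once both sources go through `to_target`, the cleaning code only has to handle one record shape, whatever the original format was.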
The data source transmits data to the processing terminal over the network. Before being sent over the network, the data is usually encoded; however, codes produced by existing encoding methods are easily intercepted from outside and decoded during transmission, so that the content of the transmitted data is leaked and losses are incurred.
The following method of the present invention addresses the problem that encoded data is easily intercepted from outside and decoded during transmission, which leaks the content of the transmitted data and causes losses.
The data source transmits data to the processing terminal over the network, and the method by which the data is sent to the processing terminal over the network includes the following steps:
Step 1-1: the processing terminal receives an instruction from the network requesting that a link be established;
Step 1-2: after the link is established, obtain the confirmation instruction that the network has encoded first, decode it, and obtain the decoded confirmation instruction;
Step 1-3: determine whether the decoded confirmation instruction meets a first set requirement; if so, perform step 1-4;
Step 1-4: obtain the label, assigned by the network, that corresponds to the confirmation instruction;
Step 1-5: determine whether the label and the decoded confirmation instruction meet a second set requirement; if so, perform step 1-6;
Step 1-6: receive the data encoded by the network, and decode the encoded data to obtain the corresponding decoded data.
The method thus includes: the processing terminal receives an instruction from the network requesting that a link be established; after the link is established, it receives the confirmation instruction that the network has encoded first, decodes it, and obtains the decoded confirmation instruction; it determines whether the decoded confirmation instruction meets the first set requirement and, if so, obtains the label, assigned by the network, that corresponds to the confirmation instruction; it determines whether the label and the decoded confirmation instruction meet the second set requirement and, if so, receives the data encoded by the network and decodes it to obtain the corresponding decoded data. This achieves secure transmission of all data.
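The terminal side of steps 1-2 to 1-6 can be sketched as below. The patent leaves the encoding, the label and both "set requirements" unspecified, so everything concrete here is an assumption: base64 stands in for the encoding (it is not encryption and gives no secrecy), the first requirement is a hypothetical `CONFIRM:` prefix, and the second requirement is that the label equals an HMAC-SHA256 of the confirmation instruction under a hypothetical pre-shared key.

```python
import base64
import hashlib
import hmac

SHARED_KEY = b"demo-key"       # hypothetical pre-shared secret
EXPECTED_PREFIX = "CONFIRM:"   # hypothetical "first set requirement"


def encode(text: str) -> bytes:
    # Stand-in encoding only; base64 is NOT encryption.
    return base64.b64encode(text.encode("utf-8"))


def decode(blob: bytes) -> str:
    return base64.b64decode(blob).decode("utf-8")


def make_label(instruction: str) -> str:
    # Label the network side would attach to a confirmation instruction.
    return hmac.new(SHARED_KEY, instruction.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def receive_data(encoded_confirmation: bytes, label: str, encoded_data: bytes):
    """Return the decoded data, or None if either check fails."""
    confirmation = decode(encoded_confirmation)            # step 1-2
    if not confirmation.startswith(EXPECTED_PREFIX):       # step 1-3
        return None
    # Step 1-4: the label arrives from the network as an argument.
    expected = make_label(confirmation)
    if not hmac.compare_digest(label, expected):           # step 1-5
        return None
    return decode(encoded_data)                            # step 1-6
```

The point of the two checks is that payload data is decoded only after both the confirmation instruction and its label have been verified; a real deployment would also need an actual cipher or TLS for confidentiality.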
Taking the above preferred embodiments of the present invention as inspiration, and based on the above description, those skilled in the art can make various changes and modifications without departing from the technical idea of this invention. The technical scope of this invention is not limited to the content of the description; it must be determined according to the claims.
Claims (3)
1. A data mining method for big data analysis, characterized in that it is specifically as follows:
First, a platform for anomaly detection and elimination in big data cleaning is built; the platform includes a data source;
The data source is connected to a processing terminal over a network, and the processing terminal includes a data acquisition module and a data cleaning module;
Then, after the data acquisition module sends a data request to the data source, the data source transmits data to the processing terminal, and the processing terminal then starts the data cleaning module to clean the data.
2. The data mining method for big data analysis according to claim 1, characterized in that the data cleaning uses statistical methods to examine the numeric attributes of the data, computing the mean and standard deviation of each field's values and using each field's confidence interval to identify abnormal fields and records; data mining methods are applied to data cleaning, specifically: clustering methods to detect abnormal records, model-based methods to find abnormal records that do not conform to existing patterns, or association rule methods to find abnormal data that does not conform to rules with high confidence and support in the data set;
Duplicate records are also cleaned.
3. The data mining method for big data analysis according to claim 1, characterized in that
the data source transmits data to the processing terminal over the network, and the method by which the data is sent to the processing terminal over the network includes the following steps:
Step 1-1: the processing terminal receives an instruction from the network requesting that a link be established;
Step 1-2: after the link is established, obtain the confirmation instruction that the network has encoded first, decode it, and obtain the decoded confirmation instruction;
Step 1-3: determine whether the decoded confirmation instruction meets a first set requirement; if so, perform step 1-4;
Step 1-4: obtain the label, assigned by the network, that corresponds to the confirmation instruction;
Step 1-5: determine whether the label and the decoded confirmation instruction meet a second set requirement; if so, perform step 1-6;
Step 1-6: receive the data encoded by the network, and decode the encoded data to obtain the corresponding decoded data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610675596.4A CN106250556B (en) | 2016-08-17 | 2016-08-17 | Data digging method for big data analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610675596.4A CN106250556B (en) | 2016-08-17 | 2016-08-17 | Data digging method for big data analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106250556A true CN106250556A (en) | 2016-12-21 |
CN106250556B CN106250556B (en) | 2019-06-18 |
Family
ID=57593128
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610675596.4A Active CN106250556B (en) | 2016-08-17 | 2016-08-17 | Data digging method for big data analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106250556B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1874336A (en) * | 2005-12-31 | 2006-12-06 | 华为技术有限公司 | Method and device for treating data stream |
CN104092663A (en) * | 2013-07-24 | 2014-10-08 | 牟大同 | Encryption communication method and encryption communication system |
CN104111996A (en) * | 2014-07-07 | 2014-10-22 | 山大地纬软件股份有限公司 | Health insurance outpatient clinic big data extraction system and method based on hadoop platform |
CN105354198A (en) * | 2014-08-19 | 2016-02-24 | 中国移动通信集团湖北有限公司 | Data processing method and apparatus |
CN104750813A (en) * | 2015-03-30 | 2015-07-01 | 浪潮集团有限公司 | Data cleaning method based on data reduction model |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107679087A (en) * | 2017-09-04 | 2018-02-09 | 浙江聚邦科技有限公司 | A kind of growth information gathering mobile terminal microfluidic platform towards medium-sized and small enterprises |
CN107908744A (en) * | 2017-11-16 | 2018-04-13 | 河南中医药大学 | A kind of method of abnormality detection and elimination for big data cleaning |
CN107908744B (en) * | 2017-11-16 | 2021-05-18 | 河南中医药大学 | Anomaly detection and elimination method for big data cleaning |
CN110008208A (en) * | 2019-04-04 | 2019-07-12 | 北京易华录信息技术股份有限公司 | A kind of data administering method and system |
CN110850297A (en) * | 2019-09-23 | 2020-02-28 | 广东毓秀科技有限公司 | Method for predicting SOH of rail-traffic lithium battery through big data |
CN111858570A (en) * | 2020-07-06 | 2020-10-30 | 中国科学院上海有机化学研究所 | CCS data standardization method, database construction method and database system |
Also Published As
Publication number | Publication date |
---|---|
CN106250556B (en) | 2019-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: Data mining methods for big data analysis; Granted publication date: 20190618; Pledgee: Industrial Bank Co.,Ltd. Shanghai People's Square Branch; Pledgor: GUIZHOU CHINADATAPAY NETWORK TECHNOLOGY CO.,LTD.; Registration number: Y2024310000370 |