CN104750813A - Data cleaning method based on data reduction model - Google Patents

Publication number
CN104750813A
Authority
CN
China
Prior art keywords
data
cleaning method
regularization
method based
internet
Prior art date
Legal status
Pending
Application number
CN201510143215.3A
Other languages
Chinese (zh)
Inventor
赵虎
徐宏伟
王传超
Current Assignee
Inspur Group Co Ltd
Original Assignee
Inspur Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Group Co Ltd
Priority to CN201510143215.3A
Publication of CN104750813A
Pending legal status

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data cleaning method based on a data reduction model and relates to data processing technology. The main steps of the method are collecting data, establishing a data reduction model, and cleaning the data, where the massive collected data is classified and cleaned by the data reduction module. By classifying and cleaning massive Internet data in this way, the method addresses problems such as incompleteness, noise, and inconsistency in the data, obtains a smaller data set with higher cleaning precision, improves data cleaning efficiency, and allows the data to be used effectively.

Description

A data cleaning method based on a data reduction model
Technical Field
The present invention relates to data processing technology, and specifically to a data cleaning method based on a data reduction model.
Background
With the development of the Internet, we have entered the era of big data. Users generate ever more data, and big data companies apply that data ever more widely, yet many problems remain in practice. At present, data produced on the Internet is largely incomplete, noisy, and inconsistent, which makes it hard to use for related analysis.
Data mining is the process of extracting useful knowledge from the large volumes of data stored in databases, data warehouses, or other information repositories. It extracts implicit, valuable, and intelligible information from that data to guide people's activities. The main data mining techniques include association rules, classification rules, cluster analysis, and sequential patterns.
Data cleaning means washing away the "dirty" parts of the data. It is the final procedure for finding and correcting identifiable errors in data files, and includes checking data consistency and handling invalid and missing values. Because the data in a data warehouse is a subject-oriented collection that is extracted from multiple business systems and includes historical data, it is unavoidable that some of the data is erroneous and some records conflict with one another. Such erroneous or conflicting data is clearly unwanted, so it needs to be cleaned out.
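As an illustration of the three cleaning tasks just named (invalid values, missing values, and conflicting records), the following is a minimal Python sketch. The patent does not give an implementation; the field names `id`, `price`, `category`, and `updated`, and the rule of keeping the most recently updated record, are assumptions made only for this example.

```python
def clean_records(records: list[dict]) -> list[dict]:
    """Drop invalid rows, fill missing values, and resolve conflicting duplicates."""
    cleaned: dict[str, dict] = {}
    for rec in records:
        price = rec.get("price")
        # Invalid value: a negative price cannot be correct, so the row is discarded.
        if price is not None and price < 0:
            continue
        row = dict(rec)
        # Missing value: an absent or empty category gets an explicit placeholder.
        if not row.get("category"):
            row["category"] = "unknown"
        # Conflict: two rows share an id; keep the most recently updated one
        # (an assumed policy, not one stated in the patent).
        prev = cleaned.get(row["id"])
        if prev is None or row["updated"] > prev["updated"]:
            cleaned[row["id"]] = row
    return list(cleaned.values())
```

Other policies, such as merging conflicting rows field by field, would fit the same structure.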
Summary of the Invention
To address the shortcomings of the prior art, the present invention proposes a data cleaning method based on a data reduction model.
The technical solution adopted by the data cleaning method based on a data reduction model of the present invention is as follows. The main steps of the method are: collecting data, building a data reduction module, and cleaning the data. The data reduction module is used to classify and clean the large volume of collected data, so that a smaller data set is obtained with higher cleaning precision and the data can be used rationally and effectively.
Preferably, the data is collected from the Internet using vertical search engine technology, gathering both structured and unstructured e-commerce data from the network; all internal and external data related to the business object is searched, and the data suitable for data mining applications is selected from it.
Preferably, vertical search engine technology is used to build an Internet data collection program, which collects data from the Internet according to the configured commodity categories; once collected, the data is stored uniformly in a raw data warehouse.
Preferably, the data reduction module is built as follows: code is written that applies data cube aggregation, data compression, numerosity reduction, and discretization techniques; different data reduction models are built for different data sets; and the data reduction module is used to process the data.
Preferably, the program written above is used to clean the data collected from the Internet.
Preferably, technical analysis is performed on the cleaned data.
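Two of the reduction techniques named above, discretization and numerosity (numerical) reduction, can be sketched in a few lines of Python. The patent does not say which variants it uses; equal-width binning and a histogram of bin counts are assumed here purely as common textbook choices, and the function names are hypothetical.

```python
from collections import Counter


def equal_width_bins(values: list[float], n_bins: int) -> list[int]:
    """Discretization: map each value to the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def histogram(values: list[float], n_bins: int) -> dict[int, int]:
    """Numerosity reduction: replace the raw values with per-bin counts."""
    return dict(Counter(equal_width_bins(values, n_bins)))
```

For example, `equal_width_bins([0, 5, 10, 15, 20], 4)` assigns the five prices to bins `[0, 1, 2, 3, 3]`, and `histogram` keeps only the four counts instead of the five values.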
Compared with the prior art, the data cleaning method based on a data reduction model of the present invention has the following beneficial effects: it uses a data reduction model to classify and clean massive Internet data, addresses problems such as incompleteness, noise, and inconsistency in the data, yields a smaller data set, improves data cleaning efficiency, and achieves higher cleaning precision, so that the data can be used effectively.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the data cleaning method based on a data reduction model.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the data cleaning method based on a data reduction model is described further below with reference to a specific embodiment and the accompanying drawing.
The data cleaning method based on a data reduction model of the present invention uses a data reduction model to classify and clean massive data so that the data can be used effectively. Its main steps are: collecting data, building a data reduction module, and cleaning the data. The data reduction module is used to clean the collected data, yielding a smaller data set, improving cleaning efficiency, and achieving higher cleaning precision.
Embodiment:
In the data cleaning method based on a data reduction model of this embodiment, data is collected from the Internet using vertical search engine technology, gathering both structured and unstructured e-commerce data from the network; all internal and external data related to the business object is searched, and the data suitable for data mining applications is selected from it.
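The selection part of this collection step can be sketched as follows. This is not the patent's collection program: crawling itself is left out, `pages` stands for records that some crawler has already fetched, and `RawWarehouse` is a hypothetical in-memory stand-in for the raw data warehouse.

```python
from dataclasses import dataclass, field


@dataclass
class RawWarehouse:
    """Hypothetical in-memory stand-in for the raw data warehouse."""
    rows: list = field(default_factory=list)

    def store(self, row: dict) -> None:
        self.rows.append(row)


def collect_by_category(pages, categories, warehouse: RawWarehouse) -> int:
    """Store only the pages whose category is configured; return how many were stored."""
    wanted = {c.lower() for c in categories}
    stored = 0
    for page in pages:
        # Pages without a category are skipped rather than guessed at.
        if str(page.get("category", "")).lower() in wanted:
            warehouse.store(page)  # stored unchanged; cleaning happens later
            stored += 1
    return stored
```

Storing the rows unchanged keeps the raw warehouse as a faithful copy of what was collected, so later reduction and cleaning can be rerun with different settings.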
The step of building the data reduction module is as follows: code is written that applies data cube aggregation, data compression, numerosity reduction, discretization, and similar techniques; different data reduction models are built for different data sets; and the data reduction module is used to process the data.
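Data cube aggregation, the first technique listed, replaces detail rows with totals over chosen dimensions. The sketch below assumes rows with hypothetical `category`, `month`, and `sales` fields; the patent does not specify the dimensions or measures it aggregates.

```python
from collections import defaultdict


def aggregate_cube(rows: list[dict], dims: tuple, measure: str) -> dict:
    """Sum `measure` over every distinct combination of the given dimensions."""
    cube: dict = defaultdict(float)
    for row in rows:
        # The cell key is the tuple of this row's values on the chosen dimensions.
        key = tuple(row[d] for d in dims)
        cube[key] += row[measure]
    return dict(cube)
```

Calling it with `("category", "month")` gives one cell per category and month; calling it with `("category",)` rolls the same rows up to one total per category, which is how a smaller data set is obtained from the same source.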
The specific steps of cleaning the data are shown in Figure 1:
Step 1: Use vertical search engine technology to build an Internet data collection program. This program collects data from the Internet according to the configured commodity categories and, once collection is complete, stores the data uniformly in a raw data warehouse. The raw data warehouse is the foundation of the whole system.
Step 2: Write code that applies data cube aggregation, data compression, numerosity reduction, discretization, and similar techniques; build different data reduction models for different data sets; and use the data reduction model to process the data.
Step 3: Use the program written in Step 2 to clean the data collected from the Internet, reducing its incompleteness, noise, and inconsistency.
Step 4: Perform scientific technical analysis on the cleaned data.
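The way the steps above chain together can be sketched as a small pipeline that passes the collected rows through named stages and records how many rows survive each one. The stage functions `dedupe` and `drop_incomplete` are hypothetical placeholders for the reduction and cleaning programs, which the patent does not spell out.

```python
def run_pipeline(source, stages):
    """Run rows through (name, function) stages and track the row count after each."""
    data = list(source)
    sizes = [("collected", len(data))]
    for name, stage in stages:
        data = stage(data)
        sizes.append((name, len(data)))
    return data, sizes


def dedupe(rows):
    """Placeholder reduction stage: keep the first row seen for each id."""
    seen = {}
    for row in rows:
        seen.setdefault(row["id"], row)
    return list(seen.values())


def drop_incomplete(rows):
    """Placeholder cleaning stage: discard rows with a missing price."""
    return [row for row in rows if row.get("price") is not None]
```

A call such as `run_pipeline(rows, [("reduced", dedupe), ("cleaned", drop_incomplete)])` returns the cleaned rows ready for analysis together with the per-stage counts, which makes the shrinking of the data set visible.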
The above embodiment is only a specific example of the present invention. The scope of patent protection of the present invention includes but is not limited to this embodiment; any appropriate change or substitution that is consistent with the claims of the present invention and made by a person of ordinary skill in the art shall fall within the scope of patent protection of the present invention.

Claims (6)

1. A data cleaning method based on a data reduction model, characterized in that its main steps comprise: collecting data, building a data reduction module, and cleaning the data, wherein the data reduction module is used to classify and clean the large volume of collected data.
2. The data cleaning method based on a data reduction model according to claim 1, characterized in that the data is collected from the Internet using vertical search engine technology, gathering structured and unstructured e-commerce data from the network; all internal and external data related to the business object is searched, and the data suitable for data mining applications is selected from it.
3. The data cleaning method based on a data reduction model according to claim 2, characterized in that vertical search engine technology is used to build an Internet data collection program, the Internet data collection program collects data from the Internet according to the configured commodity categories, and the collected data is stored uniformly in a raw data warehouse.
4. The data cleaning method based on a data reduction model according to any one of claims 1 to 3, characterized in that the data reduction module is built as follows: code is written that applies data cube aggregation, data compression, numerosity reduction, and discretization techniques; different data reduction models are built for different data sets; and the data reduction module is used to process the data.
5. The data cleaning method based on a data reduction model according to claim 4, characterized in that the program written above is used to clean the data collected from the Internet.
6. The data cleaning method based on a data reduction model according to claim 5, characterized in that technical analysis is performed on the cleaned data.
CN201510143215.3A 2015-03-30 2015-03-30 Data cleaning method based on data reduction model Pending CN104750813A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510143215.3A CN104750813A (en) 2015-03-30 2015-03-30 Data cleaning method based on data reduction model

Publications (1)

Publication Number Publication Date
CN104750813A true CN104750813A (en) 2015-07-01

Family

ID=53590497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510143215.3A Pending CN104750813A (en) 2015-03-30 2015-03-30 Data cleaning method based on data reduction model

Country Status (1)

Country Link
CN (1) CN104750813A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090327330A1 (en) * 2008-06-27 2009-12-31 Business Objects, S.A. Apparatus and method for dynamically materializing a multi-dimensional data stream cube
CN101901260A (en) * 2010-07-20 2010-12-01 北京酷讯科技有限公司 Method and system for displaying vertical search result in real time
CN102088459A (en) * 2010-12-29 2011-06-08 广东楚天龙智能卡有限公司 Large-centralized data exchanging and integration platform based on trusted exchange
CN102915423A (en) * 2012-09-11 2013-02-06 中国电力科学研究院 System and method for filtering electric power business data on basis of rough sets and gene expressions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
方洪鹰: ""数据挖掘中数据预处理的方法研究"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228270A (en) * 2016-07-27 2016-12-14 广东工业大学 The energy consumption Forecasting Methodology of the extrusion equipment of a kind of big data-driven and system thereof
CN106228270B (en) * 2016-07-27 2020-11-10 广东工业大学 Energy consumption prediction method and system for big data driven extrusion equipment
CN106250556B (en) * 2016-08-17 2019-06-18 贵州数据宝网络科技有限公司 Data digging method for big data analysis
CN106250556A (en) * 2016-08-17 2016-12-21 贵州数据宝网络科技有限公司 Data digging method for big data analysis
CN106354772A (en) * 2016-08-23 2017-01-25 成都卡莱博尔信息技术股份有限公司 Mass data system with data cleaning function
CN106484887A (en) * 2016-10-18 2017-03-08 安徽天达网络科技有限公司 A kind of document handling method based on internet
CN106649516A (en) * 2016-10-18 2017-05-10 安徽天达网络科技有限公司 A large data processing method for educational resources
CN106649523A (en) * 2016-10-18 2017-05-10 安徽天达网络科技有限公司 Commodity resource data processing method
CN106503114A (en) * 2016-10-18 2017-03-15 安徽天达网络科技有限公司 Commodity resource data obtains system
CN108335231A (en) * 2018-01-29 2018-07-27 国网福建省电力有限公司 A kind of power distribution network data diagnosis method of Auto-matching
CN109143017A (en) * 2018-07-31 2019-01-04 成都天衡智造科技有限公司 A kind of semicon industry production test data processing method
CN109143017B (en) * 2018-07-31 2021-03-30 成都天衡智造科技有限公司 Production test data processing method for semiconductor industry
CN109947751A (en) * 2018-12-29 2019-06-28 医渡云(北京)技术有限公司 A kind of medical data processing method, device, readable medium and electronic equipment
CN112019869A (en) * 2020-08-21 2020-12-01 广州欢网科技有限责任公司 Live broadcast data processing method and device
CN112019869B (en) * 2020-08-21 2022-04-22 广州欢网科技有限责任公司 Live broadcast data processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150701
