CN104182465A - Network-based big data processing method - Google Patents
Network-based big data processing method
- Publication number
- CN104182465A
- Authority
- CN
- China
- Prior art keywords
- data
- webpage
- processing method
- network
- cluster
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a network-based big data processing method which is characterized by collecting data from the internet, classifying and clustering the data and establishing a big database. The method comprises steps as follows: a data collection webpage is customized according to the expected target; principal data blocks of the webpage are determined according to the webpage structure, and webpage data extraction templates are automatically generated for extracting webpage data; the webpage data are uniformly encoded, repeated data are normalized, and data are screened; the data are divided into N data classes according to a preset classification model; data are clustered according to a preset clustering algorithm; and according to the classification and clustering result, the data are uniformly stored, indexes are established, and a big database is formed. According to the network-based big data processing method provided by the invention, the webpage data can be effectively extracted, the repeated information is normalized, and a user can conveniently and effectively utilize the webpage data.
Description
Technical field
The present invention relates to the technical field of information extraction, and in particular to a network-based big data processing method.
Background art
Information extraction is an emerging research field. It generally refers to the process of automatically identifying predefined types of information, such as entities, relations and events, in a given collection of documents, and of storing and managing that information in structured form. Information extraction has important applications in many fields.
In recent years, with the development of networks, the amount of information on the internet has kept growing. Almost all network information is presented to users as structured or semi-structured text. Web page information extraction means extracting the relevant information contained in web pages and structuring it so that it takes a uniform organizational form. Its main task is to extract predetermined information points from a variety of web pages and then integrate them in a unified format, so that they can be conveniently viewed and compared.
On the internet, information on the same subject is usually scattered across different websites and presented in different forms, so the prior art has difficulty completing the expected web mining. In addition, information on the internet is frequently reposted, so how to normalize duplicate information is also a key problem.
Summary of the invention
To address the technical problems described in the background, the present invention proposes a network-based big data processing method that can effectively extract web page data and normalize duplicate information, making it convenient for users to use web page data effectively.
The network-based big data processing method proposed by the present invention collects data from the internet, classifies and clusters the data, and builds a big database. It comprises:
Customizing data collection web pages according to the expected target;
Determining the main data block of each web page according to the web page structure, and automatically generating web page data extraction templates to extract the web page data;
Uniformly encoding the web page data, normalizing duplicate data, and screening the data;
Dividing the data into N data classes according to a preset classification model;
Clustering the data according to a preset clustering algorithm;
Storing the data uniformly and building indexes according to the classification and clustering results, forming a big database.
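As an illustration only, the second step listed above (determining the main data block from the web page structure) could be sketched as follows. The text does not say how the block is chosen, so the heuristic here is an assumption: each container element is scored by the amount of non-link text it directly holds, and the largest one is kept. The names `BLOCK_TAGS`, `BlockCollector` and `main_block` are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "article", "section", "td"}  # assumed container tags


class BlockCollector(HTMLParser):
    """Collects the text found inside each block-level container."""

    def __init__(self):
        super().__init__()
        self.stack = []    # indices of the currently open blocks
        self.blocks = []   # one list of text fragments per block seen
        self.in_link = 0   # depth of open <a> tags; link text is ignored

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.blocks.append([])
            self.stack.append(len(self.blocks) - 1)
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.stack.pop()
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        # Text is credited to the innermost open block only.
        if text and self.stack and not self.in_link:
            self.blocks[self.stack[-1]].append(text)


def main_block(html):
    """Return the text of the block holding the most non-link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    texts = [" ".join(parts) for parts in parser.blocks]
    return max(texts, key=len, default="")
```

Ignoring link text is a simple way to keep navigation bars from winning over the article body. A real extraction template would likely also record the path to the chosen block so it can be reused on other pages of the same site.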
Preferably, customizing data collection web pages according to the expected target comprises:
Presetting web pages within the industry as data sources;
Setting up a network probe with a built-in domain ontology, which automatically discovers web pages related to the ontology as collection points.
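How the probe judges that a page is related to the ontology is not described. A minimal sketch, assuming the domain ontology is reduced to a flat set of single-word terms and that relevance is the fraction of those terms found in the page, might look like this. The functions `relevance` and `discover_collection_points` and the threshold value are hypothetical.

```python
import re


def relevance(page_text, ontology_terms):
    """Fraction of ontology terms that occur in the page text."""
    words = set(re.findall(r"\w+", page_text.lower()))
    terms = {t.lower() for t in ontology_terms}
    if not terms:
        return 0.0
    return len(terms & words) / len(terms)


def discover_collection_points(pages, ontology_terms, threshold=0.5):
    """Return the URLs of pages whose relevance reaches the threshold.

    `pages` maps URL -> already-fetched page text; fetching and link
    following are outside the scope of this sketch.
    """
    return [url for url, text in pages.items()
            if relevance(text, ontology_terms) >= threshold]
```

For example, with the terms `{"steel", "furnace", "alloy", "casting"}`, a page mentioning steel, alloy and casting scores 0.75 and would be kept as a collection point, while a page about football scores 0.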
Preferably, uniformly encoding the web page data, normalizing duplicate data and screening the data specifically comprises:
Encoding each segment of text;
Comparing the texts segment by segment according to the codes to determine the degree of data duplication;
Normalizing duplicate data and screening the data.
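The encoding scheme is not specified. One way to read the first two sub-steps, shown here as a sketch under that assumption, is to split each text into sentence-like segments, hash every segment into a short code, and measure how many codes of one text reappear in another. The functions `encode_segments` and `duplication_degree` are hypothetical names, and SHA-256 is simply a convenient stand-in for "a code".

```python
import hashlib
import re


def encode_segments(text):
    """Split text into segments and return one short hash code per segment."""
    segments = [s.strip() for s in re.split(r"[.!?。！？\n]+", text) if s.strip()]
    codes = []
    for segment in segments:
        canonical = " ".join(segment.lower().split())  # case/whitespace-insensitive
        codes.append(hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12])
    return codes


def duplication_degree(text_a, text_b):
    """Share of the segments of text_a whose code also appears in text_b."""
    codes_a = encode_segments(text_a)
    codes_b = set(encode_segments(text_b))
    if not codes_a:
        return 0.0
    return sum(code in codes_b for code in codes_a) / len(codes_a)
```

Because comparison works on segment codes rather than whole documents, a reposted article with reordered or partly changed paragraphs still yields a high degree. Exact hashing only catches segments that are identical after normalization; detecting lightly reworded segments would need a fuzzier code.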
Preferably, storing the data uniformly and building indexes according to the classification and clustering results to form a big database is specifically divided into:
Clustering the N data classes;
Clustering the data contained in each data class.
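The two clustering passes can be illustrated with a small sketch. The preset clustering algorithm is not named in the text, so a single-pass "leader" clustering over word-set Jaccard similarity is swapped in here purely for illustration. The functions `leader_cluster` and `two_level_clustering` and the threshold are assumptions.

```python
def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def leader_cluster(items, threshold=0.2):
    """Single-pass clustering: join the first cluster whose leader is similar enough.

    `items` maps a key to a token set; returns a list of clusters (lists of keys).
    """
    clusters = []  # each entry: (leader_tokens, [keys])
    for key, toks in items.items():
        for leader, members in clusters:
            if jaccard(leader, toks) >= threshold:
                members.append(key)
                break
        else:
            clusters.append((toks, [key]))
    return [members for _, members in clusters]


def two_level_clustering(classes, threshold=0.2):
    """`classes` maps class name -> {document id -> text}."""
    # Pass 1: cluster the N classes, each represented by the union of its words.
    class_tokens = {name: set().union(*(tokens(t) for t in docs.values()))
                    for name, docs in classes.items()}
    class_clusters = leader_cluster(class_tokens, threshold)
    # Pass 2: cluster the documents inside each class.
    doc_clusters = {name: leader_cluster({d: tokens(t) for d, t in docs.items()},
                                         threshold)
                    for name, docs in classes.items()}
    return class_clusters, doc_clusters
```

Leader clustering depends on input order and on the threshold, so it is only a placeholder; any algorithm that accepts a similarity measure could take its place in both passes.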
The way the present invention extracts web page data is efficient and has good recall, avoiding omission of information. It can effectively eliminate duplicate information, which greatly reduces the space occupied by the data, eliminates redundancy, reduces the load of subsequent processing and improves data processing efficiency. With the preset classification model and clustering algorithm, the data are classified and clustered, stored uniformly in a database, and indexed, which facilitates the user's management, retrieval and use of the extracted data.
Brief description of the drawings
Fig. 1 is a flow chart of the network-based big data processing method proposed by the present invention.
Detailed description
With reference to Fig. 1, the network-based big data processing method proposed by the present invention collects data from the internet, classifies and clusters the data, and builds a big database. It comprises the following steps:
Customizing data collection web pages according to the expected target;
Determining the main data block of each web page according to the web page structure, and automatically generating web page data extraction templates to extract the web page data;
Uniformly encoding the web page data, normalizing duplicate data, and screening the data;
Dividing the data into N data classes according to a preset classification model;
Clustering the data according to a preset clustering algorithm;
Storing the data uniformly and building indexes according to the classification and clustering results, forming a big database.
In this embodiment, web page data are extracted automatically, which is efficient, and the collected data are comparatively comprehensive, avoiding omission of information. After the data are uniformly encoded and duplicate data are normalized, the space occupied by the data is greatly reduced, redundancy is eliminated, and the load of subsequent processing is reduced. In this embodiment, the data are classified and clustered, and database indexes are then built from the classification and clustering results, which facilitates the user's management, retrieval and use of the extracted data.
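The classification step (dividing the data into N data classes with a preset classification model) is described only at this level of detail. As a minimal sketch, assuming the preset model is nothing more than a table of weighted keywords per class, classification could look like the following. `PRESET_MODEL`, its classes and weights, and the functions `classify` and `split_into_classes` are all hypothetical.

```python
import re

# Hypothetical preset model: class name -> keyword weights (N = 3 here).
PRESET_MODEL = {
    "finance": {"price": 1.0, "market": 1.0, "stock": 2.0},
    "sports": {"match": 1.0, "team": 1.0, "goal": 2.0},
    "technology": {"software": 2.0, "network": 1.0, "data": 1.0},
}


def classify(text, model=PRESET_MODEL, default="unclassified"):
    """Return the class with the highest keyword score, or `default` if none match."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        label: sum(weights.get(word, 0.0) for word in words)
        for label, weights in model.items()
    }
    if not scores:
        return default
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def split_into_classes(texts, model=PRESET_MODEL):
    """Group texts into the model's data classes (plus a catch-all if needed)."""
    classes = {label: [] for label in model}
    for text in texts:
        classes.setdefault(classify(text, model), []).append(text)
    return classes
```

A trained statistical classifier could replace the keyword table without changing the shape of the result, which is what the later clustering and indexing steps consume.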
In this embodiment, when data collection web pages are customized according to the expected target, the collected web pages come from two kinds of sources:
Web pages within the industry that are preset as data sources;
A network probe with a built-in domain ontology, which automatically discovers web pages related to the ontology as collection points.
The preset data sources are web pages the user expects to follow, which makes web data extraction more targeted and helps improve data collection efficiency. Collection points can be regarded as a supplement to the data sources and improve the recall of data collection. Because data sources and collection points complement each other, data collection efficiency and recall can reach a reasonably good balance.
In this embodiment, uniformly encoding the web page data, normalizing duplicate data and screening the data specifically comprises:
Encoding each segment of text;
Comparing the texts segment by segment according to the codes to determine the degree of data duplication;
Normalizing duplicate data and screening the data.
Encoding the text segment by segment and comparing it segment by segment makes it possible to determine the degree of text duplication effectively and to avoid omissions.
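Once a duplication degree is available, normalization can be sketched as keeping one canonical copy of each text and recording where its copies were found. The text does not define the procedure, so the following is an assumption: the first copy seen is treated as canonical, and the standard library's `difflib.SequenceMatcher` ratio stands in for the segment-based duplication measure. `text_similarity`, `normalize_duplicates` and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Stand-in duplication measure in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def normalize_duplicates(docs, similarity=text_similarity, threshold=0.9):
    """Keep the first copy of each text and fold later near-copies into it.

    `docs` is a list of (url, text) pairs. Returns a list of records, each with
    the canonical text and every URL where a copy of it was found.
    """
    records = []
    for url, text in docs:
        for record in records:
            if similarity(record["text"], text) >= threshold:
                record["urls"].append(url)  # duplicate: remember the source only
                break
        else:
            records.append({"text": text, "urls": [url]})
    return records
```

Storing the text once and keeping only the extra URLs is what reduces the occupied space. Comparing every new text against every stored record is quadratic, so a large collection would need an index over the segment codes instead of this linear scan.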
In this embodiment, storing the data uniformly and building indexes according to the classification and clustering results to form a big database is specifically divided into:
Clustering the N data classes;
Clustering the data contained in each data class.
According to the classification results, the database is divided into two levels: topics and data classes. With the two kinds of cluster analysis performed on this basis, the database can be subdivided into four levels: topics, topic clusters, data classes, and data class clusters. An indexing mechanism is then built on top of these levels, making it more convenient for users to manage, retrieve and use the database.
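The four levels can be pictured as a nested index. The text does not define how the levels nest or how the index is stored, so the sketch below assumes the order topic, topic cluster, data class, data class cluster, with record identifiers at the leaves. The field names and the functions `build_index` and `lookup` are hypothetical.

```python
def build_index(records):
    """Build a nested index from records tagged with the four levels.

    Each record is a dict with keys "id", "topic", "topic_cluster",
    "data_class" and "class_cluster" (all hypothetical field names).
    """
    index = {}
    for rec in records:
        node = index
        for level in ("topic", "topic_cluster", "data_class"):
            node = node.setdefault(rec[level], {})
        node.setdefault(rec["class_cluster"], []).append(rec["id"])
    return index


def lookup(index, *path):
    """Return every record id stored under the given path prefix."""
    node = index
    for key in path:
        node = node.get(key, {}) if isinstance(node, dict) else {}
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if isinstance(current, list):
            found.extend(current)      # leaf: list of record ids
        else:
            stack.extend(current.values())
    return sorted(found)
```

A query can stop at any level: `lookup(index, "energy")` returns everything under one topic, while adding more path components narrows the result down to a single data class cluster.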
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the protection scope of the present invention.
Claims (4)
1. A network-based big data processing method, characterized in that data are collected from the internet, classified and clustered, and a big database is built, the method comprising:
Customizing data collection web pages according to the expected target;
Determining the main data block of each web page according to the web page structure, and automatically generating web page data extraction templates to extract the web page data;
Uniformly encoding the web page data, normalizing duplicate data, and screening the data;
Dividing the data into N data classes according to a preset classification model;
Clustering the data according to a preset clustering algorithm;
Storing the data uniformly and building indexes according to the classification and clustering results, forming a big database.
2. The network-based big data processing method as claimed in claim 1, characterized in that customizing data collection web pages according to the expected target comprises:
Presetting web pages within the industry as data sources;
Setting up a network probe with a built-in domain ontology, which automatically discovers web pages related to the ontology as collection points.
3. The network-based big data processing method as claimed in claim 1, characterized in that uniformly encoding the web page data, normalizing duplicate data and screening the data specifically comprises:
Encoding each segment of text;
Comparing the texts segment by segment according to the codes to determine the degree of data duplication;
Normalizing duplicate data and screening the data.
4. The network-based big data processing method as claimed in claim 1, characterized in that storing the data uniformly and building indexes according to the classification and clustering results to form a big database is specifically divided into:
Clustering the N data classes;
Clustering the data contained in each data class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410348409.2A CN104182465A (en) | 2014-07-21 | 2014-07-21 | Network-based big data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104182465A (en) | 2014-12-03 |
Family
ID=51963505
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410348409.2A Pending CN104182465A (en) | 2014-07-21 | 2014-07-21 | Network-based big data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104182465A (en) |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
EXSB | Decision made by SIPO to initiate substantive examination | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20141203 |