CN104182465A - Network-based big data processing method - Google Patents

Publication number
CN104182465A
Authority
CN
China
Prior art keywords
data
webpage
processing method
network
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410348409.2A
Other languages
Chinese (zh)
Inventor
贾岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-07-21
Filing date
2014-07-21
Publication date
2014-12-03
Application filed by ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201410348409.2A
Publication of CN104182465A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a network-based big data processing method which is characterized by collecting data from the internet, classifying and clustering the data and establishing a big database. The method comprises steps as follows: a data collection webpage is customized according to the expected target; principal data blocks of the webpage are determined according to the webpage structure, and webpage data extraction templates are automatically generated for extracting webpage data; the webpage data are uniformly encoded, repeated data are normalized, and data are screened; the data are divided into N data classes according to a preset classification model; data are clustered according to a preset clustering algorithm; and according to the classification and clustering result, the data are uniformly stored, indexes are established, and a big database is formed. According to the network-based big data processing method provided by the invention, the webpage data can be effectively extracted, the repeated information is normalized, and a user can conveniently and effectively utilize the webpage data.

Description

A network-based big data processing method
Technical field
The present invention relates to the technical field of information extraction, and in particular to a network-based big data processing method.
Background art
Information extraction is an emerging research field. It generally refers to the process of automatically identifying predefined types of information, such as entities, relations, and events, from a given document collection, and of storing and managing that information in structured form. Information extraction has important applications in many fields.
In recent years, with the development of networks, the amount of information on the Internet has kept growing. Nearly all network information is presented to users as structured or semi-structured text. Web page information extraction means extracting the relevant information contained in web pages and structuring it so that it takes a table-like organizational form. Its main task is to extract predetermined information points from various web pages and then integrate them in a unified form for convenient viewing and comparison.
On the Internet, information on the same subject is usually scattered across different websites and presented in different forms, so the prior art has difficulty completing the intended web data mining. In addition, information on the Internet is reposted frequently, so normalizing duplicate information is also a key problem.
Summary of the invention
To address the technical problems described in the background, the present invention proposes a network-based big data processing method that can effectively extract web page data and normalize duplicate information, making it easier for users to use web page data effectively.
The network-based big data processing method proposed by the present invention collects data from the Internet, classifies and clusters the data, and builds a big database. It comprises:
Customizing the data collection web pages according to the expected target;
Determining the principal data block of each web page according to the web page structure, and automatically generating web page data extraction templates to extract the web page data;
Uniformly encoding the web page data, normalizing duplicate data, and screening the data;
Dividing the data into N data classes according to a preset classification model;
Clustering the data according to a preset clustering algorithm;
Uniformly storing the data and building indexes according to the classification and clustering results, thereby forming a big database.
Preferably, customizing the data collection web pages according to the expected target comprises:
Presetting web pages within the industry as data sources;
Setting up a network probe with a built-in domain ontology, which automatically discovers web pages related to the ontology as collection points.
Preferably, uniformly encoding the web page data, normalizing duplicate data, and screening the data specifically comprises:
Encoding each segment of text;
Comparing the text segment by segment according to the encodings to determine the degree of data duplication;
Normalizing duplicate data and screening the data.
Preferably, uniformly storing the data and building indexes according to the classification and clustering results to form a big database is specifically divided into:
Clustering the N data classes;
Clustering the data contained in each data class.
The way the present invention extracts web page data is efficient and achieves good recall, avoiding omission of information. It can effectively eliminate duplicate information, which greatly reduces the space the data occupy, eliminates redundancy, lowers the load on subsequent processing, and improves data processing efficiency. Using the preset classification model and clustering algorithm, the data are classified and cluster-analyzed, stored uniformly in a database, and indexed, which makes it convenient for users to manage, retrieve, and use the extracted data.
Brief description of the drawings
Fig. 1 is a flowchart of the network-based big data processing method proposed by the present invention.
Embodiment
With reference to Fig. 1, the network-based big data processing method proposed by the present invention collects data from the Internet, classifies and clusters the data, and builds a big database. It comprises the following steps:
Customizing the data collection web pages according to the expected target;
Determining the principal data block of each web page according to the web page structure, and automatically generating web page data extraction templates to extract the web page data;
Uniformly encoding the web page data, normalizing duplicate data, and screening the data;
Dividing the data into N data classes according to a preset classification model;
Clustering the data according to a preset clustering algorithm;
Uniformly storing the data and building indexes according to the classification and clustering results, thereby forming a big database.
In this embodiment, web page data are extracted automatically, which is efficient, and the collected data are fairly comprehensive, which avoids omission of information. After the data are uniformly encoded and duplicate data are normalized, the space occupied by the data is greatly reduced, redundancy is eliminated, and the load on subsequent processing is lowered. The data are then classified and clustered, and the database index is built from the classification and clustering results, which makes it convenient for users to manage, retrieve, and use the extracted data.
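The description does not say how the principal data block is located or what an extraction template contains. As a minimal sketch, assuming a simple text-density heuristic (the block-level element holding the most non-link text wins) and assuming a template is just the tag path of that element, the step could look like this. The names `BlockScorer`, `build_template`, and `extract` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "article", "section", "main", "td"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}


class BlockScorer(HTMLParser):
    """Credits non-link text to the nearest enclosing block-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as (tag, id attribute)
        self.scores = {}  # element path -> number of non-link characters
        self.texts = {}   # element path -> text fragments found under it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("id")))

    def handle_endtag(self, tag):
        open_tags = [t for t, _ in self.stack]
        if tag in open_tags:
            # Pop back to the matching open tag, tolerating unclosed children.
            index = len(open_tags) - 1 - open_tags[::-1].index(tag)
            del self.stack[index:]

    def handle_data(self, data):
        text = data.strip()
        open_tags = [t for t, _ in self.stack]
        # Ignore whitespace, link text (navigation), and script/style content.
        if not text or "a" in open_tags or SKIP_TAGS & set(open_tags):
            return
        for depth in range(len(self.stack), 0, -1):
            if open_tags[depth - 1] in BLOCK_TAGS:
                path = tuple(self.stack[:depth])
                self.scores[path] = self.scores.get(path, 0) + len(text)
                self.texts.setdefault(path, []).append(text)
                return


def build_template(html):
    """Return the path of the densest block on a sample page (assumed template)."""
    scorer = BlockScorer()
    scorer.feed(html)
    return max(scorer.scores, key=scorer.scores.get, default=None)


def extract(html, template):
    """Apply a previously built template to another page with the same layout."""
    scorer = BlockScorer()
    scorer.feed(html)
    return " ".join(scorer.texts.get(template, []))
```

A template built from one page of a site can then be reused on other pages that share its layout, which is one plausible reading of "automatically generating extraction templates". Real pages would need a more robust parser and heuristic.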
In this embodiment, the data collection web pages are customized according to the expected target, and the collected web pages come from two kinds of sources:
Web pages within the industry that are preset as data sources;
Web pages related to the ontology that are automatically discovered as collection points by a network probe with a built-in domain ontology.
Presetting the data sources targets the web pages that the user expects to focus on, which makes web data extraction more targeted and helps improve data collection efficiency. Collection points supplement the data sources and improve the recall of data collection. Because data sources and collection points complement each other, data collection efficiency and recall can reach a fairly ideal balance.
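The interplay of preset data sources and probe-discovered collection points can be illustrated with a small sketch. It assumes, purely for illustration, that the domain ontology is reduced to a flat set of terms and that a page counts as "related" when at least a chosen fraction of those terms occurs in its text. The function names, the threshold, and the example URLs are hypothetical, and no network access is performed.

```python
def ontology_relevance(text, ontology_terms):
    """Fraction of ontology terms that occur in the page text (assumed measure)."""
    if not ontology_terms:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for term in ontology_terms if term.lower() in lowered)
    return hits / len(ontology_terms)


def discover_collection_points(candidates, ontology_terms, preset_sources, threshold=0.5):
    """Return candidate URLs that supplement the preset data sources.

    candidates maps URL -> page text; pages already listed as preset
    sources are skipped because they are collected anyway.
    """
    points = []
    for url, text in candidates.items():
        if url in preset_sources:
            continue
        if ontology_relevance(text, ontology_terms) >= threshold:
            points.append(url)
    return sorted(points)
```

Raising the threshold favors precision (fewer, more relevant collection points), while lowering it favors recall, which mirrors the efficiency and recall balance described above.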
In this embodiment, uniformly encoding the web page data, normalizing duplicate data, and screening the data specifically comprises:
Encoding each segment of text;
Comparing the text segment by segment according to the encodings to determine the degree of data duplication;
Normalizing duplicate data and screening the data.
Encoding the text segment by segment and comparing the segments makes it possible to determine the degree of text duplication effectively and avoids omissions.
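The description leaves the encoding and the duplication measure unspecified. A minimal sketch, assuming each segment is a sentence-like span encoded as a SHA-256 hash of its normalized text, and assuming the duplication degree is the share of the shorter document's segments that also occur in the other document, might look as follows. The function names and the 0.8 threshold are illustrative assumptions.

```python
import hashlib
import re


def encode_segments(text):
    """Split text into sentence-like segments and hash each normalized segment."""
    segments = [s.strip() for s in re.split(r"[.!?。！？\n]+", text) if s.strip()]
    return [
        hashlib.sha256(" ".join(s.lower().split()).encode("utf-8")).hexdigest()
        for s in segments
    ]


def duplication_degree(codes_a, codes_b):
    """Share of the shorter document's distinct segments found in the other one."""
    set_a, set_b = set(codes_a), set(codes_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def normalize_duplicates(docs, threshold=0.8):
    """Map every document id to the id of its representative (first seen copy)."""
    representatives = []  # (doc_id, segment codes) of documents that are kept
    mapping = {}
    for doc_id, text in docs.items():
        codes = encode_segments(text)
        for rep_id, rep_codes in representatives:
            if duplication_degree(codes, rep_codes) >= threshold:
                mapping[doc_id] = rep_id
                break
        else:
            representatives.append((doc_id, codes))
            mapping[doc_id] = doc_id
    return mapping
```

Because comparison happens per segment rather than per document, a reposted article with an added byline or trailing notice still maps to the original, which is the kind of partial duplication that whole-document hashing would miss.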
In this embodiment, uniformly storing the data and building indexes according to the classification and clustering results to form a big database is specifically divided into:
Clustering the N data classes;
Clustering the data contained in each data class.
According to the classification result, the database is divided into two levels: topic and data class. With the two kinds of cluster analysis performed on this basis, the database can be subdivided into four levels: topic, topic cluster, data class, and data class cluster. An indexing mechanism is then built on these levels, making it more convenient for users to manage, search, and use the database.
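Neither the classification model nor the clustering algorithm is named in the description. The sketch below covers only part of the scheme, namely classifying documents and then clustering the data inside each class, and it substitutes two deliberately simple stand-ins: a keyword-overlap classifier for the "preset classification model" and a greedy single-pass Jaccard clustering for the "preset clustering algorithm". All names, keywords, and thresholds are assumptions made for illustration.

```python
def tokens(text):
    """Lowercased whitespace tokens of a document, as a set."""
    return set(text.lower().split())


def classify(text, class_keywords):
    """Pick the class whose keyword set overlaps the text most (ties: first class)."""
    words = tokens(text)
    return max(class_keywords, key=lambda name: len(words & class_keywords[name]))


def cluster(doc_tokens, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of document ids; element 0 is the seed
    for doc_id, words in doc_tokens.items():
        for members in clusters:
            seed = doc_tokens[members[0]]
            union = words | seed
            if union and len(words & seed) / len(union) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters


def build_index(docs, class_keywords):
    """Build a two-level index: data class -> list of clusters of document ids."""
    by_class = {}
    for doc_id, text in docs.items():
        label = classify(text, class_keywords)
        by_class.setdefault(label, {})[doc_id] = tokens(text)
    return {label: cluster(members) for label, members in by_class.items()}
```

A lookup then narrows first by data class and then by cluster before touching individual documents. Clustering the N classes themselves, the other half of the scheme, would add a further level above this index and is not shown here.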
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (4)

1. A network-based big data processing method, characterized in that data are collected from the Internet, classified, and clustered, and a big database is built, the method comprising:
Customizing the data collection web pages according to the expected target;
Determining the principal data block of each web page according to the web page structure, and automatically generating web page data extraction templates to extract the web page data;
Uniformly encoding the web page data, normalizing duplicate data, and screening the data;
Dividing the data into N data classes according to a preset classification model;
Clustering the data according to a preset clustering algorithm;
Uniformly storing the data and building indexes according to the classification and clustering results, thereby forming a big database.
2. The network-based big data processing method of claim 1, characterized in that customizing the data collection web pages according to the expected target comprises:
Presetting web pages within the industry as data sources;
Setting up a network probe with a built-in domain ontology, which automatically discovers web pages related to the ontology as collection points.
3. The network-based big data processing method of claim 1, characterized in that uniformly encoding the web page data, normalizing duplicate data, and screening the data specifically comprises:
Encoding each segment of text;
Comparing the text segment by segment according to the encodings to determine the degree of data duplication;
Normalizing duplicate data and screening the data.
4. The network-based big data processing method of claim 1, characterized in that uniformly storing the data and building indexes according to the classification and clustering results to form a big database is specifically divided into:
Clustering the N data classes;
Clustering the data contained in each data class.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410348409.2A CN104182465A (en) 2014-07-21 2014-07-21 Network-based big data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410348409.2A CN104182465A (en) 2014-07-21 2014-07-21 Network-based big data processing method

Publications (1)

Publication Number Publication Date
CN104182465A (en) 2014-12-03

Family

ID=51963505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410348409.2A Pending CN104182465A (en) 2014-07-21 2014-07-21 Network-based big data processing method

Country Status (1)

Country Link
CN (1) CN104182465A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101076800A (en) * 2004-08-23 2007-11-21 汤姆森环球资源公司 Repetitive file detecting and displaying function
US7809695B2 (en) * 2004-08-23 2010-10-05 Thomson Reuters Global Resources Information retrieval systems with duplicate document detection and presentation functions
CN102402592A (en) * 2011-11-04 2012-04-04 同辉佳视(北京)信息技术股份有限公司 Information collecting method based on webpage data mining
CN103389998A (en) * 2012-05-11 2013-11-13 安徽华贞信息科技有限公司 Novel Internet commercial intelligence information semantic analysis technology based on cloud service
CN103023970A (en) * 2012-11-15 2013-04-03 中国科学院计算机网络信息中心 Method and system for storing mass data of Internet of Things (IoT)
CN103150335A (en) * 2013-01-25 2013-06-12 河南理工大学 Co-clustering-based coal mine public sentiment monitoring system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
俞昊旻: "文档部分重复检测研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160014A (en) * 2015-09-24 2015-12-16 四川师范大学 Data processing method and apparatus
CN106503113A (en) * 2016-10-18 2017-03-15 安徽天达网络科技有限公司 A kind of data processing method based on LAN
CN107092618A (en) * 2016-10-27 2017-08-25 北京小度信息科技有限公司 A kind of information processing method and device
CN108108747A (en) * 2017-09-21 2018-06-01 西安交通大学 A kind of clustering method for the view-based access control model principle for solving big data cluster
CN107992534A (en) * 2017-11-23 2018-05-04 安徽科创智慧知识产权服务有限公司 The method that improved sort key sorts data set
CN108399205A (en) * 2018-01-31 2018-08-14 佛山市聚成知识产权服务有限公司 A kind of data high-speed processing conversion communication means and device
CN110609834A (en) * 2018-05-29 2019-12-24 西安电子科技大学 Multi-source heterogeneous government affair data extraction system based on Agent
CN110609834B (en) * 2018-05-29 2023-04-18 西安电子科技大学 Multi-source heterogeneous government affair data extraction system based on Agent
CN109803022A (en) * 2019-01-30 2019-05-24 浙江蓝鸽科技有限公司 A kind of digitalization resource shared system and its method of servicing
CN109803022B (en) * 2019-01-30 2022-02-18 浙江蓝鸽科技有限公司 Digital resource sharing system and service method thereof
CN113435199A (en) * 2021-07-18 2021-09-24 谢勇 Storage and reading interference method and system for character corresponding culture
CN113435199B (en) * 2021-07-18 2023-05-26 谢勇 Storage and reading interference method and system for character corresponding culture

Similar Documents

Publication Publication Date Title
CN104182465A (en) Network-based big data processing method
CN102193936B (en) Data classification method and device
CN105468744B (en) Big data platform for realizing tax public opinion analysis and full text retrieval
CN106489149A (en) A kind of data mask method based on data mining and mass-rent and system
CN109726393B (en) Policy analysis system and method based on natural language processing technology
CN111899089A (en) Enterprise risk early warning method and system based on knowledge graph
CN102542061B (en) Intelligent product classification method
CN112650848A (en) Urban railway public opinion information analysis method based on text semantic related passenger evaluation
CN103020159A (en) Method and device for news presentation facing events
CN107577724A (en) A kind of big data processing method
CN106599160A (en) Content rule base management system and encoding method thereof
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN104537341A (en) Human face picture information obtaining method and device
CN106844782B (en) Network-oriented multi-channel big data acquisition system and method
CN106649718B (en) A kind of big data acquisition and processing method for PDM system
CN104008107A (en) Implement method of knowledge base on operation and maintenance management
CN111143394B (en) Knowledge data processing method, device, medium and electronic equipment
CN105095436A (en) Automatic modeling method for data of data sources
CN104268214B (en) A kind of user's gender identification method and system based on microblog users relation
CN103226577A (en) News clustering method
KR102345410B1 (en) Big data intelligent collecting method and device
CN112084448A (en) Similar information processing method and device
CN103870567A (en) Automatic identifying method for webpage collecting template of vertical search engine in cloud computing
CN105631634A (en) Cross-terminal interactive logistics big data real-time analysis system
CN206224473U (en) Information collection system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141203
