CN104182465A

CN104182465A - Network-based big data processing method

Info

Publication number: CN104182465A
Application number: CN201410348409.2A
Authority: CN
Inventors: 贾岩
Original assignee: ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Current assignee: ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date: 2014-07-21
Filing date: 2014-07-21
Publication date: 2014-12-03

Abstract

The invention provides a network-based big data processing method which is characterized by collecting data from the internet, classifying and clustering the data and establishing a big database. The method comprises steps as follows: a data collection webpage is customized according to the expected target; principal data blocks of the webpage are determined according to the webpage structure, and webpage data extraction templates are automatically generated for extracting webpage data; the webpage data are uniformly encoded, repeated data are normalized, and data are screened; the data are divided into N data classes according to a preset classification model; data are clustered according to a preset clustering algorithm; and according to the classification and clustering result, the data are uniformly stored, indexes are established, and a big database is formed. According to the network-based big data processing method provided by the invention, the webpage data can be effectively extracted, the repeated information is normalized, and a user can conveniently and effectively utilize the webpage data.

Description

A network-based big data processing method

Technical field

The present invention relates to the field of information extraction technology, and in particular to a network-based big data processing method.

Background art

Information extraction is an emerging research field. It generally refers to the process of automatically identifying predefined types of information, such as entities, relations, and events, in a given collection of documents, and of storing and managing that information in structured form. Information extraction has important applications in many fields.

In recent years, with the development of networks, the amount of information on the internet has kept growing. Nearly all network information is presented to users as structured or semi-structured text. Web page information extraction extracts the relevant information contained in web pages and structures it so that it takes a uniform organizational form. Its main task is to extract predetermined information points from a variety of web pages and then integrate them in a unified form for convenient viewing and comparison.

On the internet, information on the same subject is usually scattered across different websites and presented in different forms, so with the prior art it is difficult to achieve the expected web data mining. In addition, information on the internet is reposted frequently, so how to normalize duplicate information is also a key problem.

Summary of the invention

To address the technical problems described in the background art, the present invention proposes a network-based big data processing method that can effectively extract web page data and normalize duplicate information, making it convenient for users to use web page data effectively.

The network-based big data processing method proposed by the present invention collects data from the internet, classifies and clusters the data, and establishes a big database. The method comprises:

customizing data collection web pages according to the expected target;

determining the main data blocks of a web page according to the web page structure, and automatically generating a web page data extraction template to extract the web page data;

uniformly encoding the web page data, normalizing duplicate data, and screening the data;

dividing the data into N data classes according to a preset classification model;

clustering the data according to a preset clustering algorithm;

according to the classification and clustering results, storing the data uniformly and establishing indexes to form a big database.
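The step of determining the main data block from the web page structure can be illustrated with a minimal sketch. The text does not specify how the block is chosen, so the heuristic below (pick the block-level element that directly holds the most text) is an assumption, as are the names `BlockTextCollector`, `main_block`, `BLOCK_TAGS`, and `SKIP_TAGS`. The tag and attributes of the chosen block could then serve as a simple extraction template for pages with the same layout.

```python
from html.parser import HTMLParser

# Assumed set of elements treated as candidate data blocks (not from the source).
BLOCK_TAGS = {"div", "article", "section", "main", "td"}
SKIP_TAGS = {"script", "style"}


class BlockTextCollector(HTMLParser):
    """Collects the text found directly inside each block-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # indices of the currently open block elements
        self.blocks = []     # one dict per block: tag, attrs, text parts
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.blocks.append({"tag": tag, "attrs": dict(attrs), "parts": []})
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and not self.skip_depth:
            # Credit the text only to the innermost open block.
            self.blocks[self.stack[-1]]["parts"].append(text)


def main_block(html):
    """Return the block with the most text, or None if the page has no blocks."""
    collector = BlockTextCollector()
    collector.feed(html)
    collector.close()
    if not collector.blocks:
        return None
    best = max(collector.blocks, key=lambda b: sum(len(p) for p in b["parts"]))
    return {
        "tag": best["tag"],
        "attrs": best["attrs"],          # usable as a simple extraction template
        "text": " ".join(best["parts"]),
    }
```

A text-length heuristic is only one possible choice; it assumes reasonably well-formed HTML and would need refinement for pages whose body text is split across many sibling blocks.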

Preferably, customizing data collection web pages according to the expected target comprises:

presetting web pages within the industry as data sources;

setting up a network probe with a built-in domain ontology, which automatically discovers web pages related to the ontology as collection points.

Preferably, uniformly encoding the web page data, normalizing duplicate data, and screening the data specifically comprises:

encoding each segment of text;

comparing the segments according to their encoding to determine the degree of data duplication;

normalizing duplicate data and screening the data.

Preferably, storing the data uniformly and establishing indexes according to the classification and clustering results to form a big database is specifically divided into:

clustering the N data classes;

clustering the data contained in each data class.

The manner in which the present invention extracts web page data is efficient and achieves good recall, avoiding omission of information. Duplicate information can be eliminated effectively, which greatly reduces the storage space occupied by the data, eliminates redundancy, reduces the load of subsequent processing, and improves data processing efficiency. Using the preset classification model and clustering algorithm, the data are classified and clustered, stored uniformly in a database, and indexed, which makes it convenient for users to manage, search, and use the extracted data.

Brief description of the drawings

Fig. 1 is a flowchart of the network-based big data processing method proposed by the present invention.

Detailed description of the embodiments

With reference to Fig. 1, the network-based big data processing method proposed by the present invention collects data from the internet, classifies and clusters the data, and establishes a big database. The method comprises the following steps:

customizing data collection web pages according to the expected target;

In this embodiment, web page data are extracted automatically and efficiently, and the collected data are comparatively comprehensive, avoiding omission of information. After the data are uniformly encoded and duplicate data are normalized, the storage space occupied by the data is greatly reduced, redundancy is eliminated, and the load of subsequent processing is reduced. In this embodiment, the data are classified and clustered, and database indexes are then built from the classification and clustering results, making it convenient for users to manage, search, and use the extracted data.
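The classification into N data classes can be sketched as follows. The text does not describe the preset classification model, so this example assumes the simplest possible form: a hypothetical mapping from class name to keywords, with each text assigned to the class whose keywords occur most often. The function name `classify` and the fallback class `"other"` are likewise illustrative.

```python
import re
from collections import Counter


def classify(text, class_keywords, default="other"):
    """Return the class whose keywords occur most often in the text.

    class_keywords is an assumed stand-in for the preset classification
    model: {class_name: [keyword, ...]}. Only single-word keywords are matched.
    """
    counts = Counter(re.findall(r"\w+", text.lower()))
    best_class, best_score = default, 0
    for name, keywords in class_keywords.items():
        score = sum(counts[k.lower()] for k in keywords)
        if score > best_score:  # ties keep the class seen first
            best_class, best_score = name, score
    return best_class
```

With a model such as `{"prices": ["price", "cost"], "policy": ["regulation", "law"]}`, N is simply the number of keys; a trained statistical classifier could be substituted behind the same function signature.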

In this embodiment, data collection web pages are customized according to the expected target. The collected web pages come from two kinds of sources:

web pages within the industry that are preset as data sources;

collection points, which are web pages related to the ontology that are discovered automatically by a network probe with a built-in domain ontology.

Presetting the data sources focuses collection on the web pages the user expects, which makes the extraction of web page data more targeted and helps improve data collection efficiency. The collection points serve as a supplement to the data sources and improve the recall of data collection. Because the data sources and the collection points complement each other, data collection efficiency and recall can reach a satisfactory balance.
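How a probe might decide that a page is related to the domain ontology is not detailed in the text. A minimal sketch, assuming the ontology is reduced to a set of single-word domain terms and that candidate page texts have already been fetched, is given below; `ontology_score`, `select_collection_points`, and the `threshold` parameter are hypothetical names introduced for illustration.

```python
import re


def ontology_score(text, ontology_terms):
    """Fraction of ontology terms that occur in the text (case-insensitive)."""
    terms = {t.lower() for t in ontology_terms}
    if not terms:
        return 0.0
    words = set(re.findall(r"\w+", text.lower()))
    return len(terms & words) / len(terms)


def select_collection_points(candidates, ontology_terms, data_sources, threshold=0.5):
    """Return URLs of candidate pages related to the ontology.

    candidates:   {url: page_text} for pages the probe has visited (assumed input).
    data_sources: URLs already preset as data sources; these are skipped so that
                  collection points only supplement them.
    """
    points = []
    for url, text in candidates.items():
        if url in data_sources:
            continue
        if ontology_score(text, ontology_terms) >= threshold:
            points.append(url)
    return sorted(points)
```

Raising `threshold` favors precision and collection efficiency, while lowering it favors recall, which mirrors the balance between data sources and collection points described above.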

In this embodiment, uniformly encoding the web page data, normalizing duplicate data, and screening the data specifically comprises:

encoding each segment of text;

normalizing duplicate data and screening the data.

Encoding the text segment by segment and comparing the segments makes it possible to detect the degree of text duplication effectively and to avoid omissions.
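Segment encoding and segment comparison can be sketched as follows. The encoding scheme is not specified in the text, so this example assumes that each sentence-like segment is normalized and hashed with SHA-1, and that the degree of duplication is the fraction of one document's segment codes found in another. The function names and the `threshold` value of 0.8 are illustrative assumptions.

```python
import hashlib
import re


def encode_segments(text):
    """Split text into sentence-like segments and return one hash code per segment."""
    segments = [s for s in re.split(r"[.!?。！？\n]+", text) if s.strip()]
    codes = []
    for seg in segments:
        # Normalize case and whitespace so trivial edits do not hide a repost.
        normalized = " ".join(seg.lower().split())
        codes.append(hashlib.sha1(normalized.encode("utf-8")).hexdigest())
    return codes


def duplication_degree(codes_a, codes_b):
    """Fraction of the segments of A whose code also appears in B."""
    if not codes_a:
        return 0.0
    known = set(codes_b)
    return sum(1 for c in codes_a if c in known) / len(codes_a)


def normalize_duplicates(documents, threshold=0.8):
    """Keep the first copy of each document and drop later near-copies."""
    kept = []  # list of (document, segment codes)
    for doc in documents:
        codes = encode_segments(doc)
        if any(duplication_degree(codes, kc) >= threshold for _, kc in kept):
            continue
        kept.append((doc, codes))
    return [doc for doc, _ in kept]
```

Comparing per-segment codes rather than one code per document lets a partially reposted article still register a nonzero degree of duplication, which is the advantage the paragraph above attributes to segment comparison.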

In this embodiment, storing the data uniformly and establishing indexes according to the classification and clustering results to form a big database is specifically divided into:

clustering the N data classes;

clustering the data contained in each data class.

According to the classification results, the database is divided into two levels: topic and data class. With the two kinds of cluster analysis performed on this basis, the database can be further subdivided into four levels: topic, topic cluster, data class, and data class cluster. An indexing mechanism is then established on these levels, making it more convenient for users to manage, search, and use the database.
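One way to picture an index over the four levels is a mapping from each path prefix to the records stored beneath it. The sketch below assumes every record already carries a label for each level under the hypothetical field names in `LEVELS`; the actual indexing mechanism is not described in the text.

```python
from collections import defaultdict

# Assumed field names for the four levels named in the description.
LEVELS = ("topic", "topic_cluster", "data_class", "class_cluster")


def build_index(records):
    """Map every prefix of a record's four-level path to the ids of records under it."""
    index = defaultdict(list)
    for record in records:
        path = tuple(record[level] for level in LEVELS)
        for depth in range(1, len(path) + 1):
            index[path[:depth]].append(record["id"])
    return index


def lookup(index, *path):
    """Return the ids stored under a path of one to four levels."""
    return index.get(tuple(path), [])
```

For example, `lookup(index, "metals")` would return every record under the topic `"metals"`, while adding further path components narrows the result one level at a time.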

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to the technical solution and inventive concept of the present invention, shall fall within the protection scope of the present invention.

Claims

1. A network-based big data processing method, characterized in that data are collected from the internet, the data are classified and clustered, and a big database is established, the method comprising:

customizing data collection web pages according to the expected target;

2. The network-based big data processing method according to claim 1, characterized in that customizing data collection web pages according to the expected target comprises:

presetting web pages within the industry as data sources;

3. The network-based big data processing method according to claim 1, characterized in that uniformly encoding the web page data, normalizing duplicate data, and screening the data specifically comprises:

encoding each segment of text;

normalizing duplicate data and screening the data.

4. The network-based big data processing method according to claim 1, characterized in that storing the data uniformly and establishing indexes according to the classification and clustering results to form a big database is specifically divided into:

clustering the N data classes;

clustering the data contained in each data class.