Summary of the Invention
To overcome the deficiencies of the prior art, the first object of the present invention is to provide a network-oriented multichannel big data acquisition system.
The network-oriented multichannel big data acquisition system of the present invention is built on a distributed directed-acquisition architecture composed of a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit, and a relational database data acquisition unit;
The forum data acquisition unit is configured to collect network data from online forums and offline forums by means of a dynamic web page acquisition method and a web page information extraction method, respectively;
The blog data acquisition unit is configured to traverse blog sites breadth-first in order to obtain blog Feed addresses, to collect the blog corresponding to each Feed address in real time, to track updated blog posts, and to gather blog information by incremental update;
The news data acquisition unit is configured to extract the body text of news web pages using a method based on a row block distribution function, thereby obtaining news data;
The relational database data acquisition unit is configured to collect data from relational databases in batches using a data transfer tool.
Further, in the forum data acquisition unit, network data in a forum is collected with the column (that is, the forum board) as the basic unit. Collection proceeds through four stages: column page acquisition, column page information extraction, post page acquisition, and post page information extraction. Web page acquisition and web page information extraction are thereby combined to obtain the network data in the forum.
By combining efficient dynamic web page acquisition with web page information extraction, the present invention collects the posts in specified columns of specified forum sites, together with their meta-information, in a real-time, comprehensive, and accurate manner.
Further, the blog data acquisition unit consists of one Feed discoverer and multiple information collectors. The Feed discoverer obtains the URL or RSS address of a blog and follows the links on each blog page to obtain the URL or RSS addresses of other blogs. Each collector performs incremental refresh collection on blogs, extracts the information of newly published posts, generates the corresponding post records, and stores them in the database.
The present invention can collect updated blog data in real time, making data acquisition both timely and accurate.
Further, the news data acquisition unit includes a web page HTML source pre-processing module, configured to perform encoding handling, script removal, and special-character handling on the HTML source of a web page; and
a format tag removal module, configured to remove format tags from the pre-processed HTML source to obtain a rough web page text; and
a body text extraction module, configured to extract the desired body text from the rough web page text using a preset distribution function of per-row character counts, thereby obtaining news data.
The present invention can thus obtain news data intuitively, efficiently, and accurately.
Further, in the relational database data acquisition unit, the data transfer tool is Sqoop.
Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (for example MySQL, Oracle, Postgres, or any other relational database that supports the JDBC specification) into Hadoop's HDFS. It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses the metadata schema to determine data types, which ensures type-safe data handling when data is transferred from the data source to Hadoop. Sqoop is designed for bulk transfer of big data: it partitions the data set and creates Hadoop tasks to process each partition.
The second object of the present invention is to provide a network-oriented multichannel big data acquisition method.
The network-oriented multichannel big data acquisition method of the present invention uses a distributed directed-acquisition architecture to collect forum data, blog data, news data, and relational database data simultaneously and in a distributed manner;
wherein network data in online forums and offline forums is collected by a dynamic web page acquisition method and a web page information extraction method, respectively;
when collecting blog data, blog Feed addresses are obtained first; the blog corresponding to each Feed address is then collected in real time, updated blog posts are tracked, and blog information is gathered by incremental update;
the body text of news web pages is extracted using a method based on a row block distribution function, thereby obtaining news data;
data in relational databases is collected in batches using a data transfer tool.
Further, when collecting forum data, network data in a forum is collected with the column as the basic unit. Collection proceeds through four stages: column page acquisition, column page information extraction, post page acquisition, and post page information extraction. Web page acquisition and web page information extraction are thereby combined to obtain the network data in the forum.
Further, when collecting blog data, a Feed discoverer obtains the URL or RSS address of a blog and follows the links on each blog page to obtain the URL or RSS addresses of other blogs; collectors perform incremental refresh collection on blogs, extract the information of newly published posts, generate the corresponding post records, and store them in the database.
Further, the process of collecting news data specifically includes:
performing encoding handling, script removal, and special-character handling on the HTML source of a web page;
removing format tags from the pre-processed HTML source to obtain a rough web page text;
extracting the desired body text from the rough web page text using a preset distribution function of per-row character counts, thereby obtaining news data.
Further, the data transfer tool is Sqoop.
Compared with the prior art, the beneficial effects of the present invention are as follows:
(1) Faced with massive volumes of network information in diverse forms, the present invention uses a distributed directed-acquisition architecture to collect forum data, blog data, news data, and relational database data simultaneously and in a distributed manner. It accurately identifies and extracts information of different sources and formats, gathers information efficiently and comprehensively, tracks information updates in a timely manner, and reduces the workload of manually maintaining big data.
(2) The present invention ensures, as far as possible, the efficiency, comprehensiveness, and timeliness of network information acquisition, and provides the upper-layer analysis and processing modules with a comprehensive, stable, and secure information source.
Detailed Description of the Embodiments
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present application belongs.
It should also be noted that the terminology used herein is for describing specific embodiments only and is not intended to limit the exemplary embodiments of the present application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
Fig. 1 is a schematic structural diagram of the network-oriented multichannel big data acquisition system of the present invention.
As shown in Fig. 1, the network-oriented multichannel big data acquisition system of the present invention is built on a distributed directed-acquisition architecture composed of a forum data acquisition unit, a blog data acquisition unit, a news data acquisition unit, and a relational database data acquisition unit;
The forum data acquisition unit is configured to collect network data from online forums and offline forums by means of a dynamic web page acquisition method and a web page information extraction method, respectively;
The blog data acquisition unit is configured to traverse blog sites breadth-first in order to obtain blog Feed addresses, to collect the blog corresponding to each Feed address in real time, to track updated blog posts, and to gather blog information by incremental update;
The news data acquisition unit is configured to extract the body text of news web pages using a method based on a row block distribution function, thereby obtaining news data;
The relational database data acquisition unit is configured to collect data from relational databases in batches using a data transfer tool.
To suit the characteristics of different types of networks, the present invention adopts directed acquisition: the target sites in the different networks serve as the basic task units of information acquisition, and each acquisition task can apply its own collection rules and strategies (for example crawl depth, collection update frequency, and information extraction template). To meet the scale and flexibility requirements of network data acquisition, a distributed directed-acquisition architecture of the "master-slave distribution, autonomous cooperation" type is used.
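The per-task rules and the master-slave distribution described above can be illustrated with a minimal Python sketch. The specification does not define a data structure or a scheduling policy, so the class `AcquisitionTask`, its field names, and the round-robin assignment in `assign_tasks` are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcquisitionTask:
    """One directed-acquisition task bound to a single target site (hypothetical)."""
    site_url: str
    max_depth: int          # crawl depth allowed for this site
    refresh_minutes: int    # how often the site is re-collected
    template_id: str        # information-extraction template to apply


def assign_tasks(tasks, workers):
    """Master side: spread tasks over slave workers round-robin (assumed policy)."""
    plan = {worker: [] for worker in workers}
    for i, task in enumerate(tasks):
        plan[workers[i % len(workers)]].append(task)
    return plan
```

Each worker would then run its own tasks independently using the rules stored in the task, which is one possible reading of "autonomous cooperation".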
Faced with massive volumes of network information in diverse forms, the system must accurately identify and extract information of different sources and formats, gather information efficiently and comprehensively, track information updates in a timely manner, and keep the maintenance workload as low as possible. To this end, the present invention uses the latest semi-automatic generation of vertical search templates, optimized access to dynamic pages, and an intelligent crawl scheduling strategy, so as to ensure as far as possible the efficiency, comprehensiveness, and timeliness of network information acquisition and to provide the upper-layer analysis and processing modules with a comprehensive, stable, and secure information source.
By combining efficient dynamic web page acquisition with web page information extraction, the present invention collects the posts in specified columns of specified forum sites, together with their meta-information, in a real-time, comprehensive, and accurate manner.
The information source for forum information acquisition takes the column (that is, the forum board) as its basic unit. Given a column, acquiring its information involves four main stages (which may run in parallel in actual operation): column page acquisition → column page extraction → post page acquisition → post page extraction, as shown in Fig. 2.
Collection starts from the column entry page, so the columns to be collected can be located directly, which matches the requirement of directed data acquisition exactly. Through the four stages of column page acquisition, column page information extraction, post page acquisition, and post page information extraction, web page acquisition and web page information extraction are organically combined, which effectively resolves many of the problems found in conventional information acquisition techniques.
The column page of a forum contains an index list of posts, and the list holds rich post meta-information. Each row of the list records one set of meta-information for a post, including the post subject, the poster, the posting time, the number of views, and the number of replies. This meta-information is very important for forum data analysis. Column pages are generally organized in a regular way, so post meta-information can be extracted effectively from the column page. The method has two parts: 1. extracting metadata from the column page (the term metadata is used rather than meta-information because the meaning of each item, such as title or poster, is not yet known); 2. integrating and storing the metadata: the meaning of each metadata item is identified (referred to as parsing the metadata), turning the metadata into true meta-information, which is then stored in the database. The flow of the whole method is shown in Fig. 3.
For metadata extraction, the offline operation is as follows: the user provides one column page as a sample page, and an unsupervised learning method generates a template for column pages similar to the training sample. The online operation is as follows: metadata is extracted from new column pages according to the template. Metadata extraction operates on the DOM. The extraction process makes full use of the correspondence between the post records in the column page (and the attributes within each record) and the nodes of the DOM tree, as well as the structural characteristics of those nodes. This extraction method offers high extraction efficiency, accurate positioning, and relatively low maintenance cost.
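The two-part method (metadata extraction, then metadata integration) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the post list is assumed to be an HTML table with one `<tr>` per post, and the template is a hand-written list of field names rather than one generated by unsupervised learning as the specification describes. The names `RowExtractor`, `extract_metadata`, and `integrate` are hypothetical.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def extract_metadata(column_html):
    """Part 1: pull raw metadata rows whose meaning is not yet known."""
    parser = RowExtractor()
    parser.feed(column_html)
    parser.close()
    return parser.rows


def integrate(rows, template):
    """Part 2: attach meaning by mapping cell positions to field names."""
    return [dict(zip(template, row)) for row in rows if len(row) == len(template)]
```

For example, `integrate(extract_metadata(page), ["title", "author", "posted", "views", "replies"])` yields one dictionary per post row, ready to be stored.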
In a specific implementation, the basic task units also include the blog data acquisition unit, which traverses blog sites breadth-first in order to obtain blog Feed addresses, collects the blog corresponding to each Feed address in real time, tracks updated blog posts, and gathers blog information by incremental update.
With the system architecture shown in Fig. 4, the system follows a distributed design with one Feed discoverer and multiple information collectors. The goal of the Feed discovery module is to find as many RSS or Atom addresses as possible for the blogs hosted by each blog service provider (BSP). Analysis of the blog URL and RSS addresses of each BSP shows that they all follow certain conventions. These conventions can be used to decide whether a page is a blog page, and the links on each blog page are then followed to find more blogs.
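The Feed discovery strategy described above can be sketched in Python as a breadth-first traversal. The URL convention encoded in `BLOG_PAGE`, the derived `/rss` suffix, and the `get_links` callback (which stands in for fetching a page and returning its outgoing links) are hypothetical; a real BSP would require its own pattern.

```python
import re
from collections import deque

# Hypothetical URL convention of one blog service provider (BSP).
BLOG_PAGE = re.compile(r"^https://blog\.example\.com/u/(\w+)/?$")


def feed_url(blog_url):
    """Derive the feed address from a blog home page, or None if it is not a blog page."""
    match = BLOG_PAGE.match(blog_url)
    return f"https://blog.example.com/u/{match.group(1)}/rss" if match else None


def discover_feeds(seed, get_links):
    """Breadth-first traversal: follow links between blog pages and collect feed addresses."""
    seen, feeds, queue = {seed}, [], deque([seed])
    while queue:
        url = queue.popleft()
        feed = feed_url(url)
        if feed is None:
            continue  # not a blog page, so its links are not expanded
        feeds.append(feed)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return feeds
```

Passing the link lookup in as a function keeps the traversal logic separate from network access, so the same code can be exercised against an in-memory link graph.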
The collector is responsible for incremental refresh collection of blogs: it extracts the information of newly published posts, generates the corresponding post records, and stores them in the database. Its function is shown in Fig. 5. The present invention can collect updated blog data in real time, making data acquisition both timely and accurate.
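A minimal Python sketch of the incremental refresh step is given below. It assumes the feed is an RSS 2.0 document and that the `<link>` of each item identifies a post uniquely; the function name `new_posts` and the in-memory `seen_links` set (standing in for the database of already stored records) are hypothetical.

```python
import xml.etree.ElementTree as ET


def new_posts(rss_xml, seen_links):
    """Return the items of an RSS 2.0 document not collected before, and record them."""
    fresh = []
    for item in ET.fromstring(rss_xml).iter("item"):
        link = item.findtext("link", default="").strip()
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append({
                "title": item.findtext("title", default="").strip(),
                "link": link,
            })
    return fresh
```

Calling the function repeatedly with the same `seen_links` set returns only the posts published since the previous refresh, which is the incremental behavior the collector needs.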
When obtaining news data, the body text of news web pages is extracted using a method based on a row block distribution function, thereby obtaining news data.
The main task of body data extraction is to identify, from the unstructured or semi-structured information contained in the Web, the information the user is interested in and to convert it into well-structured, semantically clear data. The input of the information extraction system is raw text and the output is information in a fixed format. Finally, the extracted data is cleaned, organized, and stored in a relational database for subsequent precise querying and pattern extraction.
To capture the body text of news web pages effectively, the body text is extracted using the method based on a row block distribution function, which yields the core content of the document. The body text extraction framework based on the row block distribution function is shown in Fig. 6.
In HTML, text is always interleaved with tags. Admittedly, the markup that tags apply to text plays a significant role in term weighting and in ranking results. However, precisely because HTML tags and text are interleaved in complex and non-standard ways, general-purpose body text extraction is hard to achieve; in the end different rules have to be defined for different websites, and time and space efficiency suffers considerably.
For this reason, the present invention proposes a general method based on a row block distribution function that extracts the body text in linear time O(N). The method rests on two key bases: 1. the density of the body text region; 2. the length of the row blocks.
Basis 1: the body text region of a web page is certainly one of the regions where text is most densely distributed, and it may well be the largest such region, but this is not guaranteed. For example, when the comments are long, or when the news body is short and a large, dense block of navigation text is present, the body text region is not the largest block.
Basis 2: the length information of row blocks can effectively resolve this problem.
Combining basis 1 and basis 2 yields good body text extraction. Both are merged into the row block distribution function, as follows:
First, the tags are removed from the web page HTML so that only the text remains, while the positions of all blank lines left by tag removal are preserved. The remaining text is called Ctext.
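The construction of Ctext can be sketched in Python as follows. This is a regex-based illustration, not the pre-processing module of the specification: encoding handling and special-character handling are omitted, and the function name `build_ctext` is hypothetical. The key point it shows is that every removed tag is replaced by the newlines it spanned, so row positions survive.

```python
import re


def build_ctext(html):
    """Strip markup but keep one output row per input row, so blank positions survive."""

    def keep_newlines(match):
        # Replace the removed span with the newlines it contained.
        return "\n" * match.group(0).count("\n")

    # Drop script/style bodies and comments first, then all remaining tags.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", keep_newlines, html)
    html = re.sub(r"(?s)<!--.*?-->", keep_newlines, html)
    html = re.sub(r"(?s)<[^>]*>", keep_newlines, html)
    return [row.strip() for row in html.split("\n")]
```

The result is a list with exactly as many rows as the HTML source had lines; rows that held only markup become empty strings.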
Definition 1. Row block:
Taking a row number in Ctext as the axis, the K rows around it are taken (either direction of context may be used; K < 5, and here K = 3 with the direction downward; K is called the row block thickness) and together are called a row block Cblock. Row block i is the row block whose axis is row number i of Ctext;
Definition 2. Row block length:
The total number of characters in a Cblock after all whitespace characters (\n, \r, \t, etc.) have been removed is called the length of the row block;
Definition 3. Row block distribution function:
Taking each row of Ctext as an axis gives LinesNum(Ctext) - K Cblocks in total. The row block distribution function is the function plotted with [1, LinesNum(Ctext) - K] on the horizontal axis and the length of the corresponding row block on the vertical axis;
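Definitions 1 to 3 can be illustrated with the following Python sketch. It assumes that row block i consists of the K consecutive rows starting at row i (one reading of "direction downward") and uses 0-based indices; the function name `block_lengths` is hypothetical. A sliding-window sum keeps the computation linear in the number of rows.

```python
import re


def block_lengths(ctext, k=3):
    """Y[i] = number of non-whitespace characters in rows i .. i+k-1 of Ctext."""
    per_row = [len(re.sub(r"\s", "", row)) for row in ctext]
    n_blocks = max(len(per_row) - k, 0)  # LinesNum(Ctext) - K blocks, as in Definition 3
    lengths, window = [], sum(per_row[:k])
    for i in range(n_blocks):
        lengths.append(window)
        # Slide the window down one row: add the entering row, drop the leaving row.
        window += per_row[i + k] - per_row[i]
    return lengths
```

For the rows `["", "ab c", "", "defg", "hi", "", ""]` the per-row counts are 0, 3, 0, 4, 2, 0, 0, so with K = 3 the distribution is `[3, 7, 6, 6]`.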
The row block distribution function can be computed in O(N) time, and the body text region can be seen directly on its plot. The plot shows clearly that the correct body text region is a continuous region containing the maximum value of the distribution function, and that this region usually has one surge point and one plunge point.
The body text extraction problem is therefore converted into finding the two boundary points, surge and plunge, on the row block distribution function; the region enclosed by these two boundary points is continuous and contains the maximum row block length of the current web page.
To find the start row block X_start and the end row block X_end of the body text region (X is a row number and Y(X) is the length of the row block whose axis is X), the following four conditions must be satisfied:
(1) Y(X_start) > Y(X_t) (Y(X_t) is the threshold for the first surge point; the surge point must exceed this threshold);
(2) Y(X_n) ≠ 0 (n ∈ [start+1, start+K], where K is the row block thickness; the row blocks immediately following the surge point must not have length 0, which avoids noise);
(3) Y(X_m) = 0 (m ∈ [end, end+1]; the row blocks immediately following the plunge point have length 0, which ensures that the body text has ended);
(4) there exists an X at which max(Y(X)) is attained such that X ∈ [X_start, X_end] (this ensures that the region contains the row block of maximum length).
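The search for X_start and X_end can be sketched in Python as a single forward scan over the distribution values. This is one possible reading of conditions (1) to (4), not a definitive implementation: the function name `find_body_region`, the 0-based indices, the default threshold value, and the fallback of ending the region at the last block when no plunge point is found are all assumptions.

```python
def find_body_region(y, k=3, threshold=30):
    """Return (start, end) block indices satisfying conditions (1)-(4), or None.

    y is the row block distribution (list of block lengths); threshold is an
    arbitrary illustrative value and would need tuning on real pages.
    """
    if not y:
        return None
    peak = max(y)
    start = 0
    while start < len(y):
        following = y[start + 1:start + 1 + k]
        # (1) surge point above the threshold; (2) the next k blocks are non-empty.
        if y[start] > threshold and len(following) == k and all(following):
            # (3) plunge point: first index where this block and the next are both 0.
            end = start + 1
            while end < len(y) - 1 and not (y[end] == 0 and y[end + 1] == 0):
                end += 1
            # (4) the region must contain the maximum block length.
            if peak in y[start:end + 1]:
                return start, end
            start = end + 1
        else:
            start += 1
    return None
```

Because the scan resumes after each rejected region, every block is visited a bounded number of times and the search stays linear in the number of blocks. The body text would then be the Ctext rows covered by the returned block range.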
The present invention can thus obtain news data intuitively, efficiently, and accurately.
The news data acquisition unit of the present invention includes a web page HTML source pre-processing module, configured to perform encoding handling, script removal, and special-character handling on the HTML source of a web page; and
a format tag removal module, configured to remove format tags from the pre-processed HTML source to obtain a rough web page text; and
a body text extraction module, configured to extract the desired body text from the rough web page text using a preset distribution function of per-row character counts, thereby obtaining news data.
In a specific implementation, the data transfer tool in the relational database data acquisition unit is Sqoop.
Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (for example MySQL, Oracle, Postgres, or any other relational database that supports the JDBC specification) into Hadoop's HDFS. It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses the metadata schema to determine data types, which ensures type-safe data handling when data is transferred from the data source to Hadoop. Sqoop is designed for bulk transfer of big data: it partitions the data set and creates Hadoop tasks to process each partition.
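As an illustration of batch collection with Sqoop, the Python sketch below builds the argument list for one `sqoop import` run using standard Sqoop 1 options. The helper name `sqoop_import_args` and all connection values in the usage example are hypothetical, and the command is only constructed here, not executed.

```python
def sqoop_import_args(jdbc_url, username, password_file, table, target_dir, mappers=4):
    """Build the argument list for one `sqoop import` run (not executed here)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # source table to import
        "--target-dir", target_dir,        # HDFS directory that receives the data
        "--num-mappers", str(mappers),     # number of parallel map tasks (partitions)
    ]
```

On a host where Sqoop and Hadoop are installed, the list could be passed to `subprocess.run(args, check=True)`; the `--num-mappers` value controls how many partitions are imported in parallel.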
The Hadoop framework consists of the distributed file system HDFS and MapReduce. HDFS is Hadoop's file system and is used to store very large files; MapReduce is Hadoop's parallel programming model and is used for in-depth analysis of the data stored on HDFS.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS was originally developed as infrastructure for the Apache Nutch search engine project.
HDFS consists mainly of the Client, Datanodes, and the Namenode; its architecture is shown in Fig. 7. In a cluster built on Hadoop, there are typically one or two hosts acting as Namenode and a number of hosts acting as Datanodes. The Client represents a client program that uses HDFS; the Namenode is a host in the Hadoop cluster responsible for tasks such as keeping the information about the data nodes, distributing computing tasks, and the final reduction; the Datanodes are responsible for storing and processing data. To ensure data safety, HDFS adds a moderate amount of redundant data: multiple copies of the same data are stored on different Datanodes, typically three copies.
Faced with massive volumes of network information in diverse forms, the present invention uses a distributed directed-acquisition architecture to collect forum data, blog data, news data, and relational database data simultaneously and in a distributed manner. It accurately identifies and extracts information of different sources and formats, gathers information efficiently and comprehensively, tracks information updates in a timely manner, and reduces the workload of manually maintaining big data.
The present invention ensures, as far as possible, the efficiency, comprehensiveness, and timeliness of network information acquisition, and provides the upper-layer analysis and processing modules with a comprehensive, stable, and secure information source.
The network-oriented multichannel big data acquisition method of the present invention uses a distributed directed-acquisition architecture to collect forum data, blog data, news data, and relational database data simultaneously and in a distributed manner;
wherein network data in online forums and offline forums is collected by a dynamic web page acquisition method and a web page information extraction method, respectively;
when collecting blog data, blog Feed addresses are obtained first; the blog corresponding to each Feed address is then collected in real time, updated blog posts are tracked, and blog information is gathered by incremental update;
the body text of news web pages is extracted using a method based on a row block distribution function, thereby obtaining news data;
data in relational databases is collected in batches using a data transfer tool.
Specifically, when collecting forum data, network data in a forum is collected with the column as the basic unit. Collection proceeds through four stages: column page acquisition, column page information extraction, post page acquisition, and post page information extraction. Web page acquisition and web page information extraction are thereby combined to obtain the network data in the forum.
By combining efficient dynamic web page acquisition with web page information extraction, the present invention collects the posts in specified columns of specified forum sites, together with their meta-information, in a real-time, comprehensive, and accurate manner.
Specifically, when collecting blog data, a Feed discoverer obtains the URL or RSS address of a blog and follows the links on each blog page to obtain the URL or RSS addresses of other blogs; collectors perform incremental refresh collection on blogs, extract the information of newly published posts, generate the corresponding post records, and store them in the database.
The present invention can collect updated blog data in real time, making data acquisition both timely and accurate.
Specifically, the process of collecting news data includes:
performing encoding handling, script removal, and special-character handling on the HTML source of a web page;
removing format tags from the pre-processed HTML source to obtain a rough web page text;
extracting the desired body text from the rough web page text using a preset distribution function of per-row character counts, thereby obtaining news data.
The data transfer tool is Sqoop.
Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (for example MySQL, Oracle, Postgres, or any other relational database that supports the JDBC specification) into Hadoop's HDFS. It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses the metadata schema to determine data types, which ensures type-safe data handling when data is transferred from the data source to Hadoop. Sqoop is designed for bulk transfer of big data: it partitions the data set and creates Hadoop tasks to process each partition.
Faced with massive volumes of network information in diverse forms, the present invention uses a distributed directed-acquisition architecture to collect forum data, blog data, news data, and relational database data simultaneously and in a distributed manner. It accurately identifies and extracts information of different sources and formats, gathers information efficiently and comprehensively, tracks information updates in a timely manner, and reduces the workload of manually maintaining big data.
The present invention ensures, as far as possible, the efficiency, comprehensiveness, and timeliness of network information acquisition, and provides the upper-layer analysis and processing modules with a comprehensive, stable, and secure information source.
Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the present invention. Those skilled in the art should understand that any modifications or variations that can be made on the basis of the technical solution of the present invention without creative effort remain within the scope of protection of the present invention.