CN105988994A - Web field distributed real time extraction system

Info

Publication number: CN105988994A
Application number: CN201410842253.3A
Authority: CN
Inventors: 刘挺; 孟小峰
Original assignee: Beijing Consulting Data Technology Co Ltd
Current assignee: Beijing Consulting Data Technology Co Ltd
Priority date: 2015-02-06
Filing date: 2015-02-06
Publication date: 2016-10-05

Abstract

The invention provides a novel deep data mining method for extracting data from various websites. The method automatically extracts data from designated websites around the clock, with extraction rules written in CSS/HTML or jQuery format. An automation module extracts data from the websites stored in a list at a set frequency, either a set number of times or in an unlimited loop. A distributed crawler module extracts data from a website at a set frequency and supports multi-level extraction. The extracted content passes through a filtering layer, which fills in missing values and filters the extracted values; finally, the result is stored in a local or remote database according to the configured conditions. The Web-domain distributed real-time extraction system thereby achieves distributed, around-the-clock automatic data extraction and integration.

Description

Distributed real-time extraction system for the Web domain

Technical field

The present invention relates to a Deep Web data integration method and system, and in particular to a distributed data integration method and system.

Background

Traditional search engines such as Baidu and Google can only search shallow data. A large amount of data currently resides inside websites on the network, such as portal websites, forums, and the like. This data is referred to as the Deep Web and cannot be retrieved by traditional search engines. To overcome the inability of traditional search engines to mine data in depth, the present system provides a method that extracts the required data from inside each website and integrates it into the system's own website, where it is offered to users for retrieval, making online queries more convenient.

Summary of the invention

The invention provides a distributed system and method for automatically extracting and integrating data from major websites. The invention comprises two modules.

The first module is the automation module:

1. crawl website list module

2. crawl frequency setting module

The second module is the distributed web crawler module:

1. frequency setting module

2. crawl preparation module

3. crawl content module

3.1 content filtering module

3.2 first-level content crawl implementation module

3.3 second-level data crawl list

3.4 second-level content crawl implementation module

4. database setting module

4.1 local database setting module

4.2 remote database setting module

The data extraction method provided by the present invention comprises the following steps:

(1) Database configuration: set the database configuration of the host side and of the distributed crawler side, so that the database server on the host side accepts remote data storage and updates from the distributed crawler machines.

(2) The program executes the automation module:

1. the program reads the crawl website list and extracts from it the list of configuration files of the websites to be crawled;

2. the program reads the frequency settings, including whether to crawl in a permanent loop and, if not looping, the number of crawls and the rest interval between crawls;

3. the program crawls a website in the list;

4. after the program has finished crawling all websites in the list, it rests for a period of time and then crawls the websites in the list again;

5. as for the repetition described in step 4, the program may crawl without stopping, or may stop after crawling a set number of times;
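The automation steps above can be sketched as a simple loop. This is a minimal illustration rather than the patented implementation: the function name `run_automation`, the injectable `crawl_site` and `sleep` callables, and the choice to rest after every pass (including the last) are assumptions made for the example. The parameter names `for_ever_run`, `query_frequence`, and `sleep_time` follow the configuration fields described later in this document.

```python
import time
from typing import Callable, Iterable


def run_automation(sites: Iterable[str],
                   crawl_site: Callable[[str], None],
                   for_ever_run: bool,
                   query_frequence: int,
                   sleep_time: float,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Crawl every site in the list, rest, and repeat.

    Returns the number of completed passes; this is only reachable
    when for_ever_run is False.
    """
    sites = list(sites)
    passes = 0
    while for_ever_run or passes < query_frequence:
        for site in sites:
            crawl_site(site)   # step 3: crawl one website from the list
        passes += 1
        sleep(sleep_time)      # step 4: rest after the whole list is done
    return passes
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a production scheduler would likely also need error handling per site, which the source does not describe.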

(3) The program executes the distributed crawler module:

1. the program reads the configuration information of a single website, including the crawl frequency, crawl preparation, crawl content, and database configuration;

2. according to the definitions in the crawl frequency module, the program reads the start page, the step length of each stage, the number of steps in each stage, the number of times the stages are repeated, and the pause time required within each cycle;

3. the program reads the crawl preparation module and, simulating a user, sends a request to the website over http or https;

4. the program reads the crawl content settings and performs a round of data processing on the returned information, mainly the first-level data crawl;

5. the program reads the crawl content settings and performs the second-level data crawl on the website;

6. the program screens the crawled content, including filling in missing values and filtering out redundant content;

7. according to the database configuration file, the program saves the content to the local database or to the database on the server side;

8. the program rests for a period of time and then crawls the next page according to the step count and step length in the frequency configuration;

9. when the step count reaches the predetermined value, the website is crawled again according to the frequency in the frequency configuration, starting from the page at which the previous crawl ended.
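Steps 4 and 5 describe a two-level crawl: a first pass over a list page, then a second pass over pages linked from it. The sketch below shows one plausible way to combine the two levels. Everything in it is hypothetical: the function `crawl_two_levels`, the `link` field name, and the injected `fetch`, `extract_first`, and `extract_second` callables are illustration-only names, and merging second-level fields into the first-level record is an assumption, since the source does not say how the two levels are joined.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def crawl_two_levels(list_url: str,
                     fetch: Callable[[str], str],
                     extract_first: Callable[[str], List[Record]],
                     extract_second: Callable[[str], Record],
                     link_field: str = "link") -> List[Record]:
    """Run first-level extraction on the list page, then second-level
    extraction on each linked page, merging the results per record."""
    records = []
    for record in extract_first(fetch(list_url)):
        merged = dict(record)
        link = record.get(link_field)
        if link:  # follow the link only when the first level found one
            merged.update(extract_second(fetch(link)))
        records.append(merged)
    return records
```

Keeping the fetch and extraction functions as parameters separates the traversal order from the site-specific rules, which in the described system come from each website's configuration file.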

The beneficial effect of the invention is that the required data inside each website can be extracted and integrated into the system's own website, where it is offered to users for retrieval, making online queries more convenient.

Brief description of the drawings

Fig. 1 is the system architecture diagram of the data integration system according to the present invention.

Fig. 2 is the architecture diagram of the automation module of the present invention.

Fig. 3 is the flow chart of the automation module of the present invention.

Fig. 4 is a diagram of the internal modules of the distributed crawler of the present invention.

Fig. 5 is the flow chart of the distributed crawler of the present invention.

Detailed description of the invention

One, database setting module:

(1) On the database server at the server end, add a command line that allows the IP of the computer hosting the distributed crawler, together with a specific user name of its own, to write and update data remotely on the database server at the server end.

(2) On the crawler-end computer, in the remote database login module, set the remote server to log in to: the IP address of its computer and the database port, the name of the database to log in to, the user name to use (that is, the specific user name permitted by the server end in item (1)), the login password, and the name of the table to log in to.

(3) On the crawler-end computer, in the local database login module, set the IP address of the local computer and the database port to log in to, the name of the database to log in to, the user name to use (usually root), the login password, and the name of the table to log in to.

Two, automation settings module executed by the program:

Set the automation list file list.xml, which contains:

1.site: the file name and location of the configuration file of each website to be crawled

2.for_ever_run: whether to crawl in a permanent loop

3.query_frequence: the number of crawls, if not crawling in a permanent loop

4.sleep_time: the sleep time after each complete crawl of the websites in the list
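To make the four list.xml fields concrete, the following sketch parses a hypothetical file with Python's standard `xml.etree.ElementTree`. The source names the fields but does not show the file, so the root element `<list>`, the one-element-per-field layout, the `true`/`false` encoding of `for_ever_run`, the sample paths, and the reading of `sleep_time` as a plain number are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical list.xml content; only the field names come from the source.
SAMPLE_LIST_XML = """
<list>
  <site>sites/forum_a/construct.xml</site>
  <site>sites/portal_b/construct.xml</site>
  <for_ever_run>false</for_ever_run>
  <query_frequence>3</query_frequence>
  <sleep_time>600</sleep_time>
</list>
"""


def load_list_config(xml_text: str) -> dict:
    """Parse the automation list file into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        # one <site> entry per website configuration file
        "sites": [el.text.strip() for el in root.findall("site")],
        "for_ever_run": root.findtext("for_ever_run", "false").strip().lower() == "true",
        "query_frequence": int(root.findtext("query_frequence", "1")),
        "sleep_time": float(root.findtext("sleep_time", "0")),
    }


config = load_list_config(SAMPLE_LIST_XML)
```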

Three, distributed crawler module executed by the program:

(1) Set the crawl frequency of the crawler:

1.start_page: the first page the crawler starts crawling from

2.query_frequence: the frequency at which pages are crawled

3.query_step_interval: the page interval to advance to the next page to be crawled after one web page has been crawled

4.query_step_times: the number of steps advanced through the pages in each round

5.sleep_time: the rest time after each round of crawling
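The interaction of these frequency settings can be illustrated with a small generator that lists the page numbers visited in each round. It is a sketch under two assumptions that the source leaves open: `query_frequence` is read as the number of rounds, and each new round continues from the page following the last one crawled. The function name `crawl_rounds` is hypothetical.

```python
from typing import Iterator, List


def crawl_rounds(start_page: int,
                 query_frequence: int,
                 query_step_interval: int,
                 query_step_times: int) -> Iterator[List[int]]:
    """Yield the list of page numbers visited in each round."""
    page = start_page
    for _ in range(query_frequence):          # assumed: number of rounds
        pages = []
        for _ in range(query_step_times):     # steps within one round
            pages.append(page)
            page += query_step_interval       # advance by the step interval
        yield pages  # the caller would rest for sleep_time between rounds


rounds = list(crawl_rounds(start_page=1, query_frequence=2,
                           query_step_interval=10, query_step_times=3))
# rounds == [[1, 11, 21], [31, 41, 51]]
```

A step interval larger than 1 suits sites whose paging parameter is an item offset rather than a page index.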

(2) Set the preparation stage of the crawler:

1.charset: the character encoding of the website to be crawled

2.query_page_front: the front part of the crawl link, not including the page number

3.query_page_end: the back part of the crawl link, not including the page number
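The preparation settings suggest that a crawl link is formed by placing the page number between the front and back parts, and that responses are decoded with the configured charset. A minimal sketch of that reading follows; the helper names `build_page_url` and `decode_body`, the example URL, and the use of `gbk` as a sample charset are assumptions for illustration.

```python
def build_page_url(query_page_front: str, page: int, query_page_end: str) -> str:
    """Concatenate front part, page number, and back part into a crawl link."""
    return f"{query_page_front}{page}{query_page_end}"


def decode_body(raw: bytes, charset: str) -> str:
    """Decode a response body using the charset configured for the site."""
    # errors="replace" keeps the crawl going when a page has stray bytes
    return raw.decode(charset, errors="replace")


url = build_page_url("http://example.com/forum?page=", 3, "&sort=new")
text = decode_body("论坛".encode("gbk"), "gbk")
```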

(3) Set the extraction content of the crawler:

1.filter_list: the filter list

node: each node to be put into the database table

name: the name under which it is put into the table

default_value: the value used to fill in a missing value

regex: a regular expression used to filter the extracted value; the matching information is filtered out.
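The filter rules can be sketched as a function that walks the `filter_list` and cleans one extracted record. This is an illustration under stated assumptions: a value counts as missing when it is `None` or blank, the `regex` pattern is applied by deleting every match, and the regex is applied only to extracted values, not to the default. The function name `apply_filters` and the dictionary representation of a rule are hypothetical.

```python
import re
from typing import Dict, List, Optional


def apply_filters(record: Dict[str, Optional[str]],
                  filter_list: List[dict]) -> Dict[str, Optional[str]]:
    """Fill missing values and strip regex matches, field by field."""
    cleaned = dict(record)
    for rule in filter_list:
        value = cleaned.get(rule["name"])
        if value is None or value.strip() == "":
            # missing value: fall back to the configured default_value
            value = rule.get("default_value", "")
        elif rule.get("regex"):
            # assumed semantics: text matching the pattern is removed
            value = re.sub(rule["regex"], "", value)
        cleaned[rule["name"]] = value
    return cleaned


rules = [
    {"name": "author", "default_value": "unknown", "regex": ""},
    {"name": "views", "default_value": "0", "regex": r"[^0-9]"},
]
result = apply_filters({"author": None, "views": "1,234 views"}, rules)
# result == {"author": "unknown", "views": "1234"}
```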

2.query_level_nodes: the list of nodes whose content is to be crawled

first_level_nodes: the list of nodes to be crawled at the first level

top_node: the path of the topmost node of the nodes to be extracted, written in CSS/HTML or jQuery syntax

common_path: the base path; when an extracted value is a relative path, the content of this node is used to complete it into an absolute path

nodes: the set of nodes whose content is to be extracted

node: each node from which information is to be extracted

name: the name of the content to be extracted

select: the relative select path of the content to be extracted, with each level separated by $

attribute: the attribute of the content to be extracted

text: whether the text part of the node is to be extracted

seconde_level_node_list: the list of nodes for which a second-level crawl is to be performed

node: a list node for which a second crawl is to be performed, chosen from the list of nodes crawled at the first level

seconde_level_nodes: the list of second-level nodes to be crawled

top_seconde_level_node: the topmost node to be crawled at the second level

nodes: the set of second-level nodes to be crawled

node: a second-level node

name: the name of the node to be extracted

select: the select path of the node to be extracted

attribute: the attribute of the node to be extracted

text: the text part of the node to be extracted
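Two of the settings above lend themselves to a short illustration: the `$`-separated `select` path and the `common_path` used to complete relative links. The sketch below splits a `select` value into levels, joins it with `top_node` into a CSS descendant selector, and resolves relative paths with the standard `urllib.parse.urljoin`. Treating each level as a descendant step is an assumption, as are the helper names and the sample selector and URLs; the source only states that levels are separated by `$`.

```python
from typing import List
from urllib.parse import urljoin


def split_select(select: str) -> List[str]:
    """Split a $-separated select path into its levels."""
    return [level.strip() for level in select.split("$") if level.strip()]


def to_descendant_selector(top_node: str, select: str) -> str:
    """Combine top_node and the select levels into one CSS selector.

    Assumption: each level is a descendant of the previous one.
    """
    return " ".join([top_node.strip()] + split_select(select))


def complete_path(common_path: str, extracted: str) -> str:
    """Turn a relative link into an absolute one using common_path."""
    # urljoin leaves values that are already absolute URLs unchanged
    return urljoin(common_path, extracted)


selector = to_descendant_selector("div#main", "ul.threads$li$a")
link = complete_path("http://example.com/forum/", "thread/42.html")
```

The resulting selector string could then be handed to any CSS-selector-capable HTML parser; which parser the original system used is not stated in the source.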

(4) Set the database configuration:

1. local database configuration:

database_connect: whether to connect to the local database

hostname: the host of the database to connect to

database_name: the name of the database to connect to

userid: the user name for connecting to the database

password: the password for connecting to the database

table_name: the name of the table to write data to

2. remote database configuration:

remote_database_connect: whether to log in to the remote database

remote_hostname: the host of the remote database to log in to

remote_database_name: the name of the remote database to log in to

remote_userid: the user name used to log in to the remote database

remote_password: the password used to log in to the remote database

remote_table_name: the name of the specific table inside the remote database.
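The two configuration groups share the same shape, differing only in the `remote_` prefix, so choosing where to store results can be sketched as a single loop over both groups. The example is hypothetical in several respects: the function `select_targets`, the `true`/`false` string encoding of the connect flags, the flat dictionary holding the settings, and the sample values are assumptions. No database driver is used, since the source does not name the database product.

```python
from typing import Dict, List


def _enabled(value: str) -> bool:
    # Assumption: the connect flags are stored as "true"/"false" strings.
    return value.strip().lower() == "true"


def select_targets(cfg: Dict[str, str]) -> List[Dict[str, str]]:
    """Return connection settings for every database the config enables."""
    targets = []
    for prefix, flag in (("", "database_connect"),
                         ("remote_", "remote_database_connect")):
        if _enabled(cfg.get(flag, "false")):
            targets.append({
                "kind": "remote" if prefix else "local",
                "host": cfg[prefix + "hostname"],
                "database": cfg[prefix + "database_name"],
                "user": cfg[prefix + "userid"],
                "password": cfg[prefix + "password"],
                "table": cfg[prefix + "table_name"],
            })
    return targets
```

Returning a list lets the caller write to the local database, the remote database, both, or neither, depending on the two flags.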

Claims

1. A Web-domain distributed real-time data extraction system, characterized by comprising the following steps:

reading the information in the list of file names of the websites to be extracted;

according to that information, extracting data from the websites (forums, portal websites, etc.) at a certain frequency, the language used being CSS/HTML or jQuery format;

processing the extracted content once, including filling in missing values and applying the filter settings;

according to the configuration file, sending the processed content to a local or remote database.

2. The data integration method according to claim 1, wherein the list of file names of the websites to be crawled is written in construct.xml configuration files, and the configuration files of the websites to be crawled are all written one by one, as a list, in list.xml, in which site stores the path of each website's configuration file construct.xml.

3. The crawler frequency adjustment method according to claim 1, wherein it is set whether a loop mode is used and, if it is not used, the number of crawls and the rest time after each crawl are set.

4. The frequency adjustment module according to claim 2, wherein the settings include the start page, the step length of each crawl, the number of steps per round, the number of rounds, and the rest time required after each crawl.

5. The crawl preparation stage according to claim 2, comprising the encoding of the website to be crawled and the way in which the front segment of the URL, the back segment, and the page number are concatenated into the crawl link.

6. The content extraction rules according to claim 2, wherein top_node is set for the first-level and second-level content crawls, and the levels of the select path it represents are separated by $.

7. The content extraction rules according to claim 2, wherein each node includes four options, name, select, attribute, and text, wherein select gives the select path, which conforms to CSS/HTML or jQuery format, with the path of each level separated by $; attribute gives the name of the node attribute to be extracted; and text states whether the text is selected.

8. The content extraction rules according to claim 2, wherein the list of nodes for second-level extraction is contained in the second-level list node.

9. The method according to claim 3, wherein the content extracted from first-level and second-level nodes passes through a filtering mechanism comprising: the name of the node; the automatic fill value, used to fill in the value when it is missing; and the filter content, that is, the extracted value is filtered, with the content to be filtered out written in this field in regex form.

10. The database configuration rules according to claim 4, divided into two parts: one is the remote database configuration, including whether to log in to the remote database, the name of the remote database, the remote host, the remote user name, the remote password, and the remote table name; the other is the local database configuration, including whether to log in to the local database, the local host, the local database name, the user name, the password, and the table name.