CN105988994A - Web field distributed real time extraction system - Google Patents

Publication number
CN105988994A
Authority
CN
China
Prior art keywords
extraction
website
node
name
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410842253.3A
Other languages
Chinese (zh)
Inventor
刘挺
孟小峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Consulting Data Technology Co Ltd
Original Assignee
Beijing Consulting Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Consulting Data Technology Co Ltd
Priority to CN201410842253.3A
Publication of CN105988994A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a novel deep data mining method for extracting data from a variety of websites. The method extracts data from designated websites automatically, in a targeted manner, and around the clock; the extraction rules are written in css/html or jquery format. An automation module extracts data from the websites stored in a list at a set frequency, and can either run a set number of extraction rounds or loop indefinitely. A distributed crawler module extracts data from each website at a set frequency and supports multi-level extraction. The extracted content passes through a filtering layer, which fills in missing values and filters the extracted values; finally, the result is stored in a local database or a remote database according to the configured conditions. The Web field distributed real-time extraction system thereby achieves distributed, round-the-clock automatic data extraction and integration.

Description

Distributed real-time extraction system for the Web field
Technical field
The present invention relates to a Deep Web data integration method and system, and in particular to a distributed data integration method and system.
Background
Traditional search engines such as baidu and google can only retrieve shallow-level data. A large amount of data currently exists on the network inside websites such as portal websites, forums, mhkcs and the like; this data is referred to as the Deep Web and cannot be retrieved by traditional search engines. To overcome the inability of traditional search engines to mine data in depth, the invention provides a method that extracts the required data from inside each website, integrates it into the operator's own website, and offers it to users for retrieval, making online queries more convenient.
Summary of the invention
The invention provides a distributed system and method for automatically extracting and integrating data from major websites. The invention includes two modules.
The first module is the automation module:
1. crawl website list module
2. crawl frequency setting module
The second module is the distributed web crawler module:
1. frequency setting module
2. crawl preparation module
3. crawl content module
3.1 content filtering module
3.2 first-level content crawl module
3.3 second-level crawl list
3.4 second-level content crawl module
4. database configuration module
4.1 local database configuration module
4.2 remote database configuration module
The data extraction method provided by the present invention includes the following steps:
(1) Database configuration: configure the databases of the host side and of the distributed crawler side, so that the database server on the host side accepts remote data storage and updates from the distributed crawler machines.
(2) The program executes the automation module:
1. the program reads the crawl website list and extracts from it the list of configuration files of the websites to be crawled;
2. the program reads the frequency settings, including whether to crawl in a permanent loop and, if not, the number of crawl rounds and the rest interval between rounds;
3. the program crawls the websites in the list one at a time;
4. after all websites in the list have been crawled, the program rests for a period of time and then crawls the websites in the list again;
5. the repetition described in step 4 may continue without stopping, or may stop after the set number of rounds;
(3) The program executes the distributed crawler module:
1. the program reads the configuration of a single website, including the crawl frequency, crawl preparation, crawl content, and database configuration;
2. according to the crawl frequency module, the program reads the start page, the step size of each round, the number of steps in each round, the number of rounds, and the pause time within each round;
3. the program reads the crawl preparation module and, simulating a user, sends a request to the website over http or https;
4. the program reads the crawl content settings and processes the returned information once, mainly to crawl the first-level data;
5. the program reads the crawl content settings and crawls the second-level data of the website;
6. the program screens the crawled content, including filling in missing values and filtering out redundant content;
7. according to the database configuration file, the program saves the content in the local database or in the server-side database;
8. the program rests for a period of time and then carries out the next crawl according to the step count and step size in the frequency configuration;
9. when the step count reaches the predetermined value, the program crawls the website again according to the frequency in the frequency configuration, starting from the page at which the previous round ended.
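The automation steps in (2) can be sketched as a simple loop. This is a minimal illustration in Python, not the patent's implementation: the function name `run_automation`, the `crawl_site` callback, and the injectable `sleep` parameter are assumptions made for the example. Only the setting names `for_ever_run`, `query_frequence`, and `sleep_time` are taken from the configuration described in this document.

```python
import time


def run_automation(sites, crawl_site, for_ever_run=False,
                   query_frequence=1, sleep_time=0.0, sleep=time.sleep):
    """Crawl every site in the list, rest, and repeat.

    `crawl_site` is a hypothetical callback that crawls one website.
    Returns the number of completed rounds (never returns if for_ever_run).
    """
    sites = list(sites)
    rounds_done = 0
    # Loop forever, or stop after the configured number of rounds.
    while for_ever_run or rounds_done < query_frequence:
        for site in sites:
            crawl_site(site)       # step 3: crawl the sites one at a time
        rounds_done += 1
        sleep(sleep_time)          # step 4: rest before the next pass
    return rounds_done
```

Passing `sleep` as a parameter is only a convenience so that the loop can be exercised without real waiting.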
The beneficial effect of the invention is that the required data inside each website can be extracted and integrated into the operator's own website, where it is offered to users for retrieval, making online queries more convenient.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the data integration system according to the present invention.
Fig. 2 is the architecture diagram of the automation module of the present invention.
Fig. 3 is the flow chart of the automation module of the present invention.
Fig. 4 is a diagram of the internal modules of the distributed crawler of the present invention.
Fig. 5 is the flow chart of the distributed crawler of the present invention.
Detailed description of the invention
One, database configuration module:
(1) On the server-side database server, add a command that allows the IP address of the computer hosting the distributed crawler, under a specific user name of its own, to write and update data remotely on that database server.
(2) On the crawler-side computer, in the remote database login module, set the remote server to log in to: the IP address of its computer and the database port, the name of the database, the user name (that is, the specific user name permitted by the server in item (1)), the password, and the name of the table to log in to.
(3) On the crawler-side computer, in the local database login module, set the IP address of the local computer and the database port, the name of the database, the user name (usually Root), the password, and the name of the table to log in to.
Two, automation settings module:
Set the automation list file list.xml:
1. site: the file name and location of the configuration file of each website to be crawled
2. for_ever_run: whether to crawl in a permanent loop
3. query_frequence: the number of crawl rounds, if not crawling in a permanent loop
4. sleep_time: the sleep time after each complete pass over the websites in the list
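As an illustration of how such a list.xml might be read, the following Python sketch parses the four settings with the standard library. The document names the settings but does not show the file layout, so the XML structure in `SAMPLE_LIST_XML` (a `list` root with one child element per setting), the example paths, and the function name `load_automation_config` are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the patent names the settings, not the XML structure.
SAMPLE_LIST_XML = """
<list>
  <site>sites/forum_a/construct.xml</site>
  <site>sites/portal_b/construct.xml</site>
  <for_ever_run>false</for_ever_run>
  <query_frequence>3</query_frequence>
  <sleep_time>60</sleep_time>
</list>
"""


def load_automation_config(xml_text):
    """Parse the automation settings into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        # One <site> element per website configuration file.
        "sites": [el.text.strip() for el in root.findall("site")],
        "for_ever_run":
            root.findtext("for_ever_run", "false").strip().lower() == "true",
        "query_frequence": int(root.findtext("query_frequence", "1")),
        "sleep_time": float(root.findtext("sleep_time", "0")),
    }
```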
Three, distributed crawler module:
(1) Set the crawl frequency of the crawler:
1. start_page: the first page the crawler crawls
2. query_frequence: the frequency at which pages are crawled
3. query_step_interval: the page interval to advance to the next page after one page has been crawled
4. query_step_times: the number of steps advanced in each round
5. sleep_time: the rest time after each round
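One possible reading of these settings is sketched below in Python: each round visits query_step_times pages, advancing by query_step_interval each time, and the next round continues where the previous one stopped. Treating query_frequence as the number of rounds is an interpretation based on the claims, and the function name `page_rounds` is hypothetical.

```python
def page_rounds(start_page, query_step_interval, query_step_times,
                query_frequence):
    """Yield, for each round, the list of page numbers to crawl.

    Assumption: query_frequence is the number of rounds, and a new round
    starts at the page following the last one crawled.
    """
    page = start_page
    for _ in range(query_frequence):
        round_pages = []
        for _ in range(query_step_times):
            round_pages.append(page)
            page += query_step_interval   # stride to the next page
        yield round_pages                 # caller rests sleep_time here
```

A caller would crawl the pages of one round, sleep for sleep_time, and then ask the generator for the next round.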
(2) Set the preparation stage of the crawler:
1. charset: the character encoding of the website to be crawled
2. query_page_front: the front part of the crawl link, not including the page number
3. query_page_end: the rear part of the crawl link, not including the page number
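The preparation stage amounts to splicing the link from its front part, the page number, and its rear part, and then decoding the response with the configured charset. The Python sketch below shows this with the standard library only; the function names, the example URL, and the User-Agent string are placeholders chosen for illustration, and no request is actually sent.

```python
from urllib.request import Request


def build_page_url(query_page_front, page, query_page_end):
    """Splice front part + page number + rear part into the crawl link."""
    return f"{query_page_front}{page}{query_page_end}"


def prepare_request(url):
    """Build an http/https request that looks like a browser request.

    The User-Agent value is a made-up placeholder, not from the patent.
    """
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (example)"})


def decode_body(raw, charset):
    """Decode the raw response bytes using the configured charset."""
    return raw.decode(charset, errors="replace")
```

The prepared request could then be passed to `urllib.request.urlopen`, which is omitted here so that the example stays offline.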
(3) Set the extraction content of the crawler:
1. filter_list: the filter list
  node: each node to be written to the database table
    name: the name under which the value is put into the table
    default_value: the value used to fill in when the extracted value is missing
    regex: a regular expression applied to the extracted value; content that matches it is filtered out.
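The filter_list settings can be pictured as a small clean-up pass over each extracted record. The Python sketch below is an illustration under assumptions: records and rules are represented as dictionaries, "missing" is taken to mean absent or blank, and the function name `apply_filters` is hypothetical.

```python
import re


def apply_filters(record, filter_list):
    """Return a cleaned record containing only the nodes in filter_list.

    Each rule is a dict with the keys name, default_value and regex,
    mirroring the settings described in the text.
    """
    cleaned = {}
    for rule in filter_list:
        value = record.get(rule["name"])
        if value is None or value.strip() == "":
            # Missing value: fill in the configured default.
            value = rule.get("default_value", "")
        elif rule.get("regex"):
            # Remove every part of the value that matches the pattern.
            value = re.sub(rule["regex"], "", value).strip()
        cleaned[rule["name"]] = value
    return cleaned
```

Fields of the record that have no rule are dropped, which matches the description of node as "each node to be written to the database table".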
2. query_level_nodes: the list of nodes whose content is to be crawled
  first_level_nodes: the list of nodes to crawl at the first level
    top_node: the path of the topmost node of the content to be extracted, written in css/html or jquery syntax
    common_path: the base path; when an extracted value is a relative path, the content of this node is used to complete it into an absolute path
    nodes: the set of nodes whose content is to be extracted
      node: each node from which information is extracted
        name: the name of the content to extract
        select: the relative select path of the content to extract, with each level separated by $
        attribute: the attribute of the node to extract
        text: whether the text part of the node is to be extracted
  seconde_level_node_list: the list of nodes that require a second-level crawl
    node: a list node that requires a second crawl, chosen from the nodes crawled at the first level
  seconde_level_nodes: the list of second-level nodes to crawl
    top_seconde_level_node: the topmost node to crawl at the second level
    nodes: the set of second-level nodes to crawl
      node: a second-level node
        name: the name of the node to extract
        select: the select path of the node to extract
        attribute: the attribute of the node to extract
        text: the text part of the node to extract
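To make the node settings concrete, the following Python sketch extracts first-level records from a page using $-separated select paths and completes relative links with common_path. It is a simplified stand-in, not the patent's selector engine: it only understands levels of the form `tag` or `tag.class`, it parses the page with `xml.etree.ElementTree` and therefore needs well-formed markup, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def level_to_xpath(level):
    """Translate one level ("tag" or "tag.class") into an ElementTree path."""
    tag, _, cls = level.partition(".")
    step = ".//" + (tag or "*")
    if cls:
        step += f"[@class='{cls}']"
    return step


def select_all(root, select):
    """Follow a $-separated select path and return the matching elements."""
    nodes = [root]
    for level in select.split("$"):
        step = level_to_xpath(level.strip())
        nodes = [hit for node in nodes for hit in node.findall(step)]
    return nodes


def extract_records(page, top_node, nodes):
    """Build one record per top_node match from the configured nodes."""
    root = ET.fromstring(page.strip())
    records = []
    for top in select_all(root, top_node):
        record = {}
        for spec in nodes:
            hits = select_all(top, spec["select"])
            if not hits:
                record[spec["name"]] = None          # left for the filter layer
            elif spec.get("text"):
                record[spec["name"]] = "".join(hits[0].itertext()).strip()
            else:
                record[spec["name"]] = hits[0].get(spec.get("attribute", ""))
        records.append(record)
    return records


def second_level_urls(records, name, common_path):
    """Complete relative links with common_path for the second-level crawl."""
    return [urljoin(common_path, r[name]) for r in records if r.get(name)]
```

The URLs returned by `second_level_urls` would be fetched in turn and passed through the same extraction with the second-level node settings.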
(4) Set the database configuration:
1. Local database configuration:
  database_connect: whether to connect to the local database
  hostname: the host of the database
  database_name: the name of the database
  userid: the user name for the database
  password: the password for the database
  table_name: the name of the table to write data to
2. Remote database configuration:
  remote_database_connect: whether to log in to the remote database
  remote_hostname: the host of the remote database
  remote_database_name: the database name of the remote database
  remote_userid: the user name used to log in to the remote database
  remote_password: the password used to log in to the remote database
  remote_table_name: the name of the specific table inside the remote database
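The final storage step can be illustrated as writing the cleaned records into the table named by table_name. The document does not name a database product, so this Python sketch uses the standard-library `sqlite3` module purely as a stand-in; a real deployment against a database server would use that server's client library with hostname, userid, and password. The function name `save_records` is hypothetical.

```python
import re
import sqlite3

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def save_records(conn, table_name, records):
    """Insert records (a list of dicts with identical keys) into a table.

    sqlite3 is a stand-in for the configured local or remote database.
    Returns the number of rows written.
    """
    if not records:
        return 0
    columns = list(records[0].keys())
    # Table and column names cannot be bound as parameters, so check them.
    for ident in [table_name, *columns]:
        if not _IDENTIFIER.fullmatch(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    col_sql = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table_name} ({col_sql})")
    conn.executemany(
        f"INSERT INTO {table_name} ({col_sql}) VALUES ({placeholders})",
        [tuple(rec[col] for col in columns) for rec in records],
    )
    conn.commit()
    return len(records)
```

Whether the local target, the remote target, or both are written would depend on database_connect and remote_database_connect; that switch is left out of the sketch.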

Claims (10)

1. A Web field distributed real-time data extraction system, characterised in that it comprises the following steps:
reading the information in the list of names of the websites to be extracted;
according to that information, extracting data from the websites (forums, portal websites, etc.) at a set frequency, the extraction language being in css/html or jquery format;
processing the extracted content once, including filling in missing values and applying the configured filters;
according to the configuration file, sending the processed content to a local or remote database.
2. The data integration method according to claim 1, wherein the names of the websites to be crawled are written in the construct.xml configuration file, and the configuration files of the websites to be crawled are written one by one, as a list, in list.xml, in which site stores the path of each website's configuration file construct.xml.
3. The crawler frequency adjustment method according to claim 1, wherein a loop mode may be used and, if it is not used, the number of crawl rounds and the rest time after each round are set.
4. The frequency adjustment module according to claim 2, wherein the settings include the start page, the step size of each crawl, the number of steps per round, the number of rounds, and the rest time required after each crawl.
5. The crawl preparation stage according to claim 2, wherein the character encoding of the crawled website is set, and the front segment of the crawl URL, the page number, and the rear segment are spliced into the crawl link.
6. The content extraction rules according to claim 2, wherein top_node is set for the crawl of first-level and second-level content, and the levels of the select path it represents are separated by $.
7. The content extraction rules according to claim 2, wherein each node includes four options, name, select, attribute, and text; select indicates the select path, which conforms to css/html or jquery format, with the path of each level separated by $; attribute holds the name of the attribute to be extracted from the node; and text states whether the text is selected.
8. The content extraction rules according to claim 2, wherein the list of nodes for second-level extraction is placed inside the second-level list node.
9. The method according to claim 3, wherein the content extracted from the first-level and second-level nodes passes through a filter mechanism comprising the name of the node, an automatic fill value used to fill the value when it is missing, and a content filter that filters the extracted value, the values to be filtered being written in this field in regex form.
10. The database configuration rules according to claim 4, divided into two parts: the remote database configuration, including whether to log in to the remote database, the remote database name, the remote host, the remote user name, the remote password, and the remote table name; and the local database configuration, including whether to log in to the local database, the local host, the local database name, the user name, the password, and the table name.
CN201410842253.3A 2015-02-06 2015-02-06 Web field distributed real time extraction system Pending CN105988994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410842253.3A CN105988994A (en) 2015-02-06 2015-02-06 Web field distributed real time extraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410842253.3A CN105988994A (en) 2015-02-06 2015-02-06 Web field distributed real time extraction system

Publications (1)

Publication Number Publication Date
CN105988994A true CN105988994A (en) 2016-10-05

Family

ID=57035491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410842253.3A Pending CN105988994A (en) 2015-02-06 2015-02-06 Web field distributed real time extraction system

Country Status (1)

Country Link
CN (1) CN105988994A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778750A (en) * 2017-01-17 2017-05-31 深圳市海云天科技股份有限公司 A kind of answering card identification point accurate positioning method and system
CN110309118A (en) * 2018-03-06 2019-10-08 北京询达数据科技有限公司 A kind of design method of depth network data excavation robot
CN110309467A (en) * 2018-03-25 2019-10-08 北京询达数据科技有限公司 A kind of design method of Full-automatic deep Web Mining machine
CN113392298A (en) * 2020-03-14 2021-09-14 天津挺哥网络科技有限公司 Design method of fixed-point darknet information mining analysis system based on xml database

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006058075A2 (en) * 2004-11-22 2006-06-01 Truveo, Inc. Method and apparatus for an application crawler
CN101251852A (en) * 2008-01-11 2008-08-27 孟小峰 Integrating system and method of Web data facing to field
CN101840432A (en) * 2010-05-11 2010-09-22 同方知网(北京)技术有限公司 Data mining device based on Deep Web deep dynamic data and method thereof
CN104317845A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Method and system for automatic extraction of deep web data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778750A (en) * 2017-01-17 2017-05-31 深圳市海云天科技股份有限公司 A kind of answering card identification point accurate positioning method and system
CN106778750B (en) * 2017-01-17 2020-02-04 深圳市海云天科技股份有限公司 Method and system for accurately positioning answer sheet identification points
CN110309118A (en) * 2018-03-06 2019-10-08 北京询达数据科技有限公司 A kind of design method of depth network data excavation robot
CN110309467A (en) * 2018-03-25 2019-10-08 北京询达数据科技有限公司 A kind of design method of Full-automatic deep Web Mining machine
CN113392298A (en) * 2020-03-14 2021-09-14 天津挺哥网络科技有限公司 Design method of fixed-point darknet information mining analysis system based on xml database

Similar Documents

Publication Publication Date Title
CN103823824B (en) A kind of method and system that text classification corpus is built automatically by the Internet
CN103475687B (en) Distributed method and system for download site data
CN105988994A (en) Web field distributed real time extraction system
US6810414B1 (en) System and methods for easy-to-use periodic network data capture engine with automatic target data location, extraction and storage
US7882099B2 (en) System and method for focused re-crawling of web sites
CN105320740B (en) The acquisition methods and acquisition system of wechat article and public platform
CN103294732B (en) Webpage capture method and reptile
CN103546326B (en) Website traffic statistic method
CN102567407B (en) Method and system for collecting forum reply increment
CN101251852B (en) Integrating system and method of Web data facing to field
CN102760151A (en) Implementation method of open source software acquisition and searching system
CN106126648A (en) A kind of based on the distributed merchandise news reptile method redo log
CN103685228B (en) Website vulnerability rapid scanning method and device
CN108052632A (en) A kind of method for obtaining network information, system and company information search system
CN108197312A (en) Obtain source of houses data method, device, equipment and readable storage medium storing program for executing
CN106503223A (en) A kind of binding site and the online source of houses searching method and device of key word information
CN106354843A (en) Web crawler system and method
AU2006244114A1 (en) Processing information from selected sources via a single website
CN103258017A (en) Method and system for parallel square crossing network data collection
CN103853770B (en) The method and system of model content in a kind of extraction forum Web pages
CN108062468B (en) Network crawler method based on picture identifying code identification
CN105760545A (en) Configuration rule based website data search method
CN110287603A (en) A kind of circuit auxiliary design method and device based on CBB circuit library
CN108205548A (en) A kind of Web Spider structure and its method of work based on agriculture webpage information acquisition
CN108984629A (en) A kind of model training method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161005
