CN103838786A - Web data automatic collecting method - Google Patents

Info

Publication number
CN103838786A
Authority
CN
China
Prior art keywords
web
robot
document
search
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210490953.1A
Other languages
Chinese (zh)
Inventor
苏晓华
李勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Original Assignee
DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN201210490953.1A
Publication of CN103838786A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatic Web data collection that uses network robot technology and web page data extraction technology. The network robot technology comprises designing the network robot workflow, formulating the network robot design principles, a depth-first search strategy, a breadth-first search strategy, handling of network traps, balanced access, and hyperlink extraction. The web page data extraction technology comprises extracting the plain text of a web page and analyzing and processing special characters in the text. By making full use of both technologies, the method collects valuable data from massive amounts of information for analysis and research, providing a basis for the various decisions of an enterprise. It addresses a problem faced by data collection personnel and market researchers, broadens the usability of the Web, and contributes to the development of data collection, especially automatic data collection.

Description

A method for automatic Web data collection
Technical field
The present invention relates to data collection technology, and in particular to a method for automatic Web data collection.
Background
As network resources grow richer and the volume of online information keeps expanding, people depend on the network more and more, yet it has become inconvenient for users to quickly find the specific resources they need in the vast sea of network resources. Information has always been valuable. With the arrival of the information age, every industry is flooded with information, and the value of information lies in the circulation of data: only when data can circulate and be transmitted in a timely manner can information realize its true value. Under market economy conditions, data collection has become an important tool and means.
How to collect valuable data from massive amounts of information and analyze it, so as to form a basis for the various decisions of an enterprise, is the problem that data collection personnel and market researchers face. Quickly finding the information and services one needs in a large volume of data is becoming more and more difficult, and users often lose sight of their target or obtain skewed results when querying. Data produces value only after it has been collected, integrated, and analyzed; scattered information is merely news and cannot embody real commercial value. Enterprises and information analysts must on the one hand filter the valuable points out of a large amount of information and on the other hand reduce the cost of obtaining it, so that the practical value of the information exceeds the cost of collecting and analyzing it and the information adds value to enterprise decisions.
The spread of the Internet and the development of information technology have produced a large amount of information resources. Extracting useful resources from this mass of information is a problem that urgently needs to be solved. The main information expressed by a Web page is usually hidden among a large amount of irrelevant structure and text, so users cannot quickly obtain the subject information, which limits the usability of the Web. Automatic Web collection helps to address this problem. Automatic collection saves time and effort and covers a wide range of information, although the quality of the extracted information can be low, which affects precision. Most data collection work now uses automatic collection, and automatic collection technology arose against this background.
Summary of the invention
In view of the above problems, the present invention provides a method for automatic Web data collection that applies network robot technology and web page data extraction technology.
The technical means of the present invention are as follows:
A method for automatic Web data collection, characterized by comprising the following steps:
A. Network robot technology:
A1. Design the network robot workflow: the robot uses one URL or a group of URLs as its browsing starting point and accesses the corresponding WWW documents, the WWW documents being HTML documents;
A2. Formulate the network robot design principles;
A21. Formulate the robot exclusion standard: create a robot text file on the server, and state in this text file which links of the website may not be accessed and which robots the website refuses;
A22. Formulate the robot META tag: the user adds a META tag to a page; this META tag lets the owner of the page specify whether robot programs are allowed to index the page or extract links from it;
A3. Depth-first search strategy and breadth-first search strategy;
A31. The depth-first search strategy starts from the start node: after the first document is analyzed, the page pointed to by its first link is fetched; after that page is analyzed, the document pointed to by its first link is fetched in turn; this is repeated until a document containing no hyperlinks is reached, which defines one complete chain; the search then returns to an earlier document and continues with the remaining hyperlinks in that document; the search ends when all hyperlinks have been searched;
A32. The breadth-first search strategy, after the first document is analyzed, searches all hyperlinks in that Web page and then continues with the next level, until the search of the bottom level is complete;
A4. Network traps;
A41. Before a new URL is accessed, it is compared with the URLs in the to-be-searched and already-searched URL lists; the comparison is made between URL objects; a URL contained in neither list is added to the to-be-searched URL list, so that the robot avoids falling into a network trap;
A42. When hyperlinks are extracted from a Web document, all URLs that carry parameters are ignored;
A43. Limit the robot's search depth: each step into a next-level sublink counts as reaching a new search depth, and the search stops going deeper once the threshold search depth is reached; alternatively, set a maximum duration for accessing a Web server: timing starts when the robot accesses the first web page of that Web server, and once the maximum duration has elapsed, the robot program crawling the server immediately disconnects all connections to that server;
A5. Balanced access: set a maximum number of threads that may access one Web server, and use waiting to limit how often a robot program or process accesses a particular server or network segment; each time a robot program or process obtains a document from a Web site, it waits a certain interval before accessing that Web site again, the length of the wait being determined by the processing capacity of the site and the communication capacity of the network; the time T1 of the next access to the Web site is the current time T2 plus the time required to access the site, where the time required to access the site is the network transfer time T3 multiplied by a set coefficient;
A6. Hyperlink extraction: while obtaining URL links, the robot program continues to collect data from the Web source documents that the obtained links point to, and the Web source documents are converted to character-stream form;
B. Web page data extraction technology;
B1. Plain text extraction: the obtained HTML source file is filtered to delete the tag markup and extract the text; after the web page data is filtered, its character format is unified;
B2. Analyze and process the special characters in the text.
By adopting the above technical solution, the method for automatic Web data collection provided by the present invention makes full use of network robot technology and web page data extraction technology to form an automatic Web collection method. It collects valuable data from massive amounts of information for analysis and research, forms a basis for the various decisions of an enterprise, and addresses the problem faced by data collection personnel and market researchers. It also broadens the usability of the Web and contributes to the development of data collection, especially automatic data collection.
Brief description of the drawings
Fig. 1 is a workflow diagram of the network robot of the present invention;
Fig. 2 is a workflow diagram of the HTML web page plain text extraction of the present invention.
Detailed description
A network robot is a software program that uses the hyperlinks in Web documents to recursively access new documents. The automatic collection mechanism uses search software called a network robot, or Robot, to automatically collect websites and web pages according to certain rules and add them to an index database.
The method for automatic Web data collection, as shown in Fig. 1 and Fig. 2, comprises the following steps:
A. Network robot technology:
A1. First design the basic workflow of the network robot: the Robot takes one URL or a group of URLs as its browsing starting point and accesses the corresponding WWW documents, which are generally HTML documents;
A2. Formulate the design principles;
A21. Robots Exclusion Standard: a robots.txt file is created on the server to state which links of the site may not be accessed and which Robots the site refuses;
A22. Robots META tag: users can add a META tag to their own pages; the Robots META tag lets the owner of a page specify whether Robot programs may index the page or extract links from it;
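As a minimal sketch of how a Robot might honor the two conventions in A21 and A22, the following Python example uses only the standard library. The robots.txt content, the agent names MyCrawler and BadBot, and the helper names are illustrative assumptions and are not taken from the patent.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real Robot would fetch it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_robots_parser(robots_txt):
    """Parse robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class MetaRobotsParser(HTMLParser):
    """Collect the directives of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def meta_robots_directives(html):
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


robots = build_robots_parser(ROBOTS_TXT)
page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
directives = meta_robots_directives(page)
may_index = "noindex" not in directives
may_follow = "nofollow" not in directives
```

In this sketch a Robot would call `robots.can_fetch(agent, url)` before each request, and would consult `may_index` and `may_follow` after downloading a page.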
A3. Depth-first search strategy and breadth-first search strategy;
A31. The depth-first search strategy starts from the start node: after the first document is analyzed, the page pointed to by its first link is fetched; that page is then analyzed and the document pointed to by its first link is fetched in turn; this is repeated until a document containing no hyperlinks is reached, which counts as one complete chain; the search then returns to an earlier document and continues with the other hyperlinks in that document; the search ends when no hyperlinks remain to be searched;
A32. The breadth-first search strategy, after the first document is analyzed, first searches all hyperlinks in that Web page and then continues with the next level, down to the bottom level;
At present, the way Web pages are organized within a website directly determines which strategy the designer prefers. Because the robot works through a list of URLs, the choice of search strategy comes down to whether the implementation treats the to-be-searched list as a queue or as a stack. If it is treated as a queue, new hyperlinks are added at the tail and taken from the head, which gives a breadth-first traversal; if it is treated as a stack, new hyperlinks are added at the head and taken from the head, which gives a depth-first traversal;
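The queue-versus-stack observation can be sketched in a few lines of Python. The link graph below is a hypothetical in-memory stand-in for real pages, and the function name is illustrative; the only difference between the two traversals is the end of the frontier at which new hyperlinks are added.

```python
from collections import deque

# Hypothetical link graph: page -> hyperlinks in document order.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def crawl_order(start, links, breadth_first=True):
    """Return the order in which pages are visited.

    Pages are always taken from the head of the frontier. Adding new
    hyperlinks at the tail makes the frontier a queue (breadth-first);
    adding them at the head makes it a stack (depth-first).
    """
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        new_links = []
        for url in links.get(page, []):
            if url not in seen:
                seen.add(url)
                new_links.append(url)
        if breadth_first:
            frontier.extend(new_links)
        else:
            # Reverse so that the first link of the page ends up at the head.
            frontier.extendleft(reversed(new_links))
    return order


bfs_order = crawl_order("A", LINKS, breadth_first=True)
dfs_order = crawl_order("A", LINKS, breadth_first=False)
```

Keeping a single frontier structure and switching only the insertion end makes it easy to change strategy per website without touching the rest of the Robot.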
A4. Network traps;
A41. Before a new URL is accessed, it should first be compared with the URLs in the to-be-searched and already-searched URL lists; only a brand-new URL is added to the to-be-searched URL list, which keeps the Robot from falling into a network trap. In an implementation, this comparison should be made between URL objects rather than between strings, to avoid the problem of several different URL strings corresponding to the same host;
A42. When hyperlinks are extracted from a Web document, all URLs that carry parameters are ignored;
A43. In an actual Robot search, the depth of the search must be limited. Each step into a next-level sublink counts as reaching a new depth, and once the specified threshold depth is reached the search stops going deeper. Alternatively, a maximum duration for accessing a Web server can be set: timing starts when the Robot accesses the first web page of that Web server, and once the maximum duration has elapsed, the Robot program crawling the server immediately disconnects all connections to that Web server;
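The three safeguards in A41 to A43 can be sketched together in Python. The normalization rules shown (lowercase host, default port removed, fragment dropped) are one assumed way of comparing URLs as objects rather than as raw strings; the link graph, the example.com addresses, and the function names are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url):
    """Reduce different spellings of the same URL to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def has_parameters(url):
    """True if the URL carries a query string (A42 skips such URLs)."""
    return bool(urlsplit(url).query)


def collect(start, links, max_depth):
    """Breadth-first visit of an in-memory link graph with a depth limit."""
    first = normalize(start)
    seen = {first}
    frontier = deque([(first, 0)])
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # threshold depth reached: do not follow sublinks
        for link in links.get(url, []):
            if has_parameters(link):
                continue
            key = normalize(link)
            if key not in seen:
                seen.add(key)
                frontier.append((key, depth + 1))
    return visited


# Hypothetical site; keys are already in normalized form.
SITE = {
    "http://example.com/": [
        "http://EXAMPLE.com/a.html",
        "http://example.com:80/a.html#top",
        "http://example.com/list.php?page=2",
    ],
    "http://example.com/a.html": ["http://example.com/", "http://example.com/b.html"],
    "http://example.com/b.html": ["http://example.com/c.html"],
}

visited = collect("http://example.com", SITE, max_depth=2)
```

The time-based limit in A43 is not shown; it could be added by recording the time of the first request to each server and skipping that server once the maximum duration has elapsed.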
A5. Balanced access. A Web server should be accessed with only a few threads. When the program is designed, a maximum number of threads allowed to access one Web server is specified, which limits the number of threads accessing any one Web server. In addition, the frequency with which a Robot program or process accesses a particular server or network segment must be limited; the basic method is to wait. Each time a Robot program or process obtains a document from a Web site, it must wait a certain interval before accessing that Web site again. The length of the wait is generally determined by the processing capacity of the site and the communication capacity of the network. A common design is that the time T1 of the next access to the Web site equals the current time T2 plus the time required to access the site, where the time required is mainly the network transfer time T3 multiplied by a set coefficient, the "good-guy factor", that is: T1 = T2 + T3 * good-guy factor;
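The waiting rule T1 = T2 + T3 * good-guy factor can be illustrated with a small per-host scheduler. This is a sketch under stated assumptions: the class name, the default factor of 10.0, and the host names are hypothetical, and times are passed in explicitly so that the arithmetic is easy to check.

```python
def next_access_time(current_time, transfer_time, good_guy_factor):
    """T1 = T2 + T3 * good-guy factor, with all times in seconds."""
    return current_time + transfer_time * good_guy_factor


class PolitenessScheduler:
    """Track, per host, the earliest time at which the next request may be sent."""

    def __init__(self, good_guy_factor=10.0):
        # The default value is an illustrative assumption, not a recommendation.
        self.good_guy_factor = good_guy_factor
        self._next_allowed = {}

    def record_fetch(self, host, finished_at, transfer_time):
        """Call after each download with the time it finished and how long it took."""
        self._next_allowed[host] = next_access_time(
            finished_at, transfer_time, self.good_guy_factor
        )

    def wait_needed(self, host, now):
        """Seconds the Robot should still wait before contacting the host."""
        return max(0.0, self._next_allowed.get(host, now) - now)


scheduler = PolitenessScheduler(good_guy_factor=4.0)
# A document from example.com finished downloading at t=100.0 and took 0.5 s.
scheduler.record_fetch("example.com", finished_at=100.0, transfer_time=0.5)
wait_at_101 = scheduler.wait_needed("example.com", now=101.0)
```

Because the wait scales with the measured transfer time, a slow or heavily loaded site is automatically contacted less often than a fast one.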
A6. Hyperlink extraction. This step concentrates on extracting text hyperlinks. The syntax of a text hyperlink in an HTML document is as follows:
<A HREF="hyperlink URL address">hyperlink display text</A>
The goal of hyperlink extraction is to obtain the hyperlink URL address. A simple search procedure first converts all characters of the HTML source file to a single case, then locates the "HREF" marker that follows an "<A" marker in the document, analyzes the link that follows it, and keeps only links that are in a web page format such as ".htm", ".html", ".shtml", ".jsp", ".asp", or ".php" and carry no parameters. This is repeated until every "HREF" marker following an "<A" marker in the document has been processed. While obtaining URL links, the Robot program keeps collecting data from the Web source documents that the obtained links point to, in order to obtain more Web links and data. In an implementation, the source document should be converted to character-stream form so that it can be displayed more accurately;
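The filtering rule for extracted links can be sketched as follows. Note that this example swaps in Python's standard html.parser for the string search on "<A" and "HREF" described in the text; the parser matches tag and attribute names case-insensitively, so the source does not have to be converted to a single case and URL paths keep their original spelling. The class and function names and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Web page formats named in the text.
PAGE_EXTENSIONS = (".htm", ".html", ".shtml", ".jsp", ".asp", ".php")


class LinkExtractor(HTMLParser):
    """Collect HREF values of <A> tags that look like parameter-free web pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlsplit(href)
        if parts.query:
            return  # ignore links that carry parameters
        if parts.path.lower().endswith(PAGE_EXTENSIONS):
            self.links.append(href)


def extract_links(html):
    extractor = LinkExtractor()
    extractor.feed(html)
    return extractor.links


SAMPLE = (
    '<A HREF="news/index.html">News</A> '
    '<a href="list.php?id=3">List</a> '
    '<a href="logo.png">Logo</a> '
    '<a href="About.ASP">About</a> '
    '<a name="top">Top</a>'
)
links = extract_links(SAMPLE)
```

Relative links such as `news/index.html` would still need to be resolved against the page's own URL, for example with `urllib.parse.urljoin`, before being queued.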
B. Web page data extraction technology: this largely determines the efficiency and quality of information collection;
B1. Plain text extraction. The obtained HTML source file is first filtered to remove the tag markup and extract the text. In an implementation, all "<" and ">" markers in the HTML source file can be handled as follows: locate the position of a "<" marker, then locate the position of the next ">" marker after it, and remove the string between the two positions; or locate the position of a ">" marker, then the position of the next "<" marker after it, and accumulate the string between the two positions. Script code has the same characteristics as the text described above, so care must be taken to exclude it when extracting text. One way to exclude it is, when parsing the HTML, to skip directly to the </script> end tag whenever a <script> start tag is encountered and then continue parsing after it. Another way is to extract it provisionally as ordinary text and then judge whether it is script code; if it is, it is not collected. When the text of a web page is stored, a separator should be added between the separate pieces of text. In practice, tags need to be divided into two classes according to their meaning: tags that divide text and tags that do not. The latter class includes <A>, <B>, <I>, <EM>, <T2>, <BIG>, <SUB>, <SMALL>, <STRONG>, <STRIKE>, <BR>, and so on. Such tags do not separate text semantically, so when one of them appears between two pieces of text, the two pieces should be treated as continuous. After the web page data is filtered, its character format is unified;
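A minimal sketch of plain text extraction with dividing and non-dividing tags is given below. It uses Python's standard html.parser instead of scanning for "<" and ">" by hand, excludes script content, and inserts a separator wherever a dividing tag occurs. The non-dividing tag set is taken from the tags that are legible in the text; the class name, the newline separator, and the sample page are assumptions.

```python
from html.parser import HTMLParser

# Tags treated as not dividing text, following the list in the text.
NON_DIVIDING_TAGS = {"a", "b", "i", "em", "big", "sub", "small", "strong", "strike", "br"}
SEPARATOR = "\n"  # assumed separator between pieces of text


class TextExtractor(HTMLParser):
    """Extract plain text, merging text across non-dividing tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._current = []
        self._in_script = False

    def flush(self):
        """Close the current piece of text, collapsing runs of whitespace."""
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag not in NON_DIVIDING_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif tag not in NON_DIVIDING_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._in_script:  # script code is never collected
            self._current.append(data)


def extract_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    extractor.flush()
    return SEPARATOR.join(extractor.blocks)


PAGE = (
    "<html><body><h1>Title</h1>"
    "<p>Hello <b>bold</b> world</p>"
    '<script>document.write("hidden");</script>'
    "<div>End</div></body></html>"
)
text = extract_text(PAGE)
```

Here `<b>` does not split "Hello bold world", while `<h1>`, `<p>`, and `<div>` each start a new piece of text.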
B2. Processing special characters. Some special characters appear in the text, and since text is the main object to be processed, these special characters must be analyzed and processed before the text itself. For example, "&copy; All rights reserved &copy;" in an HTML document is displayed in a browser as "© All rights reserved ©". Parsing HTML with a high-level programming language may produce garbled characters, so the special characters must be resolved by the program itself.
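As a short illustration of resolving special characters, Python's standard `html.unescape` converts named and numeric character references into the characters a browser would display. The helper name and the choice to replace non-breaking spaces with ordinary spaces are assumptions made for this sketch.

```python
import html


def resolve_special_characters(text):
    """Convert HTML character references into displayable characters."""
    resolved = html.unescape(text)
    # Assumed cleanup step: treat non-breaking spaces as ordinary spaces.
    return resolved.replace("\u00a0", " ")


notice = resolve_special_characters("&copy; All rights reserved &copy;")
mixed = resolve_special_characters("A&nbsp;&amp;&nbsp;B &lt;tag&gt; &#169;")
```

Resolving references after tag removal, rather than before, avoids turning an escaped `&lt;` into a "<" that a later filtering pass could mistake for markup.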
The method for automatic Web data collection provided by the present invention makes full use of network robot technology and web page data extraction technology to form an automatic Web collection method. It collects valuable data from massive amounts of information for analysis and research, forms a basis for the various decisions of an enterprise, and addresses the problem faced by data collection personnel and market researchers. It also broadens the usability of the Web and contributes to the development of data collection, especially automatic data collection.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (1)

1. A method for automatic Web data collection, characterized by comprising the following steps:
A. Network robot technology:
A1. Design the network robot workflow: the robot uses one URL or a group of URLs as its browsing starting point and accesses the corresponding WWW documents, the WWW documents being HTML documents;
A2. Formulate the network robot design principles;
A21. Formulate the robot exclusion standard: create a robot text file on the server, and state in this text file which links of the website may not be accessed and which robots the website refuses;
A22. Formulate the robot META tag: the user adds a META tag to a page; this META tag lets the owner of the page specify whether robot programs are allowed to index the page or extract links from it;
A3. Depth-first search strategy and breadth-first search strategy;
A31. The depth-first search strategy starts from the start node: after the first document is analyzed, the page pointed to by its first link is fetched; after that page is analyzed, the document pointed to by its first link is fetched in turn; this is repeated until a document containing no hyperlinks is reached, which defines one complete chain; the search then returns to an earlier document and continues with the remaining hyperlinks in that document; the search ends when all hyperlinks have been searched;
A32. The breadth-first search strategy, after the first document is analyzed, searches all hyperlinks in that Web page and then continues with the next level, until the search of the bottom level is complete;
A4. Network traps;
A41. Before a new URL is accessed, it is compared with the URLs in the to-be-searched and already-searched URL lists; the comparison is made between URL objects; a URL contained in neither list is added to the to-be-searched URL list, so that the robot avoids falling into a network trap;
A42. When hyperlinks are extracted from a Web document, all URLs that carry parameters are ignored;
A43. Limit the robot's search depth: each step into a next-level sublink counts as reaching a new search depth, and the search stops going deeper once the threshold search depth is reached; alternatively, set a maximum duration for accessing a Web server: timing starts when the robot accesses the first web page of that Web server, and once the maximum duration has elapsed, the robot program crawling the server immediately disconnects all connections to that server;
A5. Balanced access: set a maximum number of threads that may access one Web server, and use waiting to limit how often a robot program or process accesses a particular server or network segment; each time a robot program or process obtains a document from a Web site, it waits a certain interval before accessing that Web site again, the length of the wait being determined by the processing capacity of the site and the communication capacity of the network; the time T1 of the next access to the Web site is the current time T2 plus the time required to access the site, where the time required to access the site is the network transfer time T3 multiplied by a set coefficient;
A6. Hyperlink extraction: while obtaining URL links, the robot program continues to collect data from the Web source documents that the obtained links point to, and the Web source documents are converted to character-stream form;
B. Web page data extraction technology;
B1. Plain text extraction: the obtained HTML source file is filtered to delete the tag markup and extract the text; after the web page data is filtered, its character format is unified;
B2. Analyze and process the special characters in the text.
CN201210490953.1A 2012-11-27 2012-11-27 Web data automatic collecting method Pending CN103838786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210490953.1A CN103838786A (en) 2012-11-27 2012-11-27 Web data automatic collecting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210490953.1A CN103838786A (en) 2012-11-27 2012-11-27 Web data automatic collecting method

Publications (1)

Publication Number Publication Date
CN103838786A true CN103838786A (en) 2014-06-04

Family

ID=50802295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210490953.1A Pending CN103838786A (en) 2012-11-27 2012-11-27 Web data automatic collecting method

Country Status (1)

Country Link
CN (1) CN103838786A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361061A (en) * 2014-11-03 2015-02-18 烽火通信科技股份有限公司 WEB page information sensing and collecting method
CN105607895A (en) * 2014-11-21 2016-05-25 阿里巴巴集团控股有限公司 Operation method and device of application program on the basis of application program programming interface
CN106385345A (en) * 2016-09-23 2017-02-08 北京锐安科技有限公司 Method and apparatus for acquiring network data
CN113157730A (en) * 2021-04-26 2021-07-23 中国人民解放军军事科学院国防科技创新研究院 Civil-military fusion policy information system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404666A (en) * 2008-10-06 2009-04-08 赵洪宇 Infinite layer collection method based on Web page
JP2012168844A (en) * 2011-02-16 2012-09-06 Yahoo Japan Corp Retrieval suggestion device and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘峰: ""通用中英文专业搜索引擎技术的研究及应用"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361061A (en) * 2014-11-03 2015-02-18 烽火通信科技股份有限公司 WEB page information sensing and collecting method
CN104361061B (en) * 2014-11-03 2018-02-16 南京烽火星空通信发展有限公司 A kind of WEB page information Perception acquisition method
CN105607895A (en) * 2014-11-21 2016-05-25 阿里巴巴集团控股有限公司 Operation method and device of application program on the basis of application program programming interface
CN105607895B (en) * 2014-11-21 2021-03-02 阿里巴巴集团控股有限公司 Application program operation method and device based on application program programming interface
CN106385345A (en) * 2016-09-23 2017-02-08 北京锐安科技有限公司 Method and apparatus for acquiring network data
CN113157730A (en) * 2021-04-26 2021-07-23 中国人民解放军军事科学院国防科技创新研究院 Civil-military fusion policy information system

Similar Documents

Publication Publication Date Title
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
Chakrabarti et al. Focused crawling: a new approach to topic-specific Web resource discovery
CN101231661B (en) Method and system for digging object grade knowledge
Patil Swati et al. Search engine optimization: A study
Yu et al. Summary of web crawler technology research
US20080168041A1 (en) System and method for focused re-crawling of web sites
CN103927397B (en) Recognition method for Web page link blocks based on block tree
CN101630327A (en) Design method of theme network crawler system
CN103226578A (en) Method for identifying websites and finely classifying web pages in medical field
CN103838785A (en) Vertical search engine in patent field
CN102591992A (en) Webpage classification identifying system and method based on vertical search and focused crawler technology
CN101908071A (en) Method and device thereof for improving search efficiency of search engine
Prajapati A survey paper on hyperlink-induced topic search (HITS) algorithms for web mining
CN101630330A (en) Method for webpage classification
CN103838786A (en) Web data automatic collecting method
Liu et al. Topical Web Crawling for Domain-Specific Resource Discovery Enhanced by Selectively using Link-Context.
Priyatam et al. Domain specific search in indian languages
CN112597370A (en) Webpage information autonomous collecting and screening system with specified demand range
Cheng et al. Efficient focused crawling strategy using combination of link structure and content similarity
CN108959576A A kind of network crawler system and method based on Party school's research work theme
Brown et al. ILAS: Intrinsic landscape assessment system for landscape design and planning in the national capital region
Hati et al. Improved focused crawling approach for retrieving relevant pages based on block partitioning
Ma et al. Searching Tourism Information by Using Vertical Search Engine Based on Nutch and Solr
Smith Does metadata count? A Webometric investigation
Pembe et al. Heading-based sectional hierarchy identification for HTML documents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140604