CN101441629A - Automatic acquiring method of non-structured web page information - Google Patents
- Publication number
- CN101441629A
- Authority
- CN
- China
- Prior art keywords
- web page
- page information
- information
- acquiring method
- website
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for automatically acquiring non-structured web page information. The method comprises the following steps: (1) a spider acquisition computer system reads a website link table from a data storage device; (2) it is detected whether the website link table contains a website to be acquired; if not, acquisition ends; (3) if the detection result of step (2) is yes, a resolution rule corresponding to the website to be acquired is selected; (4) at least one thread is created, and the at least one thread resolves the page of the current website using the selected resolution rule; and (5) after resolution is finished, the web page information and acquisition state information that need to be kept are saved in the data storage device, and the method returns to step (2). Compared with the prior art, the method extracts non-structured information from the source pages of various websites and saves it in a structured database; the method can save considerable manpower and funds in information acquisition and integration.
Description
Technical field
The present invention relates to the technical field of computer networks, and in particular to an automatic acquiring method of non-structured web page information.
Background technology
At present, acquisition systems are widely used in industry portal websites, CIS, knowledge management systems, website content systems, scientific research, and other fields. An acquisition system is a system that extracts non-structured information from the source pages of various websites and saves it in a structured database.
Summary of the invention
The technical problem to be solved by the invention is to provide an automatic acquiring method of non-structured web page information that overcomes the defects of the above prior art.
The object of the invention can be achieved through the following technical solution: an automatic acquiring method of non-structured web page information, characterized in that it comprises the following steps:
1) a spider acquisition computer system reads a website link table from a data storage device;
2) it is detected whether the website link table contains a web address to be acquired; if not, acquisition ends;
3) if the detection result of step 2) is yes, a decomposition rule corresponding to the web address to be acquired is selected;
4) at least one thread is created, and the at least one thread decomposes the page at the current web address using the selected decomposition rule;
5) after decomposition is finished, the web page information and acquisition state information that need to be kept are saved to the data storage device, and the method returns to step 2).
The non-structured web page information comprises a title, a description, and pictures.
The decomposition rule uses regular expressions.
The spider acquisition computer system and the website system to be acquired communicate with each other by Uniform Resource Locator (URL) over the HTTP protocol.
Compared with the prior art, the invention extracts non-structured information from the source pages of various websites and saves it in a structured database; the method of the invention can save considerable manpower and funds in information acquisition and integration.
Description of drawings
Fig. 1 is a process flow diagram of the present invention;
Fig. 2 is a schematic diagram of the present invention.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings.
As shown in Figs. 1 and 2, an automatic acquiring method of non-structured web page information comprises the following steps:
1) a spider acquisition computer system reads a website link table from a data storage device;
2) it is detected whether the website link table contains a web address to be acquired; if not, acquisition ends;
3) if the detection result of step 2) is yes, a decomposition rule corresponding to the web address to be acquired is selected;
4) at least one thread is created, and the at least one thread decomposes the page at the current web address using the selected decomposition rule;
5) after decomposition is finished, the web page information and acquisition state information that need to be kept are saved to the data storage device, and the method returns to step 2).
The non-structured web page information comprises a title, a description, and pictures; the decomposition rule uses regular expressions; the spider acquisition computer system and the website system to be acquired communicate with each other by Uniform Resource Locator (URL) over the HTTP protocol.
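By way of illustration only, the loop formed by steps 1) to 5) may be sketched in Python as follows. The in-memory link table, the `RULES` mapping, and all function names are assumptions made for this sketch and do not appear in the original disclosure. The threads of step 4) are left out so that the control flow stays visible, and page download is passed in as a `fetch` callable.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory stand-in for the website link table:
# each entry holds a link and its acquisition state.
LinkTable = List[Dict[str, str]]

# Hypothetical rule sets keyed by a substring of the web address. The source
# only states that a rule matching the address is selected and that the rules
# are regular expressions.
RULES: Dict[str, Dict[str, str]] = {
    "example.com": {"title": r"<title>(.*?)</title>"},
}


def next_pending(table: LinkTable) -> Optional[Dict[str, str]]:
    """Step 2: return the next link still waiting to be acquired, if any."""
    for row in table:
        if row["state"] == "pending":
            return row
    return None


def select_rules(url: str) -> Dict[str, str]:
    """Step 3: pick the rule set whose key occurs in the web address."""
    for key, rules in RULES.items():
        if key in url:
            return rules
    return {}


def collect(table: LinkTable, fetch: Callable[[str], str]) -> List[Dict[str, str]]:
    """Steps 2 to 5 as one loop over the link table (read in step 1)."""
    saved: List[Dict[str, str]] = []
    while (row := next_pending(table)) is not None:
        rules = select_rules(row["url"])
        html = fetch(row["url"])
        record = {"url": row["url"]}
        for field, pattern in rules.items():
            match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
            if match:
                record[field] = match.group(1).strip()
        saved.append(record)   # step 5: keep the extracted page information
        row["state"] = "done"  # step 5: keep the acquisition state
    return saved
```

The loop ends when `next_pending` finds no remaining link, which corresponds to the "if not, acquisition ends" branch of step 2).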
A spider acquisition computer system is set up at the network server end, together with a data storage device that stores the collected data. Data are acquired from each website over the network; the spider acquisition computer system and the website system to be acquired communicate with each other by Uniform Resource Locator (URL) over the Hypertext Transfer Protocol (HTTP).
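As a minimal sketch of this URL-based communication, a page can be downloaded with Python's standard `urllib.request` module. The function name `download_page`, the `User-Agent` string, the timeout, and the UTF-8 fallback are assumptions of this sketch; the source does not specify how the download is implemented.

```python
import urllib.request


def download_page(url: str, timeout: float = 10.0) -> str:
    """Fetch one page by its URL and return the page source as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "spider-sketch/0.1"}  # hypothetical agent name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `download_page("http://example.com/")` would return the page source that is later decomposed; error handling and retries are omitted here.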
Wherein:
The data storage device stores data: the spider's foreground acquisition program obtains from it the list of links to be acquired, and the corresponding acquisition state is also updated there.
The spider acquisition computer system processes each link: it downloads the page the link points to, determines which set of decomposition rules to apply, and performs the actual decomposition.
In this embodiment, the spider acquisition system first reads in the acquisition link list (source list) stored in advance in the storage system; each entry in the list holds a link to be acquired and its current state. When acquisition begins, threads are started according to the configured number of threads and the configured maximum number of pictures to save; each thread then determines which set of decomposition rules to use from the link address currently being acquired. Next, the spider acquisition system downloads the corresponding page source and decomposes it according to the selected decomposition rules. This embodiment uses regular expressions to extract the attributes that need to be saved, such as the title, the description, and the picture list. This loop continues until the last link has been decomposed. During decomposition, any pictures found are saved in a corresponding directory (under the directory where the spider acquisition system resides), and the collected data are saved to the storage system.
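The rule selection and decomposition described above might look like the following Python sketch. The rule sets, the host name `news.example.com`, the patterns themselves, and the `max_pictures` parameter name are all hypothetical; the source only states that a rule set is chosen from the link address, that regular expressions extract the title, description, and picture list, and that a maximum number of pictures is configured.

```python
import re
from typing import Any, Dict, Pattern
from urllib.parse import urljoin, urlparse

# Hypothetical per-site rule sets, keyed by host name.
RULE_SETS: Dict[str, Dict[str, Pattern[str]]] = {
    "news.example.com": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S),
        "description": re.compile(
            r'<meta\s+name="description"\s+content="([^"]*)"', re.I
        ),
    },
}

# Hypothetical fallback used when no site-specific rule set exists.
DEFAULT_RULES: Dict[str, Pattern[str]] = {
    "title": re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S),
    "description": re.compile(
        r'<meta\s+name="description"\s+content="([^"]*)"', re.I
    ),
}

IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"', re.I)


def select_rules(url: str) -> Dict[str, Pattern[str]]:
    """Choose the rule set from the host part of the link address."""
    host = urlparse(url).hostname or ""
    return RULE_SETS.get(host, DEFAULT_RULES)


def decompose(url: str, html: str, max_pictures: int = 5) -> Dict[str, Any]:
    """Extract title, description, and picture list from one page source."""
    record: Dict[str, Any] = {"url": url}
    for field, pattern in select_rules(url).items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    # Resolve relative picture addresses against the page address and
    # keep at most max_pictures of them.
    pictures = [urljoin(url, src) for src in IMG_SRC.findall(html)]
    record["pictures"] = pictures[:max_pictures]
    return record
```

Downloading the listed pictures into a directory is left out; the sketch stops at producing the picture list.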
The method above uses multithreaded execution, because the volume of data to acquire is large and acquisition takes a long time. Each thread sends signals through delegates to report updates, and the main interface then updates each element value and the statistics it displays according to the parameters received. After a thread has decomposed one link, it calls a function in the main program to obtain the next link address; if there is one, the thread goes on to decompose it, and if not, the thread stops. This continues until the last link has been processed.
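The worker pattern in this paragraph can be sketched with Python's `threading` module. The class `LinkSource`, the function `run_workers`, and the `on_update` callback (standing in for the delegate that notifies the main interface) are names invented for this sketch; the original embodiment does not give its implementation.

```python
import threading
from collections import deque
from typing import Any, Callable, Iterable, Optional


class LinkSource:
    """Stands in for the main-program function that hands out the next link."""

    def __init__(self, links: Iterable[str]) -> None:
        self._links = deque(links)
        self._lock = threading.Lock()

    def next_link(self) -> Optional[str]:
        # The lock ensures that no two threads receive the same link.
        with self._lock:
            return self._links.popleft() if self._links else None


def run_workers(
    links: Iterable[str],
    process: Callable[[str], Any],
    on_update: Callable[[str, Any], None],
    thread_count: int = 4,
) -> None:
    """Start thread_count workers; each stops when no link remains."""
    source = LinkSource(links)

    def worker() -> None:
        while True:
            url = source.next_link()
            if url is None:
                return  # no next link: the current thread stops
            result = process(url)
            on_update(url, result)  # report progress, as the delegate would

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Because `on_update` is called from the worker threads, any state it touches needs its own synchronization; a graphical interface would typically marshal the call back to its main thread.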
The method above uses regular expressions to extract attributes, because regular expressions simplify many string-handling problems.
The storage device must hold the website links that need to be acquired; these can be entered by a separate recording program or brought in by a separate import program.
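One possible shape for such a storage device is a small SQLite database, sketched below with Python's standard `sqlite3` module. The table names, column names, and state values (`pending`, `done`) are assumptions of this sketch; the source does not name a database product or schema.

```python
import sqlite3
from typing import Optional


def init_store(conn: sqlite3.Connection) -> None:
    """Create the link table and the table of extracted page information."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS links ("
            " url TEXT PRIMARY KEY,"
            " state TEXT NOT NULL DEFAULT 'pending')"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " url TEXT PRIMARY KEY, title TEXT, description TEXT)"
        )


def add_link(conn: sqlite3.Connection, url: str) -> None:
    """What a recording or import program would call to register a link."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (url,))


def next_link(conn: sqlite3.Connection) -> Optional[str]:
    """Return the oldest link that has not been acquired yet, if any."""
    row = conn.execute(
        "SELECT url FROM links WHERE state = 'pending' ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def save_result(
    conn: sqlite3.Connection,
    url: str,
    title: Optional[str],
    description: Optional[str],
) -> None:
    """Save the page information and update the acquisition state together."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, title, description)"
            " VALUES (?, ?, ?)",
            (url, title, description),
        )
        conn.execute("UPDATE links SET state = 'done' WHERE url = ?", (url,))
```

Writing the page record and the state change in one transaction keeps the link table and the collected data consistent if the program stops midway.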
This embodiment can start several threads to process these links simultaneously; each thread, on finishing a decomposition, goes on to process subsequent links until none remain.
Claims (4)
1. An automatic acquiring method of non-structured web page information, characterized in that it comprises the following steps:
1) a spider acquisition computer system reads a website link table from a data storage device;
2) it is detected whether the website link table contains a web address to be acquired; if not, acquisition ends;
3) if the detection result of step 2) is yes, a decomposition rule corresponding to the web address to be acquired is selected;
4) at least one thread is created, and the at least one thread decomposes the page at the current web address using the selected decomposition rule;
5) after decomposition is finished, the web page information and acquisition state information that need to be kept are saved to the data storage device, and the method returns to step 2).
2. The automatic acquiring method of non-structured web page information according to claim 1, characterized in that the non-structured web page information comprises a title, a description, and pictures.
3. The automatic acquiring method of non-structured web page information according to claim 1, characterized in that the decomposition rule uses regular expressions.
4. The automatic acquiring method of non-structured web page information according to claim 1, characterized in that the spider acquisition computer system and the website system to be acquired communicate with each other by Uniform Resource Locator (URL) over the HTTP protocol.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2007101706017A CN101441629A (en) | 2007-11-19 | 2007-11-19 | Automatic acquiring method of non-structured web page information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2007101706017A CN101441629A (en) | 2007-11-19 | 2007-11-19 | Automatic acquiring method of non-structured web page information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101441629A true CN101441629A (en) | 2009-05-27 |
Family
ID=40726067
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2007101706017A Pending CN101441629A (en) | 2007-11-19 | 2007-11-19 | Automatic acquiring method of non-structured web page information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101441629A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102096705A (en) * | 2010-12-31 | 2011-06-15 | 南威软件股份有限公司 | Article acquisition method |
CN102118400A (en) * | 2009-12-31 | 2011-07-06 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
CN102298588A (en) * | 2010-06-25 | 2011-12-28 | 株式会社理光 | Method and device for extracting object from non-structured document |
CN102915373A (en) * | 2012-11-06 | 2013-02-06 | 无锡江南计算技术研究所 | Data storage method and device |
CN102955795A (en) * | 2011-08-24 | 2013-03-06 | 句容今太科技园有限公司 | Web information acquisition system |
CN104462140A (en) * | 2013-09-24 | 2015-03-25 | 北大方正集团有限公司 | Webpage data collecting method and device |
CN104462566A (en) * | 2014-12-26 | 2015-03-25 | 中科宇图天下科技有限公司 | Environmental protection information grid capturing method |
CN104484485A (en) * | 2015-01-09 | 2015-04-01 | 携程计算机技术(上海)有限公司 | Ticket number information extracting and outputting method and system and ticket discharging system |
CN104965904A (en) * | 2015-06-30 | 2015-10-07 | 北京奇虎科技有限公司 | Multi-platform data grabbing method and apparatus |
CN105468664A (en) * | 2015-05-12 | 2016-04-06 | 北京众标网络科技有限公司 | Information acquisition method and apparatus |
CN105825399A (en) * | 2016-03-15 | 2016-08-03 | 焦点科技股份有限公司 | Internet based B2B e-commerce information collecting method |
CN107918674A (en) * | 2017-12-12 | 2018-04-17 | 携程旅游网络技术(上海)有限公司 | Acquisition method and its system, storage medium, the electronic equipment of web data |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102118400A (en) * | 2009-12-31 | 2011-07-06 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
CN102118400B (en) * | 2009-12-31 | 2013-07-17 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
CN102298588A (en) * | 2010-06-25 | 2011-12-28 | 株式会社理光 | Method and device for extracting object from non-structured document |
CN102298588B (en) * | 2010-06-25 | 2014-04-30 | 株式会社理光 | Method and device for extracting object from non-structured document |
CN102096705A (en) * | 2010-12-31 | 2011-06-15 | 南威软件股份有限公司 | Article acquisition method |
CN102955795A (en) * | 2011-08-24 | 2013-03-06 | 句容今太科技园有限公司 | Web information acquisition system |
CN102915373B (en) * | 2012-11-06 | 2016-08-10 | 无锡江南计算技术研究所 | Data storage method and device |
CN102915373A (en) * | 2012-11-06 | 2013-02-06 | 无锡江南计算技术研究所 | Data storage method and device |
CN104462140A (en) * | 2013-09-24 | 2015-03-25 | 北大方正集团有限公司 | Webpage data collecting method and device |
CN104462566A (en) * | 2014-12-26 | 2015-03-25 | 中科宇图天下科技有限公司 | Environmental protection information grid capturing method |
CN104462566B (en) * | 2014-12-26 | 2017-11-21 | 中科宇图天下科技有限公司 | A kind of environmental protection information grid grasping means |
CN104484485A (en) * | 2015-01-09 | 2015-04-01 | 携程计算机技术(上海)有限公司 | Ticket number information extracting and outputting method and system and ticket discharging system |
CN104484485B (en) * | 2015-01-09 | 2019-03-12 | 上海携程商务有限公司 | Draft bank information extraction deriving method, system and ticket system |
CN105468664A (en) * | 2015-05-12 | 2016-04-06 | 北京众标网络科技有限公司 | Information acquisition method and apparatus |
CN104965904A (en) * | 2015-06-30 | 2015-10-07 | 北京奇虎科技有限公司 | Multi-platform data grabbing method and apparatus |
CN104965904B (en) * | 2015-06-30 | 2019-06-04 | 北京奇虎科技有限公司 | A kind of grasping means of multi-platform data and device |
CN105825399A (en) * | 2016-03-15 | 2016-08-03 | 焦点科技股份有限公司 | Internet based B2B e-commerce information collecting method |
CN107918674A (en) * | 2017-12-12 | 2018-04-17 | 携程旅游网络技术(上海)有限公司 | Acquisition method and its system, storage medium, the electronic equipment of web data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101441629A (en) | Automatic acquiring method of non-structured web page information | |
CN107895009B (en) | Distributed internet data acquisition method and system | |
CN101651707B (en) | Method for automatically acquiring user behavior log of network | |
CN101370024B (en) | Distributed information collection method and system | |
CN105243159A (en) | Visual script editor-based distributed web crawler system | |
CN102662966B (en) | Method and system for obtaining subject-oriented dynamic page content | |
CN104050281A (en) | Webpage information extraction method and device based on http protocol | |
US8972374B2 (en) | Content acquisition system and method of implementation | |
CN102760151B (en) | Implementation method of open source software acquisition and searching system | |
CN109376291B (en) | Website fingerprint information scanning method and device based on web crawler | |
CN102521232B (en) | Distributed acquisition and processing system and method of internet metadata | |
CN103942210A (en) | Processing method, device and system of mass log information | |
CN102663062A (en) | Method and device for processing invalid links in search result | |
CN107957940B (en) | Test log processing method, system and terminal | |
CN104317948A (en) | Page data capturing method and system | |
CN111859076B (en) | Data crawling method, device, computer equipment and computer readable storage medium | |
CN105335516A (en) | Construction method of universal acquisition system | |
CN104038821A (en) | Method for uniformly gathering fault information of each functional module of Android television | |
CN105808417A (en) | Automated testing method and proxy server | |
CN103823907A (en) | Method, device and engine for integrating on-line video resource addresses | |
CN104391978A (en) | Method and device for storing and processing web pages of browsers | |
KR20170073693A (en) | Extracting similar group elements | |
CN104572934A (en) | Webpage key content extracting method based on DOM | |
CN106547803B (en) | Method and device for crawling incremental resources of website | |
CN104317857A (en) | House information acquisition service system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Open date: 20090527 |