CN101441629A - Automatic acquiring method of non-structured web page information - Google Patents

Info

Publication number
CN101441629A
Authority
CN
China
Prior art keywords
web page
page information
information
acquiring method
website
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007101706017A
Other languages
Chinese (zh)
Inventor
金骏
戴斌华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XINNA ADVERTISEMENT MEDIA CO Ltd SHANGHAI
Original Assignee
XINNA ADVERTISEMENT MEDIA CO Ltd SHANGHAI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2007-11-19
Filing date
2007-11-19
Publication date
2009-05-27
Application filed by XINNA ADVERTISEMENT MEDIA CO Ltd SHANGHAI
Priority to CNA2007101706017A
Publication of CN101441629A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for automatically acquiring non-structured web page information. The method comprises the following steps: (1) a spider (crawler) acquisition computer system reads a website link table from a data storage device; (2) the system checks whether the website link table contains a URL still to be acquired, and if not, acquisition ends; (3) if the result of step (2) is yes, the decomposition rule corresponding to the URL to be acquired is selected; (4) at least one thread is created, and this at least one thread decomposes (parses) the page at the current URL using the selected decomposition rule; and (5) after decomposition is complete, the web page information to be kept and the acquisition state information are saved to the data storage device, and the method returns to step (2). Compared with the prior art, the method extracts non-structured information from the source pages of various websites and saves it in a structured database; this can save a large amount of manpower and money in information acquisition and integration.

Description

An automatic acquiring method of non-structured web page information
Technical field
The present invention relates to the technical field of computer networks, and in particular to an automatic acquiring method of non-structured web page information.
Background technology
At present, acquisition systems are widely used in fields such as industry portal websites, CIS, knowledge management systems, website content systems, and scientific research. An acquisition system is a system that extracts non-structured information from the source pages of various websites and saves it in a structured database.
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic acquiring method of non-structured web page information that overcomes the defects of the above prior art.
The purpose of the present invention can be achieved through the following technical solution: an automatic acquiring method of non-structured web page information, characterized in that it comprises the following steps:
1) a spider (crawler) acquisition computer system reads a website link table from a data storage device;
2) the system detects whether the website link table contains a URL to be acquired; if not, acquisition ends;
3) if the detection result of step 2) is yes, the decomposition rule corresponding to the URL to be acquired is selected;
4) at least one thread is created, and this at least one thread decomposes (parses) the page at the current URL using the selected decomposition rule;
5) after decomposition is complete, the web page information to be kept and the acquisition state information are saved to the data storage device, and the method returns to step 2).
The non-structured web page information comprises a title, a description, and pictures.
The decomposition rule uses regular expressions.
The spider acquisition computer system and the website system to be acquired communicate with each other by means of Uniform Resource Locators (URLs) over the HTTP protocol.
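The five steps above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the implementation described by the applicant: `acquire_all`, `link_table`, `rules`, and the injected `fetch` callable are hypothetical names, an in-memory list stands in for the data storage device, rules are looked up by host name (the text does not say how a rule is matched to a URL), and step 4 runs inline rather than in a separate thread.

```python
import re
from urllib.parse import urlparse


def acquire_all(link_table, rules, fetch):
    """Run steps 1-5 over an in-memory link table (illustrative only).

    link_table: list of dicts with "url" and "state" keys (hypothetical schema).
    rules: maps a host name to a dict of field name -> compiled regex.
    fetch: callable that takes a URL and returns the page source as text.
    """
    # Step 1: the caller has already read link_table from storage.
    saved = []
    while True:
        # Step 2: look for a URL that still has to be acquired.
        entry = next((e for e in link_table if e["state"] == "pending"), None)
        if entry is None:
            break  # nothing left, so acquisition ends
        # Step 3: select the decomposition rule for this URL (assumed: by host).
        rule = rules.get(urlparse(entry["url"]).hostname)
        if rule is None:
            entry["state"] = "no_rule"
            continue
        # Step 4: decompose the page with the selected rule.
        html = fetch(entry["url"])
        record = {"url": entry["url"]}
        for field, pattern in rule.items():
            match = pattern.search(html)
            record[field] = match.group(1).strip() if match else None
        # Step 5: save the page information and the acquisition state,
        # then go back to step 2.
        saved.append(record)
        entry["state"] = "done"
    return saved


# Example rule table; "a.example" is a placeholder host.
EXAMPLE_RULES = {
    "a.example": {"title": re.compile(r"<title>(.*?)</title>", re.I | re.S)},
}
```

Passing `fetch` as a parameter keeps the loop testable without network access; a real run would pass a function that downloads the page over HTTP.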
Compared with the prior art, the present invention extracts non-structured information from the source pages of various websites and saves it in a structured database. The method of the present invention can save a large amount of manpower and money in information acquisition and integration.
Description of drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the present invention.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
As shown in Figs. 1 and 2, an automatic acquiring method of non-structured web page information comprises the following steps:
1) a spider (crawler) acquisition computer system reads a website link table from a data storage device;
2) the system detects whether the website link table contains a URL to be acquired; if not, acquisition ends;
3) if the detection result of step 2) is yes, the decomposition rule corresponding to the URL to be acquired is selected;
4) at least one thread is created, and this at least one thread decomposes (parses) the page at the current URL using the selected decomposition rule;
5) after decomposition is complete, the web page information to be kept and the acquisition state information are saved to the data storage device, and the method returns to step 2).
The non-structured web page information comprises a title, a description, and pictures; the decomposition rule uses regular expressions; the spider acquisition computer system and the website system to be acquired communicate with each other by means of Uniform Resource Locators (URLs) over the HTTP protocol.
A spider acquisition computer system is set up on the network server side, together with a data storage device that saves the acquired data. Data is acquired from each website over the network: the spider acquisition computer system and each website system to be acquired communicate with each other by means of Uniform Resource Locators (URLs) over the Hypertext Transfer Protocol (HTTP).
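As an illustration of the URL-over-HTTP communication described above, the following sketch downloads one page with Python's standard `urllib.request` module. The function name `fetch_page`, the timeout, and the fallback charset are assumptions made for this example; the source does not say which HTTP client or encoding handling is used.

```python
import urllib.request


def fetch_page(url, timeout=10.0, default_charset="utf-8"):
    """Download the page at `url` and return its source as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    # Replace undecodable bytes instead of failing on a badly encoded page.
    return raw.decode(charset, errors="replace")
```

A call such as `fetch_page("http://www.example.com/")` (a placeholder address) would return the page source that the decomposition step then works on.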
Wherein:
The data storage device is used to store data. The spider foreground acquisition program obtains from it the list of links to be acquired, and the corresponding acquisition state is also updated there.
The spider acquisition computer system is used to process each link: it downloads the page that the link points to, determines which set of decomposition rules to apply, and carries out the actual decomposition.
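The role of the data storage device (holding the link list, handing out the next link, and recording the acquisition state and the acquired data) can be sketched with SQLite from the Python standard library. The table names, column names, and state values below are hypothetical; the source does not specify a database product or schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the two illustrative tables if missing."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS links (
            url   TEXT PRIMARY KEY,
            state TEXT NOT NULL DEFAULT 'pending'
        );
        CREATE TABLE IF NOT EXISTS pages (
            url         TEXT PRIMARY KEY,
            title       TEXT,
            description TEXT
        );
    """)
    return conn


def add_links(conn, urls):
    """Import links to be acquired; URLs already present are skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO links (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def next_pending(conn):
    """Return the oldest link that is still pending, or None if none is left."""
    row = conn.execute(
        "SELECT url FROM links WHERE state = 'pending' ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def save_result(conn, url, title, description):
    """Store the acquired data and update the link's acquisition state."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, description) VALUES (?, ?, ?)",
        (url, title, description),
    )
    conn.execute("UPDATE links SET state = 'done' WHERE url = ?", (url,))
    conn.commit()
```

Keeping the state in the same store as the link list means an interrupted run can resume from the first link still marked as pending.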
In the present embodiment, the spider acquisition system first reads in the acquisition link list (source list) saved in advance in the storage system. Each entry in the list holds a link to be acquired and its current state. When acquisition begins, threads are started according to the configured number of threads and the configured maximum number of pictures to save; each thread then determines which set of decomposition rules to use from the address of the link it is currently acquiring. Next, the spider acquisition system downloads the source code of the corresponding page and decomposes it according to the chosen decomposition rule. The present embodiment uses regular expressions to extract the attributes that need to be saved, such as the title, the description, and the picture list. This cycle repeats until the last link has been decomposed. During decomposition, any pictures found are saved in a corresponding directory (under the directory where the spider acquisition system resides), and the acquired data is saved to the storage system.
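Two details of this paragraph lend themselves to a short sketch: choosing a set of decomposition rules from the link address, and limiting the number of pictures saved per page. The matching policy (exact host or parent domain) and the names `select_rule_set` and `cap_pictures` are assumptions for illustration; the source only states that the rule set depends on the link address and that a maximum picture count is configured.

```python
from urllib.parse import urljoin, urlparse


def select_rule_set(url, rule_sets):
    """Pick the rule set whose domain matches the host of `url`.

    rule_sets maps a domain such as "news.example" to any rule object.
    Returns None when no domain matches.
    """
    host = urlparse(url).hostname or ""
    for domain, rule_set in rule_sets.items():
        if host == domain or host.endswith("." + domain):
            return rule_set
    return None


def cap_pictures(srcs, page_url, max_pictures):
    """Resolve picture addresses against the page URL and keep at most
    `max_pictures` distinct ones, in the order they appear."""
    urls = []
    for src in srcs:
        if len(urls) >= max_pictures:
            break
        absolute = urljoin(page_url, src)
        if absolute not in urls:
            urls.append(absolute)
    return urls
```

Resolving relative addresses before saving matters because picture links in page source are often relative to the page they appear on.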
The above method uses multithreaded execution, because the amount of data to acquire is large and acquisition takes a long time. Each thread sends signals through delegates to report updates, and the main interface then updates each element value and the statistics shown in the interface according to the parameters received. After a thread has decomposed a link, it calls a function in the main program to obtain the next link address; if there is one, the thread decomposes it, and if there is none, the current thread stops. This continues until the last link.
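The worker pattern described here, where each thread asks a shared function for the next link and stops when none is left, can be sketched with Python's `threading` module. `LinkDispenser`, `run_workers`, and the `on_update` callback (standing in for the delegate that notifies the main interface) are hypothetical names; the source does not name a language or threading library.

```python
import threading


class LinkDispenser:
    """Hands out link addresses one at a time; safe to call from many threads."""

    def __init__(self, links):
        self._links = iter(links)
        self._lock = threading.Lock()

    def next_link(self):
        # The lock ensures that no two threads receive the same link.
        with self._lock:
            return next(self._links, None)


def run_workers(links, decompose, thread_count=4, on_update=None):
    """Decompose every link using `thread_count` threads; return url -> result."""
    dispenser = LinkDispenser(links)
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            url = dispenser.next_link()
            if url is None:
                return  # no further link, so this thread stops
            data = decompose(url)
            with results_lock:
                results[url] = data
                done = len(results)
            if on_update is not None:
                on_update(url, done)  # report progress to the caller

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because the threads pull work rather than receiving a fixed share up front, a slow page delays only the thread that is handling it.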
The above method uses regular expressions to extract attributes, because regular expressions simplify many string-processing problems.
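A minimal sketch of regular-expression extraction for the three attributes named in the text (title, description, picture list) might look as follows. The patterns are illustrative assumptions, not the applicant's rules: they expect a `<title>` element, a `<meta name="description" content="...">` tag with double-quoted attributes in that order, and quoted `src` attributes on `<img>` tags. Real sites would need one tuned rule set each.

```python
import re
from html import unescape

# Illustrative patterns; each target site would need its own set.
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)
DESCRIPTION = re.compile(
    r'<meta\s+name\s*=\s*"description"\s+content\s*=\s*"([^"]*)"', re.I
)
IMAGES = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.I)


def decompose(html):
    """Extract title, description, and picture addresses from page source."""

    def first(pattern):
        match = pattern.search(html)
        # Decode entities such as &amp; after trimming whitespace.
        return unescape(match.group(1).strip()) if match else None

    return {
        "title": first(TITLE),
        "description": first(DESCRIPTION),
        "pictures": IMAGES.findall(html),
    }
```

Regular expressions keep each rule short, but they match text rather than document structure, so a change in a site's markup can silently break a rule.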
The website links to be acquired must be stored in the storage device beforehand; they can be entered by a separate recording program or imported by a separate import program.
The present embodiment can start the corresponding threads to process these links simultaneously. Each thread, once it has finished decomposing its own link, continues with subsequent links until no links remain.

Claims (4)

1. An automatic acquiring method of non-structured web page information, characterized in that it comprises the following steps:
1) a spider (crawler) acquisition computer system reads a website link table from a data storage device;
2) the system detects whether the website link table contains a URL to be acquired; if not, acquisition ends;
3) if the detection result of step 2) is yes, the decomposition rule corresponding to the URL to be acquired is selected;
4) at least one thread is created, and this at least one thread decomposes (parses) the page at the current URL using the selected decomposition rule;
5) after decomposition is complete, the web page information to be kept and the acquisition state information are saved to the data storage device, and the method returns to step 2).
2. The automatic acquiring method of non-structured web page information according to claim 1, characterized in that the non-structured web page information comprises a title, a description, and pictures.
3. The automatic acquiring method of non-structured web page information according to claim 1, characterized in that the decomposition rule uses regular expressions.
4. The automatic acquiring method of non-structured web page information according to claim 1, characterized in that the spider acquisition computer system and the website system to be acquired communicate with each other by means of Uniform Resource Locators (URLs) over the HTTP protocol.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007101706017A CN101441629A (en) 2007-11-19 2007-11-19 Automatic acquiring method of non-structured web page information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2007101706017A CN101441629A (en) 2007-11-19 2007-11-19 Automatic acquiring method of non-structured web page information

Publications (1)

Publication Number Publication Date
CN101441629A true CN101441629A (en) 2009-05-27

Family

ID=40726067

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007101706017A Pending CN101441629A (en) 2007-11-19 2007-11-19 Automatic acquiring method of non-structured web page information

Country Status (1)

Country Link
CN (1) CN101441629A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102118400A (en) * 2009-12-31 2011-07-06 北京四维图新科技股份有限公司 Data acquisition method and system
CN102118400B (en) * 2009-12-31 2013-07-17 北京四维图新科技股份有限公司 Data acquisition method and system
CN102298588A (en) * 2010-06-25 2011-12-28 株式会社理光 Method and device for extracting object from non-structured document
CN102298588B (en) * 2010-06-25 2014-04-30 株式会社理光 Method and device for extracting object from non-structured document
CN102096705A (en) * 2010-12-31 2011-06-15 南威软件股份有限公司 Article acquisition method
CN102955795A (en) * 2011-08-24 2013-03-06 句容今太科技园有限公司 Web information acquisition system
CN102915373B (en) * 2012-11-06 2016-08-10 无锡江南计算技术研究所 A kind of date storage method and device
CN102915373A (en) * 2012-11-06 2013-02-06 无锡江南计算技术研究所 Data storage method and device
CN104462140A (en) * 2013-09-24 2015-03-25 北大方正集团有限公司 Webpage data collecting method and device
CN104462566A (en) * 2014-12-26 2015-03-25 中科宇图天下科技有限公司 Environmental protection information grid capturing method
CN104462566B (en) * 2014-12-26 2017-11-21 中科宇图天下科技有限公司 A kind of environmental protection information grid grasping means
CN104484485A (en) * 2015-01-09 2015-04-01 携程计算机技术(上海)有限公司 Ticket number information extracting and outputting method and system and ticket discharging system
CN104484485B (en) * 2015-01-09 2019-03-12 上海携程商务有限公司 Draft bank information extraction deriving method, system and ticket system
CN105468664A (en) * 2015-05-12 2016-04-06 北京众标网络科技有限公司 Information acquisition method and apparatus
CN104965904A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Multi-platform data grabbing method and apparatus
CN104965904B (en) * 2015-06-30 2019-06-04 北京奇虎科技有限公司 A kind of grasping means of multi-platform data and device
CN105825399A (en) * 2016-03-15 2016-08-03 焦点科技股份有限公司 Internet based B2B e-commerce information collecting method
CN107918674A (en) * 2017-12-12 2018-04-17 携程旅游网络技术(上海)有限公司 Acquisition method and its system, storage medium, the electronic equipment of web data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090527