CN101441629A

CN101441629A - Automatic acquiring method of non-structured web page information

Info

Publication number: CN101441629A
Application number: CNA2007101706017A
Authority: CN
Inventors: 金骏; 戴斌华
Original assignee: XINNA ADVERTISEMENT MEDIA CO Ltd SHANGHAI
Current assignee: XINNA ADVERTISEMENT MEDIA CO Ltd SHANGHAI
Priority date: 2007-11-19
Filing date: 2007-11-19
Publication date: 2009-05-27

Abstract

The invention relates to a method for automatically acquiring information from unstructured web pages. The method comprises the following steps: (1) a spider collection computer system reads a website link table from a data storage device; (2) the system checks whether the website link table contains a network address still to be collected, and if not, collection ends; (3) if the result of step (2) is yes, the decomposition rule corresponding to the address to be collected is selected; (4) at least one thread is created, and it decomposes the page at the current address using the selected decomposition rule; and (5) after decomposition is finished, the web page information to be kept and the acquisition state information are stored in the data storage device, and the method returns to step (2). Compared with the prior art, the method extracts unstructured information from the source pages of various websites and stores it in a structured database, which can save a large amount of manpower and money in information collection and integration.

Description

A method for automatically acquiring unstructured web page information

Technical field

The present invention relates to the technical field of computer networks, and in particular to a method for automatically acquiring unstructured web page information.

Background technology

At present, acquisition systems are widely used in industry portal websites, CIS, knowledge management systems, website content systems, scientific research, and other fields. An acquisition system is a system that extracts unstructured information from the source pages of various websites and saves it in a structured database.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for automatically acquiring unstructured web page information that overcomes the defects of the above prior art.

The object of the present invention can be achieved through the following technical solution: a method for automatically acquiring unstructured web page information, characterized in that it comprises the following steps:

1) a spider collection computer system reads a website link table from a data storage device;

2) the system checks whether the website link table contains a network address to be collected; if not, collection ends;

3) if the result of step 2) is yes, the decomposition rule corresponding to the network address to be collected is selected;

4) at least one thread is created, and this thread decomposes the page at the current network address using the selected decomposition rule;

5) after decomposition is finished, the web page information to be kept and the acquisition state information are saved to the data storage device, and the method returns to step 2).
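The five steps above can be sketched as a single loop. This is only a minimal illustration under stated assumptions, not the implementation described in the patent: the in-memory `link_table`, the state names, the `EXAMPLE_RULES` dictionary keyed by host name, and the injected `fetch` function are all hypothetical, and the thread creation of step 4 is left out so that the control flow stays visible.

```python
import re
from urllib.parse import urlparse

# Hypothetical decomposition rules: one compiled pattern per host name.
EXAMPLE_RULES = {
    "a.example": re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S),
}


def run_collection(link_table, rules, fetch):
    """Process pending links until none remain (steps 1 to 5).

    link_table: list of dicts with "url" and "state" keys (assumed layout).
    rules:      dict mapping host name to a compiled regular expression.
    fetch:      callable that returns the page source for a URL.
    """
    results = []
    while True:
        # Step 2: look for a link that still needs collecting.
        pending = [row for row in link_table if row["state"] == "pending"]
        if not pending:
            break  # nothing left, so collection ends
        row = pending[0]

        # Step 3: select the rule that corresponds to this address.
        rule = rules.get(urlparse(row["url"]).netloc)
        if rule is None:
            row["state"] = "no_rule"
            continue

        # Step 4: decompose the page (done in the calling thread here).
        match = rule.search(fetch(row["url"]))

        # Step 5: save the information and the state, then return to step 2.
        if match:
            results.append({"url": row["url"], "title": match.group(1).strip()})
            row["state"] = "done"
        else:
            row["state"] = "failed"
    return results
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client; a dictionary lookup is enough to exercise the loop.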

The unstructured web page information comprises a title, a description, and pictures.

The decomposition rule uses regular expressions.

The spider collection computer system and the website system to be collected communicate with each other by means of Uniform Resource Locators (URLs) over the HTTP protocol.

Compared with the prior art, the present invention extracts unstructured information from the source pages of various websites and saves it in a structured database. The method of the present invention can save a large amount of manpower and money in information collection and integration.

Description of drawings

Fig. 1 is a process flow diagram of the present invention;

Fig. 2 is a schematic diagram of the present invention.

Embodiment

The invention is described in further detail below with reference to the accompanying drawings.

As shown in Figs. 1 and 2, a method for automatically acquiring unstructured web page information comprises steps 1) to 5) as set out in the summary of the invention above.

The unstructured web page information comprises a title, a description, and pictures. The decomposition rule uses regular expressions. The spider collection computer system and the website system to be collected communicate with each other by means of Uniform Resource Locators (URLs) over the HTTP protocol.

A spider collection computer system is set up on the network server side, together with a data storage device that saves the collected data. Data is collected from each website over the network: the spider collection computer system and the website systems to be collected communicate with each other by means of Uniform Resource Locators (URLs) over the Hypertext Transfer Protocol (HTTP).

Wherein:

The data storage device stores data. The spider foreground collection program obtains from it the list of links to be collected, and the corresponding acquisition state is also updated there.
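One way to picture such a storage device is a small table that holds each link together with its acquisition state. The sketch below uses Python's built-in `sqlite3` module purely as an assumption; the patent does not name a database, and the table name `links`, its columns, and the state values are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) link table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " url TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def add_links(conn, urls):
    """Record links to be collected; links already present are ignored."""
    conn.executemany(
        "INSERT OR IGNORE INTO links (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def pending_links(conn):
    """Return the links that still need collecting, in insertion order."""
    rows = conn.execute(
        "SELECT url FROM links WHERE state = 'pending' ORDER BY rowid"
    )
    return [row[0] for row in rows]


def update_state(conn, url, state):
    """Write the acquisition state of one link back to the store."""
    conn.execute("UPDATE links SET state = ? WHERE url = ?", (state, url))
    conn.commit()
```

A collection program would call `pending_links` to obtain its work list and `update_state` after each link has been handled.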

The spider collection computer system processes each link: it downloads the page the link points to, determines which set of decomposition rules to apply, and performs the actual decomposition.
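The patent does not say how the rule set is chosen for a link. A simple possibility, shown here only as an assumption, is to key the rule sets by the host name of the link address. The hosts and patterns in `RULE_SETS` are invented for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical rule sets: one dict of compiled patterns per website.
RULE_SETS = {
    "news.example.com": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S),
    },
    "shop.example.com": {
        "title": re.compile(r'<div class="name">(.*?)</div>', re.I | re.S),
    },
}


def select_rule_set(url, rule_sets=RULE_SETS):
    """Return the rule set registered for the URL's host, or None."""
    # urlparse(...).hostname is lowercased and excludes any port number.
    host = urlparse(url).hostname or ""
    return rule_sets.get(host)
```

A link whose host has no registered rule set yields `None`, which the caller can record as an acquisition state instead of attempting a decomposition.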

In the present embodiment, the spider collection system first reads the collection link list (source list) saved in advance in the storage system. Each entry in the list holds a link to be collected and its current state. When collection begins, threads are started according to the configured number of threads and the configured maximum number of pictures to save; each thread then determines which set of decomposition rules to use from the link address it is currently collecting. Next, the spider collection system downloads the source code of the corresponding page and decomposes it according to the selected decomposition rules. The present embodiment uses regular expressions to extract the attributes that need to be saved, such as the title, the description, and the picture list. This cycle repeats until the last link has been decomposed. During decomposition, any pictures found are saved in a corresponding directory (under the directory where the spider collection system resides), and the collected data is saved to the storage system.

The above method uses multithreaded execution, because the amount of data to collect is large and collection takes a long time. Each thread signals its progress and updates through delegates, and the main interface then updates the element values and statistics it displays according to the parameters received. After a thread has decomposed one link, it calls a function in the main program to obtain the next link address; if there is one, the thread decomposes it, and if not, the thread stops. This continues until the last link has been processed.
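The threading scheme described above can be sketched as follows. The names `LinkDispenser`, `worker`, and `run_workers` are hypothetical, and a plain Python callback stands in for the delegate; in a real graphical program the callback would usually have to hand the update over to the interface thread, which this sketch does not attempt.

```python
import threading


class LinkDispenser:
    """Hands out the next link address; shared by all worker threads."""

    def __init__(self, urls):
        self._urls = list(urls)
        self._lock = threading.Lock()

    def next_link(self):
        # The lock ensures that no two threads receive the same link.
        with self._lock:
            return self._urls.pop(0) if self._urls else None


def worker(dispenser, decompose, on_update):
    """Decompose links one after another until none is left."""
    while True:
        url = dispenser.next_link()
        if url is None:
            return  # no further link: this thread stops
        on_update(url, decompose(url))  # report the result to the caller


def run_workers(urls, decompose, on_update, thread_count=4):
    """Start thread_count workers and wait until all links are processed."""
    dispenser = LinkDispenser(urls)
    threads = [
        threading.Thread(target=worker, args=(dispenser, decompose, on_update))
        for _ in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Because `on_update` is called from the worker threads, a callback that touches shared state should protect it with its own lock.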

The above method uses regular expressions to decompose the attributes, because regular expressions simplify many string-processing problems.
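As an illustration of regular-expression decomposition, the sketch below pulls a title, a description, and a picture list out of a page's source. The three patterns are assumptions written for simple, double-quoted HTML; the patent does not disclose its actual expressions, and real pages would generally need site-specific rules. The `max_pictures` parameter is a hypothetical stand-in for the maximum number of pictures to save mentioned in the embodiment.

```python
import re

# Hypothetical patterns for simple, double-quoted HTML attributes.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)
DESC_RE = re.compile(r'<meta\s+name="description"\s+content="([^"]*)"', re.I)
IMG_RE = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"', re.I)


def decompose_page(html, max_pictures=10):
    """Extract title, description and picture URLs from page source."""
    title = TITLE_RE.search(html)
    desc = DESC_RE.search(html)
    return {
        "title": title.group(1).strip() if title else None,
        "description": desc.group(1).strip() if desc else None,
        # Keep at most max_pictures picture addresses, in page order.
        "pictures": IMG_RE.findall(html)[:max_pictures],
    }
```

Attributes that a pattern does not find are returned as `None` (or an empty list for pictures), so the caller can decide whether the page counts as successfully decomposed.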

The storage device needs to hold the website links to be collected. These links can be entered by a separate recording program or brought in by a separate import program.

The present embodiment can start several threads to process these links simultaneously. As each thread finishes decomposing a link, it goes on to process subsequent links until no links remain.

Claims

1. A method for automatically acquiring unstructured web page information, characterized in that it comprises the following steps:

2. The method for automatically acquiring unstructured web page information according to claim 1, characterized in that the unstructured web page information comprises a title, a description, and pictures.

3. The method for automatically acquiring unstructured web page information according to claim 1, characterized in that the decomposition rule uses regular expressions.

4. The method for automatically acquiring unstructured web page information according to claim 1, characterized in that the spider collection computer system and the website system to be collected communicate with each other by means of Uniform Resource Locators (URLs) over the HTTP protocol.