CN109948018A

CN109948018A - Method and system for rapid extraction of Web structured data

Info

Publication number: CN109948018A
Application number: CN201910021939.9A
Authority: CN
Inventors: 刘芳作; 陈薇; 王腾蛟; 李强; 邱镇; 崔迎宝
Original assignee: Peking University; State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; State Grid Zhejiang Electric Power Co Ltd
Current assignee: Peking University; State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; State Grid Zhejiang Electric Power Co Ltd
Priority date: 2019-01-10
Filing date: 2019-01-10
Publication date: 2019-06-28
Anticipated expiration: 2039-01-10
Also published as: CN109948018B

Abstract

The present invention provides a method for rapid extraction of Web structured data, comprising the steps of: traversing, level by level, the links in the pages of a target website, and recording in list form the URLs of all sites and the corresponding URL jump sequences; identifying the URLs of the target webpages in the list, recording the corresponding URL jump sequences, and concatenating the URLs in the order of access from the target website to the target webpage to generate URL jump sequences for accessing the target webpage; finding the optimal URL jump sequence for accessing the target webpage using a greedy randomized adaptive search procedure; accessing the target webpage along the optimal URL jump sequence, traversing all data item tags of the target page level by level, and finding all target data item tags; finding the shortest tag path among the target data item tags; and crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path. The present invention also provides a system for rapid extraction of Web structured data.

Description

Method and system for rapid extraction of Web structured data

Technical field

The present invention relates to web data crawling, and in particular to the fields of website and webpage structure and file access algorithms. It is a method for extracting structured data from Web pages that uses an adaptive algorithm and tag data matching, and it can extract specified web data items precisely and rapidly.

Background technique

Existing approaches to web data acquisition fall mainly into two kinds: one first downloads whole pages and then extracts the data locally; the other extracts the target data directly from the page and stores it locally.

In the first approach, every target page of the target website is requested, and all the information on each page is stored once the response is received. After all page information has been saved to local storage, the target data items to be extracted are filtered out locally. The advantages of this approach are strong compatibility and stable performance. It follows an industrial-scale philosophy and is the main approach currently used by major companies such as 360 and JD Cloud. Compatibility here means that distributed page access and acquisition are easy to implement: thousands of hosts can access the target webpages of the target website simultaneously. This is a brute-force approach that needs little URL path optimization and compensates with the number of distributed hosts. Each host can request a different target page at the same time, which both saves page loading time and works across different target websites without modification, hence the good compatibility. Stability means that all pages containing the data items have already been downloaded locally before the target data items are extracted. The structure, fields, and other logic of the data items to be extracted can be handled locally: staff mark the target data items locally, and a local server then performs unified computation to obtain the target data. The drawback of this approach is its high barrier to entry: it requires large machine clusters, high-speed bandwidth, large amounts of local storage, and substantial computing power. It also wastes considerable resources, chiefly because large amounts of web data unrelated to the target data consume bandwidth and traffic when downloaded, and large amounts of local storage must be provided for page information unrelated to the extracted data.

In the second approach, data filtering is performed directly while the page is being accessed, and the target data is extracted in one pass and stored locally, as in frameworks such as Scrapy. The server only needs to enter the site to be crawled and open the target webpage; the target information is extracted from the page information as the target webpage loads. This approach makes better use of network bandwidth and local storage, but it is inferior to the first approach in compatibility and stability. Because the target data items are extracted while the page loads, pages with different structures require different extraction logic. Pages sharing the same structure can be handled, but for different page structures, such as different websites, the extraction logic must be rebuilt. Because extraction depends on page loading, a failed page load also means a failed extraction, so extraction performance is unstable. Moreover, page loading latency and page data filtering time both add to the extraction time of each record, so the extraction of a single record is slow.

In conclusion there are bandwidth traffics to waste, local storage space waste is low with compatibility, stablizes for both the above method The problems such as property is low, and these problems are not all mutually merged and are improved.

Summary of the invention

To overcome the deficiencies of the existing methods described above, the present invention addresses the problem of web data extraction by proposing a method and system for rapid extraction of Web structured data. Using an adaptive algorithm and tag data matching, the method can extract specified web data items precisely and rapidly.

To solve the above technical problems, the technical solution adopted by the present invention is as follows:

A method for rapid extraction of Web structured data, comprising the steps of:

1) traversing, level by level, the links in the pages of the target website to be accessed, and recording in list form all URLs and URL jump sequences (i.e. the URL logic);

2) identifying the URLs of the target webpages in the list;

3) determining, by matching, the position of each URL of the site hosting the target webpage in the jump sequence list;

4) accessing the target webpages in the URL jump sequence list multiple times, and recording the optimal URL jump sequence according to the greedy randomized adaptive search procedure;

5) accessing the target webpage along the optimal URL jump sequence, and traversing all data item tags of the target page level by level;

6) finding all target data item tags;

7) comparing the target data item tags and finding the shortest path;

8) crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.
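
Steps 1) to 3) amount to a level-order (breadth-first) walk over the links of the site. The following is a minimal Python sketch under stated assumptions, not the implementation of the invention: the site is a hypothetical in-memory link map `SITE` rather than pages fetched over HTTP, `level_traverse` is an illustrative name, and only the first (shallowest) jump sequence found for each URL is kept.

```python
from collections import deque


def level_traverse(links, start):
    """Level-order walk of a link map.

    Returns a dict mapping every reachable URL to its jump sequence,
    i.e. the list of URLs followed from `start` to reach it.
    """
    sequences = {start: [start]}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for nxt in links.get(url, []):
            if nxt not in sequences:  # first visit is the shallowest one
                sequences[nxt] = sequences[url] + [nxt]
                queue.append(nxt)
    return sequences


# Hypothetical site: the target page is reachable directly via /dept
# and by a longer detour via /news.
SITE = {
    "/": ["/dept", "/news"],
    "/dept": ["/dept/doctor"],
    "/news": ["/news/a"],
    "/news/a": ["/dept/doctor"],
}

jump_sequences = level_traverse(SITE, "/")
target_sequence = jump_sequences["/dept/doctor"]
```

Looking up the target URL in `jump_sequences` corresponds to identifying the target webpage in the recorded list.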

A system for rapid extraction of Web structured data, comprising:

a target determination module, for confirming the attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage, and the page tags of the target webpage;

a URL logic generation module, for summarizing the relationship of the target webpage within the target website and finding the optimal URL jump sequence for accessing the target webpage;

a tag path generation module, for matching the page tags of the target webpage and finding the shortest path to the target data item tags;

a process-based webpage access module, for accessing the target webpage along the optimal URL jump sequence;

a thread-based data extraction module, for extracting the target data on the target webpage according to the shortest path.

Given the target website, the target URLs, and the target data, the present invention parses the structure of the target website by exploratory access to its pages, then uses an adaptive algorithm to derive the URL jump sequence between target URLs within the target website. After entering the target webpage, it automatically matches the data tags of the page against the data item tags of the target data, then finds the shortest path to the target data item tags and extracts the target data. With both the URL jump sequence and the data extraction path optimized, the invention can either fetch the target webpage first and extract the data locally, or extract the target data of the target webpage directly and store it locally, and its data extraction efficiency is much higher than that of the two traditional approaches.
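
The tag matching and shortest-path search described above can be illustrated with a small Python sketch. It is an illustration under assumptions only: the class `TagPathFinder`, the sample markup `PAGE`, and the choice to identify a target data item by its text are all hypothetical, and the markup is assumed to be well formed with no void elements such as `<br>`.

```python
from html.parser import HTMLParser


class TagPathFinder(HTMLParser):
    """Records the tag path (root to element) of every text node
    that contains one of the target strings."""

    def __init__(self, targets):
        super().__init__()
        self.targets = targets
        self.stack = []   # tags that are currently open
        self.paths = {}   # target text -> list of tag paths

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        for target in self.targets:
            if target in text:
                self.paths.setdefault(target, []).append(tuple(self.stack))


# Hypothetical page in which the same data item appears under two tag paths.
PAGE = """
<html><body>
  <div><ul><li><span>Dr. Li</span></li></ul></div>
  <h1>Dr. Li</h1>
</body></html>
"""

finder = TagPathFinder(["Dr. Li"])
finder.feed(PAGE)
# Shortest path = the tag path with the fewest levels.
shortest = min(finder.paths["Dr. Li"], key=len)
```

Here path length is simply the number of nested tags, which is one plausible reading of the depth criterion used later in the embodiment.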

Brief description of the drawings

Fig. 1 is a block diagram of a system for rapid extraction of Web structured data in an embodiment.

Fig. 2 is a flow chart of a method for rapid extraction of Web structured data in an embodiment.

Detailed description of the embodiments

To make the above features and advantages of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

Fig. 1 is a block diagram of the system for rapid extraction of Web structured data proposed in this embodiment, which comprises the following modules:

102: target determination module, responsible for confirming the relevant attributes of the target data to be extracted, including the URL of the target website to be crawled, the URL of the target webpage, and the page tags of the target webpage;

103: URL logic generation module, responsible for parsing the structure of the target website, summarizing the relationship of the target webpage within the target website, and providing the optimal URL jump sequence for accessing the target webpage, so that the target webpage can be reached quickly;

104: tag path generation module, responsible for matching the page tags of the target webpage and finding the shortest path to the target data item tags, so that the target data can be extracted quickly;

105: process-based webpage access module, in which a process, after obtaining the optimal URL jump sequence, issues an access request to the target webpage and loads the accessed page;

106: thread-based data extraction module, in which, once the process has loaded the target webpage, threads extract the target data on the target webpage along the shortest path provided by the tag path generation module.
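
The division of labour between modules 105 and 106 (load the page once, then let threads extract individual data items along known tag paths) might be sketched as follows. This is an assumption-heavy Python illustration: it uses a thread pool inside a single process, page loading is replaced by a lookup in the hypothetical in-memory dictionary `SITE_PAGES`, and `load_page` and `extract` are illustrative names rather than parts of the described system.

```python
from concurrent.futures import ThreadPoolExecutor


def load_page(url, site):
    """Stand-in for module 105: 'loads' a page from an in-memory site."""
    return site[url]


def extract(page, tag_path):
    """Stand-in for module 106: reads the value stored at one tag path."""
    return page.get(tag_path)


# Hypothetical loaded pages, keyed by URL, then by tag path.
SITE_PAGES = {
    "/dept/doctor": {
        ("html", "body", "h1"): "Dr. Li",
        ("html", "body", "p"): "Cardiology",
    }
}
paths = [("html", "body", "h1"), ("html", "body", "p")]

page = load_page("/dept/doctor", SITE_PAGES)      # the page is loaded once
with ThreadPoolExecutor(max_workers=2) as pool:   # one worker per data item
    values = list(pool.map(lambda p: extract(page, p), paths))
```

`pool.map` returns results in the order of `paths`, so each extracted value can be paired with the tag path it came from.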

Fig. 2 is a flow chart of the method for rapid extraction of Web structured data proposed in this embodiment, which comprises the following steps:

202: given a specific format, which can be user-defined and extended (for example, direct concatenation using https as the delimiting keyword), determine the URL of the target website and the URL of the target webpage according to that format, then concatenate the two URLs in order and input them;

203: traverse, level by level, the links in the pages of the target website, and record in list form the URLs within the site and the URL jump sequences used to access them;

204: identify the URLs of the target webpages in the list, and record the corresponding URL jump sequences;

205: determine whether the URL jump sequence of at least one target webpage has been recorded among the identified target webpage URLs; if so, go to 206; otherwise return to 203 to continue traversing the target website and recording URL jump sequences;

206: among the recorded URL jump sequences of the target webpage, concatenate the URLs directly into a string in the order of access from the target website to the target webpage, generating URL jump sequences for accessing the target webpage;

207: use the greedy randomized adaptive search procedure to select and merge out the shortest URL jump sequence for accessing the target webpage, take it as the optimal URL jump sequence from the target website to the target webpage, and record it;

208: test the generated URL jump sequence, using the minimum number of jumps as the criterion for verifying whether the sequence meets the optimization standard; if it does, go to 209; otherwise return to 207 to continue generating optimized URL jump sequences;

209: issue an access request to the target webpage and load its page information;

210: classify the page information of the target webpage by structure: record all accessible data items on the target webpage together with the data item tag corresponding to each, and classify the data item tags according to their expanded hierarchical structure. Record the classified data items and data item tag information as preprocessing for traversing the target data;

211: traverse all recorded data item information to search for the target data items, and record all tags corresponding to the target data; the set of all data item tags belonging to a data item is its tag structure information;

212: count the recorded target data item tags; if all target data item tags have been found, go to 213; if any are missing after the search, return to 211 to continue traversing the tag data;

213: from the recorded tag data, extract the structural information of the tags and record the tag structure information;

214: classify the recorded tag structure information again, i.e. perform similarity classification according to the expanded hierarchical structure of the data item tags;

215: from the classified tag structure information, select and merge tag structures using the adaptive principle, merging identical paths, and convert the tag structures into tree format to generate the shortest path;

216: verify the generated shortest path by comparison testing, using the shallowest tag structure depth (i.e. path length) as the evaluation criterion; the criterion is applied by comparing the depths against one another; if the test standard is met, go to 217; otherwise return to 213 to continue extracting tag structure information;

217: on the basis of the optimal URL jump sequence and the shortest path, start the data crawling process again and access the series of target webpages from which data is to be extracted.
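
Steps 207 and 208 can be sketched with a generic greedy randomized adaptive search procedure (GRASP). The source does not specify the greedy function, the candidate list rule, or the local search, so everything below is an assumption made for illustration: the one-step lookahead cost, the `alpha` threshold for the restricted candidate list, the `shortcut` local search, and the hypothetical link map `LINKS`.

```python
import random


def construct(links, start, target, alpha, rng):
    """Greedy randomized construction of one jump sequence."""
    path = [start]
    while path[-1] != target:
        candidates = [n for n in links.get(path[-1], []) if n not in path]
        if not candidates:
            return None  # dead end: discard this attempt
        # Assumed greedy cost: 0 if n is the target, 1 if n links to it, else 2.
        cost = {n: 0 if n == target else 1 if target in links.get(n, []) else 2
                for n in candidates}
        lo, hi = min(cost.values()), max(cost.values())
        # Restricted candidate list: candidates close enough to the best cost.
        rcl = [n for n in candidates if cost[n] <= lo + alpha * (hi - lo)]
        path.append(rng.choice(rcl))
    return path


def shortcut(links, path):
    """Local search: cut out detours wherever a direct link exists."""
    improved = True
    while improved:
        improved = False
        for i in range(len(path) - 2):
            for j in range(len(path) - 1, i + 1, -1):
                if path[j] in links.get(path[i], []):
                    path = path[:i + 1] + path[j:]
                    improved = True
                    break
            if improved:
                break
    return path


def grasp(links, start, target, iterations=20, alpha=0.5, seed=0):
    """Repeat construction plus local search; keep the fewest-jump sequence."""
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        path = construct(links, start, target, alpha, rng)
        if path is None:
            continue
        path = shortcut(links, path)
        if best is None or len(path) < len(best):
            best = path
    return best


# Hypothetical link map with a short route, a long route, and a dead end.
LINKS = {
    "/": ["/news", "/dept", "/about"],
    "/news": ["/news/a"],
    "/news/a": ["/dept/doctor"],
    "/dept": ["/dept/doctor"],
    "/about": [],
}
best = grasp(LINKS, "/", "/dept/doctor")
```

The comparison `len(path) < len(best)` plays the role of the minimum-jump check in step 208; a larger `alpha` widens the candidate list and makes the construction more random.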

Taking the test website (the Good Doctor website) as an example:

Table 1. Comparison of results for various web data extraction methods.

As Table 1 shows, compared with conventional methods, the method of the present invention requires fewer hosts and processes when extracting Web structured data, accesses fewer webpages and extracts fewer records, takes less time to crawl, and uses storage space more efficiently. The method of the present invention can therefore extract Web structured data rapidly.

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from its spirit and scope. The scope of protection of the present invention shall be defined by the claims.

Claims

1. A method for rapid extraction of Web structured data, comprising the steps of:

traversing, level by level, the links in the pages of a target website, and recording in list form the URLs of all sites and the corresponding URL jump sequences;

identifying the URLs of the target webpages in the list, recording the corresponding URL jump sequences, and concatenating the URLs in the order of access from the target website to the target webpage to generate URL jump sequences for accessing the target webpage;

finding the optimal URL jump sequence for accessing the target webpage using a greedy randomized adaptive search procedure;

accessing the target webpage along the optimal URL jump sequence, traversing all data item tags of the target page level by level, and finding all target data item tags;

finding the shortest tag path among the target data item tags;

crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.

2. The method according to claim 1, characterized in that, before the traversal, the attributes of the Web structured data to be extracted are confirmed, including the URL of the target website, the URL of the target webpage, and the page tags of the target webpage.

3. The method according to claim 1, characterized in that, if no URL jump sequence of a target webpage has been recorded among the identified target webpage URLs, traversal of the links in the pages of the target website continues.

4. The method according to claim 1, characterized in that the minimum number of jumps is used to verify whether the URL jump sequence for accessing the target webpage has met the optimization standard; if not, optimized URL jump sequences continue to be generated until the optimal URL jump sequence for accessing the target webpage is obtained.

5. The method according to claim 1, characterized in that, when the target webpage is accessed, all accessible data items on the target webpage are recorded together with the data item tag corresponding to each data item, the data item tags are classified according to their expanded hierarchical structure, and the classified data items and data item tag information are recorded.

6. The method according to claim 1, characterized in that the target data item tags found are counted, and if not all target data item tags have been found, traversal of the data item tags of the target page continues.

7. The method according to claim 1, characterized in that the shortest path is found by performing similarity classification on all target data item tags found according to their expanded hierarchical structure, merging identical paths using the adaptive principle, and converting the target data item tags into tree format to generate the shortest path.

8. The method according to claim 1, characterized in that an evaluation criterion is set, the path lengths of all target data item tags are compared, and the target data item tag path whose length is shortest and meets the criterion is taken as the shortest path.

9. A system for rapid extraction of Web structured data, comprising:

a URL logic generation module, for summarizing the relationship of the target webpage within the target website and finding the optimal URL jump sequence for accessing the target webpage;

a tag path generation module, for matching the page tags of the target webpage and finding the shortest path to the target data item tags;

10. The system according to claim 9, characterized in that it further comprises a target determination module, for confirming the attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage, and the page tags of the target webpage.