CN109948018A - Method and system for rapidly extracting Web structured data - Google Patents

Method and system for rapidly extracting Web structured data

Info

Publication number
CN109948018A
Authority
CN
China
Prior art keywords
target
network address
data item
target webpage
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910021939.9A
Other languages
Chinese (zh)
Other versions
CN109948018B (en)
Inventor
刘芳作
陈薇
王腾蛟
李强
邱镇
崔迎宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Peking University
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, State Grid Zhejiang Electric Power Co Ltd
Priority to CN201910021939.9A
Publication of CN109948018A
Application granted
Publication of CN109948018B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a method for rapidly extracting Web structured data, comprising the steps of: traversing, level by level, the website links in the pages of a target website, and recording in list form the URLs of all sites together with the corresponding URL jump sequences; identifying the URL of the target webpage in the list and recording the corresponding URL jump sequences, then splicing the URLs in the order in which they are visited from the target website to the target webpage to generate URL jump sequences for accessing the target webpage; finding the optimal URL jump sequence for accessing the target webpage by a greedy randomized adaptive search algorithm; accessing the target webpage according to the optimal URL jump sequence, traversing all data item tags of the target page level by level, and finding all target data item tags; finding the shortest tag path among the target data item tags; and crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path. The present invention also provides a system for rapidly extracting Web structured data.

Description

Method and system for rapidly extracting Web structured data
Technical field
The present invention relates to web data crawling and to the fields of website and webpage structure and file access algorithms. It is a method for extracting structured data from Web pages that uses an adaptive algorithm together with tag data matching, and that can extract specified web data items precisely and rapidly.
Background art
Existing web data acquisition approaches fall into two main kinds: the first acquires all pages of a site and then extracts the data locally; the second extracts the target data directly from the page and stores it locally.
In the first approach, every target page of the target website is accessed, and once the response is received all information on the page is stored. After all page information has been written to local storage, the target data items are filtered and extracted locally. The advantages of this approach are strong compatibility and stable performance. It reflects an industrial mindset and is the main approach currently used by large companies such as 360 and JD Cloud. Compatibility comes from the ease with which page access and acquisition can be distributed: thousands of hosts can access the target pages of a target website at the same time. This is a brute-force approach that needs little URL path optimization and compensates with the number of distributed hosts. Each host can request a different target page at the same time, which saves page loading time and allows the same setup to be applied to different target websites without modification. Stability comes from the fact that, before any target data item is extracted, all pages containing the data items have already been downloaded locally, so the structure, fields and other logic of the data items can be handled locally. Staff who need the data mark the target data items locally, and a local server then performs unified computation to obtain the target data. The drawback of this approach is its high barrier to entry: it needs large machine clusters, high-speed bandwidth, a large amount of local storage and substantial computing power. It also wastes resources considerably, chiefly because large amounts of web data unrelated to the target data consume bandwidth and traffic when downloaded, and large amounts of local storage must be provided for webpage information unrelated to the extracted data.
In the second approach, data are filtered directly while the page is being accessed, and the target data are stored locally after a single extraction pass, as in the scrapy framework. The server only needs to enter the crawled website and open the target webpage; the target information is extracted from the page information while the target webpage loads. This approach makes better use of network bandwidth and local storage, but it is inferior to the first approach in compatibility and stability. Because the target data items are extracted while the page is loading, pages with different structures require different extraction logic. Pages that share one structure can be handled, but for different page structures, for example on different websites, the extraction logic must be rebuilt. Because extraction depends on page loading, a failed page load also causes the extraction to fail, so extraction performance is unstable. In addition, the page loading delay and the page data filtering time both add to the extraction time of each record, which makes extraction of a single record slower.
In summary, the two approaches suffer from wasted bandwidth and traffic, wasted local storage, low compatibility and low stability, and no existing approach combines the two to overcome these problems together.
Summary of the invention
To overcome the deficiencies of the existing methods described above, the present invention addresses the problem of web data extraction by proposing a method and system for rapidly extracting Web structured data. The method uses an adaptive algorithm together with tag data matching to extract specified web data items precisely and rapidly.
To solve the above technical problem, the present invention adopts the following technical solution:
A method for rapidly extracting Web structured data, comprising the steps of:
1) traversing, level by level, the website links in the pages of the target website to be accessed, and recording in list form all URLs and their URL jump sequences (i.e. the URL logic);
2) identifying the URL of the target webpage in the list;
3) determining, by matching, the position of each URL of the site containing the target webpage in the jump sequence list;
4) accessing the target webpage in the jump sequence list multiple times, and recording the optimal URL jump sequence according to a greedy randomized adaptive search algorithm;
5) accessing the target webpage according to the optimal URL jump sequence, and traversing all data item tags of the target page level by level;
6) finding all target data item tags;
7) comparing the target data item tags and finding the shortest path;
8) crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.
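The level traversal in steps 1) to 3) can be illustrated with a minimal sketch. It assumes the site's links are already available as an in-memory mapping (the hypothetical `SITE` dictionary stands in for fetching and parsing pages), and it keeps only the first jump sequence discovered for each URL, whereas the method described here records candidate sequences for later selection. The function name and URLs are illustrative and do not come from the patent.

```python
from collections import deque


def level_traverse(links, start_url):
    """Level-order (breadth-first) walk of a site's link graph.

    links: dict mapping each URL to the URLs it links to (assumed input).
    Returns a dict mapping every reachable URL to the jump sequence
    (list of URLs) by which it was first reached from start_url.
    """
    jump_sequences = {start_url: [start_url]}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for nxt in links.get(url, []):
            if nxt not in jump_sequences:  # first visit uses the fewest jumps
                jump_sequences[nxt] = jump_sequences[url] + [nxt]
                queue.append(nxt)
    return jump_sequences


# Hypothetical site structure used only for illustration.
SITE = {
    "https://example.com/": ["https://example.com/dept", "https://example.com/news"],
    "https://example.com/dept": ["https://example.com/dept/doctor/1"],
    "https://example.com/news": ["https://example.com/dept/doctor/1"],
}
sequences = level_traverse(SITE, "https://example.com/")
target_sequence = sequences["https://example.com/dept/doctor/1"]
```

Looking up the target webpage's URL in `sequences` corresponds to identifying the target URL in the recorded list and reading off its jump sequence.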
A system for rapidly extracting Web structured data, comprising:
a target determination module, configured to confirm the attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage and the page tags of the target webpage;
a URL logic generation module, configured to summarize the relationship of the target webpage within the target website and find the optimal URL jump sequence for accessing the target webpage;
a tag path generation module, configured to match the page tags of the target webpage and find the shortest path to the target data item tags;
a process webpage access module, configured to access the target webpage according to the optimal URL jump sequence;
a thread data extraction module, configured to extract the target data from the target webpage according to the shortest path.
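As a rough illustration of the attributes the target determination module confirms, the sketch below models them as a small record and parses an input in which the two URLs are spliced directly, using https as the delimiting keyword as the embodiment suggests. The class, field and function names are assumptions made for this example, and the splitting rule would break if the keyword also appeared inside a URL path.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionTarget:
    """Attributes confirmed before crawling (field names are illustrative)."""
    site_url: str
    page_url: str
    page_tags: list = field(default_factory=list)


def split_spliced_urls(spliced, keyword="https"):
    """Split a directly spliced URL string on the delimiting keyword."""
    return [keyword + part for part in spliced.split(keyword) if part]


def parse_target(spliced, page_tags):
    """Build the target record from the spliced input and the wanted tags."""
    site_url, page_url = split_spliced_urls(spliced)
    return ExtractionTarget(site_url, page_url, list(page_tags))


# Hypothetical input: target website URL followed by target webpage URL.
target = parse_target(
    "https://example.com/https://example.com/dept/doctor/1",
    ["name", "department"],
)
```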
Given a determined target website, target URL and target data, the present invention heuristically accesses the pages of the website and parses the structure of the target website, then uses an adaptive algorithm to summarize the URL jump sequences between target URLs within the target website. After entering the target webpage, it automatically matches the page data tags of the webpage against the data item tags of the target data, finds the shortest path to the target data item tags, and extracts the target data. Through the optimization of the URL jump sequence and of the data extraction path, the data extraction efficiency of the present invention is much higher than that of the two conventional approaches, namely acquiring the pages first and extracting data locally, or directly extracting the target data of the target webpage and storing it locally.
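The patent names a greedy randomized adaptive search algorithm for choosing the jump sequence but does not give its details, so the following is only one possible sketch. It assumes an in-memory link graph, a greedy score based on the length of the URL prefix shared with the target (an assumption of this example, not of the patent), a restricted candidate list controlled by `alpha`, and a simple local search that removes intermediate hops when an earlier page links directly to a later one.

```python
import random


def grasp_jump_sequence(links, start, target, iterations=50, alpha=0.5, seed=0):
    """Search for a short start-to-target jump sequence (illustrative sketch).

    links: dict mapping URL -> list of linked URLs (assumed input).
    Returns the shortest sequence found, or None if the target is unreachable
    in every constructed attempt.
    """
    rng = random.Random(seed)

    def score(url):
        # Greedy heuristic (assumption): length of the prefix shared with target.
        n = 0
        for a, b in zip(url, target):
            if a != b:
                break
            n += 1
        return n

    def construct():
        path, seen = [start], {start}
        while path[-1] != target:
            cands = [u for u in links.get(path[-1], []) if u not in seen]
            if not cands:
                return None  # dead end, abandon this attempt
            if target in cands:
                path.append(target)
                break
            scores = [score(u) for u in cands]
            hi, lo = max(scores), min(scores)
            cutoff = hi - alpha * (hi - lo)
            # Restricted candidate list: the best-scoring candidates only.
            rcl = [u for u, s in zip(cands, scores) if s >= cutoff]
            nxt = rng.choice(rcl)
            path.append(nxt)
            seen.add(nxt)
        return path

    def shorten(path):
        # Local search: drop hops when path[i] links directly to a later page.
        i = 0
        while i < len(path) - 2:
            for j in range(len(path) - 1, i + 1, -1):
                if path[j] in links.get(path[i], []):
                    path = path[: i + 1] + path[j:]
                    break
            i += 1
        return path

    best = None
    for _ in range(iterations):
        cand = construct()
        if cand is None:
            continue
        cand = shorten(cand)
        if best is None or len(cand) < len(best):
            best = cand
    return best


# Hypothetical site structure used only for illustration.
ROOT = "https://example.com/"
TARGET = "https://example.com/dept/doc/1"
SITE = {
    ROOT: ["https://example.com/dept", "https://example.com/news"],
    "https://example.com/dept": [TARGET],
    "https://example.com/news": ["https://example.com/news/today"],
    "https://example.com/news/today": [TARGET],
}
best = grasp_jump_sequence(SITE, ROOT, TARGET)
```

The minimum number of jumps serves as the acceptance check here, in the same spirit as the verification described for the jump sequence; on larger graphs this kind of search is not guaranteed to find the true minimum.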
Brief description of the drawings
Fig. 1 is a block diagram of the Web structured data rapid extraction system in an embodiment.
Fig. 2 is a flowchart of the Web structured data rapid extraction method in an embodiment.
Detailed description of the embodiments
To make the above features and advantages of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
Fig. 1 is a block diagram of the Web structured data rapid extraction system proposed in this embodiment, which comprises the following modules:
102: a target determination module, responsible for confirming the relevant attributes of the target data to be extracted, including the URL of the target website to be crawled, the URL of the target webpage and the page tags of the target webpage;
103: a URL logic generation module, responsible for parsing the structure of the target website, summarizing the relationship of the target webpage within the target website, and providing the optimal URL jump sequence for accessing the target webpage so that the target webpage can be reached quickly;
104: a tag path generation module, responsible for matching the page tags of the target webpage and finding the shortest path to the target data item tags so that the target data can be extracted quickly;
105: a process webpage access module, in which, after the optimal URL jump sequence has been obtained, a process issues an access request to the target webpage and loads the accessed page;
106: a thread data extraction module, in which, once the process has loaded the target webpage, a thread extracts the target data from the target webpage along the shortest path provided by the tag path generation module.
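To make the notion of a tag path concrete, the sketch below records, for every text item on a loaded page, the chain of enclosing tags from the document root. It uses Python's standard `html.parser` module; the class name, the list of void tags and the sample page are assumptions of this example, and the patent does not prescribe a particular parser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class TagPathRecorder(HTMLParser):
    """Record (tag path, text) for every non-empty text node on a page."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.items = []  # list of (tuple of tags, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))


# Hypothetical page content used only for illustration.
PAGE = """
<html><body>
  <div class="card"><h1>Dr. Wang</h1><p><span>Cardiology</span></p></div>
  <div class="footer"><p>Contact us</p></div>
</body></html>
"""
recorder = TagPathRecorder()
recorder.feed(PAGE)
recorder.close()
paths = {text: path for path, text in recorder.items}
```

A data extraction thread could then look up only the paths belonging to the target data items instead of scanning the whole page.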
Fig. 2 is a flowchart of the Web structured data rapid extraction method proposed in this embodiment, which comprises the following steps:
202: determining the URL of the target website and the URL of the target webpage according to a given format, and splicing and inputting the two URLs in order; the format can be defined by the user and is extensible, for example direct splicing with https as the delimiting keyword;
203: traversing, level by level, the website links in the pages of the target website, and recording in list form the URLs within the website and the URL jump sequences by which they were accessed;
204: identifying the URL of the target webpage in the list, and recording the corresponding URL jump sequence;
205: judging whether a URL jump sequence has been recorded for at least one target webpage among the identified target webpage URLs; if so, proceeding to 206; otherwise returning to 203 to continue traversing the target website and recording URL jump sequences;
206: in the recorded URL jump sequences of the target webpage, splicing the URLs directly into a character string in the order in which they are visited from the target website to the target webpage, thereby generating the URL jump sequences for accessing the target webpage;
207: using a greedy randomized adaptive search algorithm to select the shortest merged URL jump sequence for accessing the target webpage, and recording it as the optimal URL jump sequence from the target website to the target webpage;
208: testing the generated URL jump sequence, using the minimum number of jumps as the criterion for verifying whether the sequence has reached the optimization standard; if so, proceeding to 209; otherwise returning to 207 to continue generating optimized URL jump sequences;
209: issuing an access request to the target webpage, and loading the page information of the target webpage;
210: classifying the page information of the target webpage by structure, that is, recording all accessible data items on the target webpage together with the data item tag corresponding to each data item, and classifying the data item tags according to their expanded hierarchical structure; the classified data items and data item tag information are recorded as preprocessing for traversing the target data;
211: traversing all recorded data item information to search for the target data items, and recording all tags corresponding to the target data; the set of all data item tags belonging to a data item constitutes its tag structure information;
212: counting the recorded target data item tags; if all target data item tags have been found, proceeding to 213; if any are still missing after the search, returning to 211 to continue traversing the tag data;
213: extracting the structure information of the tags from the recorded tag data, and recording the tag structure information;
214: classifying the recorded tag structure information again, that is, grouping it by similarity according to the expanded hierarchical structure of the data item tags;
215: in the classified tag structure information, selecting and merging tag structures according to the adaptive principle, merging identical paths, and converting the tag structure into tree form to generate the shortest path;
216: verifying the generated shortest path by comparison, using the shallowest depth (i.e. path length) of the tag structure information as the evaluation criterion, where the criterion is met by comparing the depths of the paths with one another; if the test standard is met, proceeding to 217; otherwise returning to 213 to continue extracting tag structure information;
217: on the basis of the optimal URL jump sequence and the shortest path, starting the data crawling process again and accessing the series of target webpages from which data are to be extracted.
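Steps 213 to 216 can be sketched as follows, under the assumption that each target data item already has one or more recorded tag paths, represented as tuples of tag names. The sketch keeps the shallowest path per item, merges the paths into a tree so that shared prefixes are stored once, and reports the prefix common to all kept paths. The function names and sample paths are illustrative, and the patent's "adaptive principle" for merging is not specified in enough detail to reproduce, so plain prefix merging is used here instead.

```python
def shortest_path_per_item(candidates):
    """For each data item, keep the candidate tag path with the smallest depth."""
    return {item: min(paths, key=len) for item, paths in candidates.items()}


def build_tag_tree(paths):
    """Merge tag paths into a nested-dict tree; identical prefixes are merged."""
    tree = {}
    for path in paths:
        node = tree
        for tag in path:
            node = node.setdefault(tag, {})
    return tree


def common_prefix(paths):
    """Longest tag prefix shared by all paths."""
    prefix = []
    for tags in zip(*paths):
        if len(set(tags)) != 1:
            break
        prefix.append(tags[0])
    return tuple(prefix)


# Hypothetical recorded tag paths for two target data items.
candidates = {
    "name": [
        ("html", "body", "div", "section", "div", "h1"),
        ("html", "body", "div", "h1"),
    ],
    "department": [("html", "body", "div", "p", "span")],
}
best_paths = shortest_path_per_item(candidates)
tree = build_tag_tree(best_paths.values())
shared = common_prefix(list(best_paths.values()))
```

Comparing path depths with `min(paths, key=len)` plays the role of the depth-based evaluation criterion; the shared prefix identifies the container that only needs to be walked once when several items are extracted from the same page.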
Taking a test website (the Good Doctor website) as an example:
Table 1. Comparison of results of the various web data extraction methods.
As can be seen from Table 1, compared with the conventional methods, the method of the present invention needs fewer hosts and processes when extracting Web structured data, accesses fewer webpages and extracts fewer data records, takes less time to crawl, and achieves higher storage utilization. The method of the present invention can therefore extract Web structured data rapidly.
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as defined by the claims.

Claims (10)

1. A method for rapidly extracting Web structured data, comprising the steps of:
traversing, level by level, the website links in the pages of a target website, and recording in list form the URLs of all sites and the corresponding URL jump sequences;
identifying the URL of the target webpage in the list and recording the corresponding URL jump sequences, and splicing the URLs in the order in which they are visited from the target website to the target webpage to generate URL jump sequences for accessing the target webpage;
finding the optimal URL jump sequence for accessing the target webpage according to a greedy randomized adaptive search algorithm;
accessing the target webpage according to the optimal URL jump sequence, traversing all data item tags of the target page level by level, and finding all target data item tags;
finding the shortest tag path among all the target data item tags;
crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.
2. The method according to claim 1, characterized in that, before the traversal, the attributes of the Web structured data to be extracted are confirmed, including the URL of the target website, the URL of the target webpage and the page tags of the target webpage.
3. The method according to claim 1, characterized in that, if no URL jump sequence of a target webpage has been recorded among the identified target webpage URLs, traversal of the website links in the pages of the target website continues.
4. The method according to claim 1, characterized in that the minimum number of jumps is used to verify whether the URL jump sequence for accessing the target webpage has reached the optimization standard; if not, optimized URL jump sequences continue to be generated until the optimal URL jump sequence for accessing the target webpage is obtained.
5. The method according to claim 1, characterized in that, when the target webpage is accessed, all accessible data items on the target webpage are recorded together with the data item tag corresponding to each data item, the data item tags are classified according to their expanded hierarchical structure, and the classified data items and data item tag information are recorded.
6. The method according to claim 1, characterized in that the number of target data item tags found is counted, and if not all target data item tags have been found, traversal of the data item tags of the target page continues.
7. The method according to claim 1, characterized in that the shortest path is found by grouping all target data item tags found by similarity according to their expanded hierarchical structure, merging identical paths according to the adaptive principle, and converting the target data item tags into tree form to generate the shortest path.
8. The method according to claim 1, characterized in that an evaluation criterion is set, the path lengths of all target data item tags are compared, and the target data item tag path whose length is shortest, and which thereby meets the criterion, is taken as the shortest path.
9. A system for rapidly extracting Web structured data, comprising:
a URL logic generation module, configured to summarize the relationship of the target webpage within the target website and find the optimal URL jump sequence for accessing the target webpage;
a tag path generation module, configured to match the page tags of the target webpage and find the shortest path to the target data item tags;
a process webpage access module, configured to access the target webpage according to the optimal URL jump sequence;
a thread data extraction module, configured to extract the target data from the target webpage according to the shortest path.
10. The system according to claim 9, characterized by further comprising a target determination module, configured to confirm the attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage and the page tags of the target webpage.
CN201910021939.9A 2019-01-10 2019-01-10 Method and system for rapidly extracting Web structured data Active CN109948018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910021939.9A CN109948018B (en) 2019-01-10 2019-01-10 Method and system for rapidly extracting Web structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910021939.9A CN109948018B (en) 2019-01-10 2019-01-10 Method and system for rapidly extracting Web structured data

Publications (2)

Publication Number Publication Date
CN109948018A true CN109948018A (en) 2019-06-28
CN109948018B CN109948018B (en) 2021-05-25

Family

ID=67007933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910021939.9A Active CN109948018B (en) 2019-01-10 2019-01-10 Method and system for rapidly extracting Web structured data

Country Status (1)

Country Link
CN (1) CN109948018B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274515A (en) * 2020-03-24 2020-06-12 湘潭大学 Webpage data extraction method based on html tag matching

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130198240A1 (en) * 2012-01-27 2013-08-01 Qatar Foundation Social Network Analysis
CN106777281A (en) * 2016-12-29 2017-05-31 深圳市华傲数据技术有限公司 For improving web crawlers stability, the data processing method of availability and device
CN106775611A (en) * 2016-09-05 2017-05-31 中国人民财产保险股份有限公司 The implementation method of the self adaptation dynamic web page crawler system based on machine learning
CN108009202A (en) * 2017-11-01 2018-05-08 昆明理工大学 A kind of Web page classifying sequence dynamic reptile method based on viterbi algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Qingsheng et al.: "An adaptive topic crawler algorithm based on link and content analysis", Computer and Modernization *

Also Published As

Publication number Publication date
CN109948018B (en) 2021-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant