CN109948018A - Method and system for rapidly extracting Web structured data - Google Patents
- Publication number
- CN109948018A (application CN201910021939.9A)
- Authority
- CN
- China
- Prior art keywords
- target
- network address
- data item
- target webpage
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The present invention is a method for rapidly extracting Web structured data. Its steps are: traverse the links in the pages of the target website level by level, and record in a list the URLs of all websites together with the corresponding URL jump sequences; identify the URLs of the target webpages in that list, record the corresponding URL jump sequences, and splice the URLs in the order in which they are visited from the target website to the target webpage, generating the URL jump sequences for accessing the target webpage; find the optimal URL jump sequence for accessing the target webpage with a greedy randomized adaptive search algorithm; access the target webpage along the optimal URL jump sequence, traverse all data item tags of the target page level by level, and find all target data item tags; find the shortest tag path among the target data item tags; and crawl the target data of the target webpage according to the optimal URL jump sequence and the shortest path. The invention also provides a system for rapidly extracting Web structured data.
Description
Technical field
The present invention relates to web data crawling, and to the fields of website and webpage structure and file access algorithms. It is a method for extracting structured data from Web pages that combines an adaptive algorithm with tag data matching, and it can extract specified web data items accurately and quickly.
Background
Existing approaches to web data acquisition fall into two main types. The first downloads all pages and then extracts the data locally. The second extracts the target data directly from the page and stores it locally.
In the first approach (download everything, then extract locally), every target page on the target website is accessed and, once the page's response is received, all of its content is stored. After all page content has been saved to local storage, the target data items are filtered and extracted locally. The advantages of this approach are strong compatibility and stable performance. It reflects an industrial mindset and is the main approach currently used by large companies such as 360 and JD Cloud.

Compatibility here means that distributed page access and acquisition are easy to implement: thousands of hosts can access the target pages of a target website at the same time. It is a brute-force approach that needs little URL path optimization and compensates with the number of distributed hosts. Each host can request a different target page at the same time, which saves page loading time and also lets the approach be applied to different target websites without modification.

Stability means that all pages containing the data items have already been downloaded before any target data item is extracted, so the structure, fields, and other logic of the data items can be handled locally. Staff mark the target data items locally, and a local server then processes them in one batch to obtain the target data.

The drawback of this approach is its high barrier to entry: it needs a large machine cluster, high-speed bandwidth, a large amount of local storage, and substantial computing power. It also wastes considerable resources. Large amounts of page data unrelated to the target data consume bandwidth and traffic to download, and local storage must hold large amounts of page content unrelated to the extracted data.
In the second approach (direct extraction), data is filtered while the page is being accessed, and the target data is extracted in one pass and stored locally; the Scrapy framework is an example. The server only needs to enter the website being crawled, open the target webpage, and extract the target information from the page content as the page loads. Direct extraction makes better use of network bandwidth and local storage, but it is less compatible and less stable than the first approach. Because target data items are extracted while the page loads, pages with different structures need different extraction logic. Pages with the same structure can be handled together, but for pages with different structures, such as those of different websites, the extraction logic must be rebuilt. Because extraction depends on the page loading, a failed page load means a failed extraction, so extraction performance is unstable. In addition, page load latency and page filtering time are both added to the extraction time of each record, so each record is extracted more slowly.
In summary, the two approaches above suffer from wasted bandwidth and traffic, wasted local storage, low compatibility, and low stability, and neither has been combined with the other to overcome these problems.
Summary of the invention
To address the problem of web data extraction and overcome the shortcomings of the existing methods described above, the present invention proposes a method and system for rapidly extracting Web structured data. The method combines an adaptive algorithm with tag data matching and can extract specified web data items accurately and quickly.
To solve the above technical problem, the present invention adopts the following technical solution:
A method for rapidly extracting Web structured data, comprising the steps of:
1) traversing, level by level, the links in the pages of the target website to be accessed, and recording all URLs and their URL jump sequences (i.e. the URL logic) in a list;
2) identifying the URLs of the target webpages in that list;
3) determining, by matching, the position of each URL of the website containing the target webpage in the jump sequence list;
4) repeatedly accessing the target webpages in the jump sequence list, and recording the optimal URL jump sequence according to a greedy randomized adaptive search algorithm;
5) accessing the target webpage along the optimal URL jump sequence, and traversing all data item tags of the target page level by level;
6) finding all target data item tags;
7) comparing the target data item tags to find the shortest path;
8) crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.
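As a rough illustration of steps 1) to 3), the sketch below performs a level-by-level (breadth-first) traversal over a small in-memory link graph and records, for each URL, the jump sequence by which it was first reached. The graph, the example.com URLs, and the function names `level_traverse` and `find_target_sequences` are assumptions made for illustration only. The patent does not specify an implementation, and a real crawler would fetch and parse pages rather than read a dictionary.

```python
from collections import deque

# Hypothetical link graph standing in for a real website:
# each key is a page URL, each value lists the URLs that page links to.
SITE_LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1", "https://example.com/b"],
    "https://example.com/b": ["https://example.com/target"],
    "https://example.com/a/1": ["https://example.com/target"],
    "https://example.com/target": [],
}


def level_traverse(links, start):
    """Breadth-first walk from `start`; return {url: jump sequence from start}."""
    jump_sequences = {start: [start]}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in jump_sequences:  # record each URL once, at first discovery
                jump_sequences[nxt] = jump_sequences[page] + [nxt]
                queue.append(nxt)
    return jump_sequences


def find_target_sequences(jump_sequences, target_urls):
    """Keep only the recorded jump sequences that end at a target page."""
    return {url: seq for url, seq in jump_sequences.items() if url in target_urls}


sequences = level_traverse(SITE_LINKS, "https://example.com/")
targets = find_target_sequences(sequences, {"https://example.com/target"})
```

Because the walk is breadth-first, the sequence recorded for each URL is one with the fewest jumps from the start page in this toy graph. The patent additionally records several sequences per target and optimizes among them, which this sketch does not attempt.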
A system for rapidly extracting Web structured data, comprising:
a target determination module, for confirming the attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage, and the page tags of the target webpage;
a URL logic generation module, for summarizing the relationships of the target webpages within the target website and finding the optimal URL jump sequence for accessing the target webpage;
a tag path generation module, for matching the page tags of the target webpage and finding the shortest path to the target data item tags;
a process-based webpage access module, for accessing the target webpage along the optimal URL jump sequence;
a thread-based data extraction module, for extracting the target data from the target webpage according to the shortest path.
Given a target website, target URLs, and target data, the present invention accesses the website's pages heuristically to parse the structure of the target website, and then uses an adaptive algorithm to work out the URL jump sequences between the target URLs in the site. After entering a target webpage, it automatically matches the page's data tags against the data item tags of the target data, finds the shortest path to the target data item tags, and extracts the target data. With the URL jump sequence and the data extraction path both optimized, the invention can either first download the target webpage and then extract the data locally, or extract the target data directly from the target webpage and store it locally. In either case its data extraction efficiency is much higher than that of the two traditional approaches.
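The patent names a greedy randomized adaptive search algorithm for choosing the URL jump sequence but gives no further detail. The following is a simplified sketch patterned on that idea, not the patented procedure: each iteration picks a recorded jump sequence at random from a restricted candidate list of the shortest ones, then applies a local improvement that skips intermediate pages whenever an earlier page links directly to a later one. The link graph, the candidate sequences, and the names `shortcut` and `grasp_select` are hypothetical, and the adaptive re-ranking of a full GRASP is omitted.

```python
import random

# Hypothetical link graph and recorded jump sequences ending at the target "/t".
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/t"],
    "/a/1": ["/t"],
    "/b": ["/c"],
    "/c": ["/d"],
    "/d": ["/t"],
}
CANDIDATES = [
    ["/", "/a", "/a/1", "/t"],
    ["/", "/b", "/c", "/d", "/t"],
]


def shortcut(sequence, links):
    """Local search: from each page, jump to the furthest later page in the
    sequence that it links to directly, dropping the pages in between."""
    improved = [sequence[0]]
    i = 0
    while i < len(sequence) - 1:
        nxt = i + 1  # fall back to the recorded next hop
        for j in range(len(sequence) - 1, i, -1):
            if sequence[j] in links.get(sequence[i], []):
                nxt = j
                break
        improved.append(sequence[nxt])
        i = nxt
    return improved


def grasp_select(candidates, links, iterations=20, rcl_size=2, seed=0):
    """Randomized greedy pick plus local search, keeping the shortest result."""
    if not candidates:
        raise ValueError("no candidate jump sequences recorded")
    rng = random.Random(seed)
    rcl = sorted(candidates, key=len)[:rcl_size]  # restricted candidate list
    best = None
    for _ in range(iterations):
        local = shortcut(rng.choice(rcl), links)  # construct, then improve
        if best is None or len(local) < len(best):
            best = local
    return best


best_sequence = grasp_select(CANDIDATES, LINKS)
```

Using the number of jumps as the quality measure follows the patent's stated criterion of a minimum number of jumps. The size of the restricted candidate list and the iteration count are arbitrary choices here.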
Description of the drawings
Fig. 1 is a framework diagram of the Web structured data rapid extraction system in an embodiment.
Fig. 2 is a flowchart of the Web structured data rapid extraction method in an embodiment.
Specific embodiment
To make the above features and advantages of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
Fig. 1 is a framework diagram of the Web structured data rapid extraction system proposed in this embodiment, which comprises the following modules:
102: the target determination module, which confirms the relevant attributes of the target data to be extracted, including the URL of the target website to be crawled, the URL of the target webpage, and the page tags of the target webpage;
103: the URL logic generation module, which parses the structure of the target website, summarizes the relationships of the target webpages within it, and provides the optimal URL jump sequence for accessing the target webpage, so that the target webpage can be reached quickly;
104: the tag path generation module, which matches the page tags of the target webpage and finds the shortest path to the target data item tags, so that the target data can be extracted quickly;
105: the process-based webpage access module, in which a process, once the optimal URL jump sequence has been obtained, issues an access request to the target webpage and loads the page;
106: the thread-based data extraction module, in which, once the process has loaded the target webpage, threads extract the target data from the page along the shortest path provided by the tag path generation module.
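The division of work between modules 105 and 106 can be pictured with the minimal sketch below, which loads a page once and then lets several threads each read one data item along its tag path. Everything concrete here is an assumption for illustration: the page content, the tag paths, the sample values, and the function names are hypothetical, and the standard-library `xml.etree.ElementTree` parser only accepts well-formed markup, so a real page would usually need a more tolerant HTML parser.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page content standing in for a fetched target page.
PAGE = """<html><body>
  <div><h1>Dr. Example</h1></div>
  <div><ul><li>Cardiology</li><li>City Hospital</li></ul></div>
</body></html>"""

# Hypothetical tag paths, as a tag path generation module might supply them.
TAG_PATHS = {
    "name": "body/div/h1",
    "department": "body/div/ul/li[1]",
    "hospital": "body/div/ul/li[2]",
}


def load_page(source):
    """Stands in for module 105: load the target page once."""
    return ET.fromstring(source)


def extract_item(root, path):
    """Stands in for module 106: read one data item along its tag path."""
    return root.findtext(path)


def extract_all(root, tag_paths, workers=3):
    """Submit one extraction task per data item and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(extract_item, root, path)
                   for name, path in tag_paths.items()}
        return {name: future.result() for name, future in futures.items()}


record = extract_all(load_page(PAGE), TAG_PATHS)
```

The threads here only read an already parsed tree, so the sketch shows the structure of the design rather than any speed-up.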
Fig. 2 is a flowchart of the Web structured data rapid extraction method proposed in this embodiment, which comprises the following steps:
202: determine the URL of the target website and the URL of the target webpage according to a given format, then splice the two URLs in order and input them. The format can be customized by the user and is extensible, for example direct splicing with https as the delimiting keyword;
203: traverse the links in the target website's pages level by level, and record in a list the URLs within the site and the URL jump sequences by which they are reached;
204: identify the URLs of the target webpages in the list, and record the corresponding URL jump sequences;
205: check whether a URL jump sequence has been recorded for at least one of the identified target webpage URLs; if so, go to 206; otherwise return to 203 to continue traversing the target website and recording URL jump sequences;
206: for each recorded target webpage jump sequence, splice the URLs directly into a string in the order in which they are visited from the target website to the target webpage, generating the URL jump sequence for accessing the target webpage;
207: use the greedy randomized adaptive search algorithm to select the shortest merged URL jump sequence for accessing the target webpage, take it as the optimal URL jump sequence from the target website to the target webpage, and record it;
208: test the generated URL jump sequence, using the minimum number of jumps as the criterion for whether it meets the optimization standard; if it does, go to 209; otherwise return to 207 to continue generating optimized URL jump sequences;
209: send an access request to the target webpage and load its page content;
210: classify the structure of the target webpage's content: record every accessible data item on the target webpage together with its corresponding data item tags, and classify the data item tags by the hierarchical structure in which they expand. Record the classified data items and data item tag information as preprocessing for the traversal of the target data;
211: traverse all recorded data item information to search for the target data items, and record all tags corresponding to the target data. The set of all data item tags belonging to a data item is its tag structure information;
212: count the recorded target data item tags; if all target data item tags have been found, go to 213; if any are still missing after the search, return to 211 to continue traversing the tag data;
213: extract the structure information of the tags from the recorded tag data, and record it;
214: classify the recorded tag structure information again, grouping by similarity according to the hierarchical structure in which the data item tags expand;
215: from the classified tag structure information, select and merge tag structures using the adaptive principle, merging identical paths, and convert the tag structure into a tree to generate the shortest path;
216: test and verify the generated shortest path against the criterion of shallowest tag structure depth (i.e. path length), which is checked by comparing the depths of the paths with one another; if the criterion is met, go to 217; otherwise return to 213 to continue extracting tag structure information;
217: with the optimal URL jump sequence and the shortest path obtained, start the data crawling process again and access the series of target webpages from which data is to be extracted.
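Steps 210 to 216 can be illustrated with the small sketch below, which walks a page's tags level by level, collects the tag paths that lead to a target data item, merges paths that share a prefix into a tree, and picks the shallowest path. This is one possible reading of the steps, not the patent's implementation. The page content, the matching of targets by element text, and all function names are assumptions, and `xml.etree.ElementTree` is used only because the sample markup is well-formed.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Hypothetical page in which the target text "price" appears at two depths.
PAGE = """<html><body>
  <div><span>price</span></div>
  <div><section><p><span>price</span></p></section></div>
  <div><span>title</span></div>
</body></html>"""


def level_traverse_tags(root):
    """Breadth-first walk over elements; yield (tag path, stripped text)."""
    queue = deque([(root, (root.tag,))])
    while queue:
        node, path = queue.popleft()
        yield path, (node.text or "").strip()
        for child in node:
            queue.append((child, path + (child.tag,)))


def target_paths(root, wanted_text):
    """Tag paths of every element whose own text equals the target text."""
    return [path for path, text in level_traverse_tags(root) if text == wanted_text]


def merge_into_tree(paths):
    """Merge paths with shared prefixes into a nested-dict tree."""
    tree = {}
    for path in paths:
        node = tree
        for tag in path:
            node = node.setdefault(tag, {})
    return tree


def shortest_path(paths):
    """Shallowest tag path, using path length as the depth measure."""
    return min(paths, key=len)


root = ET.fromstring(PAGE)
paths = target_paths(root, "price")
tree = merge_into_tree(paths)
best = shortest_path(paths)
```

Merging into a tree stores the common prefix `html/body/div` once, which mirrors the "merge identical paths" wording of step 215. The depth comparison in `shortest_path` corresponds to the evaluation criterion of step 216.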
Taking a test website (the Good Doctor website) as an example:
Table 1. Comparison of results of the web data extraction methods.
As Table 1 shows, compared with the conventional methods, the method of the present invention needs fewer hosts and processes, accesses fewer webpages and extracts fewer records, takes less time to crawl, and uses storage space more efficiently when extracting Web structured data. The method of the present invention can therefore extract Web structured data quickly.
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the invention. The scope of protection of the invention is defined by the claims.
Claims (10)
1. A method for rapidly extracting Web structured data, comprising the steps of:
traversing the links in the pages of a target website level by level, and recording in a list the URLs of all websites and the corresponding URL jump sequences;
identifying the URLs of target webpages in said list, recording the corresponding URL jump sequences, and splicing the URLs in the order in which they are visited from the target website to the target webpage, to generate URL jump sequences for accessing the target webpage;
finding the optimal URL jump sequence for accessing the target webpage according to a greedy randomized adaptive search algorithm;
accessing the target webpage along said optimal URL jump sequence, traversing all data item tags of the target page level by level, and finding all target data item tags;
finding the shortest tag path among said target data item tags;
crawling the target data of the target webpage according to said optimal URL jump sequence and the shortest path.
2. The method of claim 1, characterized in that, before the traversal, the attributes of the Web structured data to be extracted are confirmed, including the URL of the target website, the URL of the target webpage, and the page tags of the target webpage.
3. The method of claim 1, characterized in that, if no URL jump sequence of a target webpage has been recorded among the identified target webpage URLs, traversal of the links in the target website's pages continues.
4. The method of claim 1, characterized in that the minimum number of jumps is used to verify whether the URL jump sequence for accessing the target webpage meets the optimization standard; if it does not, optimized URL jump sequences continue to be generated until the optimal URL jump sequence for accessing the target webpage is obtained.
5. The method of claim 1, characterized in that, when the target webpage is accessed, every accessible data item on the target webpage is recorded together with its corresponding data item tags, the data item tags are classified by the hierarchical structure in which they expand, and the classified data items and data item tag information are recorded.
6. The method of claim 1, characterized in that the target data item tags found are counted, and if not all target data item tags have been found, traversal of the data item tags of the target page continues.
7. The method of claim 1, characterized in that the shortest path is found by grouping all target data item tags found by similarity according to the hierarchical structure in which they expand, merging identical paths using the adaptive principle, and converting the target data item tags into a tree, thereby generating the shortest path.
8. The method of claim 1, characterized in that an evaluation criterion is set, the path lengths of all target data item tags are compared, and the target data item tag path whose length is shortest under that criterion is taken as the shortest path.
9. A system for rapidly extracting Web structured data, comprising:
a URL logic generation module, for summarizing the relationships of the target webpages within the target website and finding the optimal URL jump sequence for accessing the target webpage;
a tag path generation module, for matching the page tags of the target webpage and finding the shortest path to the target data item tags;
a process-based webpage access module, for accessing the target webpage along the optimal URL jump sequence;
a thread-based data extraction module, for extracting the target data from the target webpage according to the shortest path.
10. The system of claim 9, characterized in that it further comprises a target determination module, for confirming the attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage, and the page tags of the target webpage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910021939.9A CN109948018B (en) | 2019-01-10 | 2019-01-10 | Method and system for rapidly extracting Web structured data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910021939.9A CN109948018B (en) | 2019-01-10 | 2019-01-10 | Method and system for rapidly extracting Web structured data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109948018A true CN109948018A (en) | 2019-06-28 |
CN109948018B CN109948018B (en) | 2021-05-25 |
Family
ID=67007933
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910021939.9A Active CN109948018B (en) | 2019-01-10 | 2019-01-10 | Method and system for rapidly extracting Web structured data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948018B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111274515A (en) * | 2020-03-24 | 2020-06-12 | 湘潭大学 | Webpage data extraction method based on html tag matching |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130198240A1 (en) * | 2012-01-27 | 2013-08-01 | Qatar Foundation | Social Network Analysis |
CN106777281A (en) * | 2016-12-29 | 2017-05-31 | 深圳市华傲数据技术有限公司 | For improving web crawlers stability, the data processing method of availability and device |
CN106775611A (en) * | 2016-09-05 | 2017-05-31 | 中国人民财产保险股份有限公司 | The implementation method of the self adaptation dynamic web page crawler system based on machine learning |
CN108009202A (en) * | 2017-11-01 | 2018-05-08 | 昆明理工大学 | A kind of Web page classifying sequence dynamic reptile method based on viterbi algorithm |
Non-Patent Citations (1)
Title |
---|
朱庆生等: "《一种基于链接和内容分析的自适应主题爬虫算法》", 《计算机与现代化》 * |
Also Published As
Publication number | Publication date |
---|---|
CN109948018B (en) | 2021-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102036419B1 (en) | Multi-level caching method for improving graph processing performance, and multi-level caching system | |
CN104794228B (en) | A kind of search result provides method and device | |
CN105491327B (en) | A kind of video frequency tracking method and device based on road network | |
US10185771B2 (en) | Method and system for scheduling web crawlers according to keyword search | |
CN104331446A (en) | Memory map-based mass data preprocessing method | |
CN103077250B (en) | A kind of capturing webpage contents method and device | |
CN109104421B (en) | Website content tampering detection method, device, equipment and readable storage medium | |
CN106033428B (en) | The selection method of uniform resource locator and the selection device of uniform resource locator | |
CN103258017B (en) | A kind of parallel square crossing network data acquisition method and system | |
CN103020123A (en) | Method for searching bad video website | |
CN109547294B (en) | Networking equipment model detection method and device based on firmware analysis | |
CN106484815B (en) | A kind of automatic identification optimization method based on mass data class SQL retrieval scene | |
CN109669858A (en) | Test method, device and the equipment of program deadlock | |
CN106844553A (en) | Data snooping and extending method and device based on sample data | |
CN105335245B (en) | Failed storage method and apparatus, trouble shoot method and apparatus | |
CN111061972A (en) | AC searching optimization method and device for URL path matching | |
CN109948018A (en) | A kind of Web structural data rapid extracting method and system | |
CN103455491A (en) | Method and device for classifying search terms | |
CN108334549A (en) | A kind of device data storage method, extracting method, storage platform and extraction platform | |
CN103077196B (en) | A kind of access method from public network WEB website to intranet data storehouse | |
CN107590233B (en) | File management method and device | |
CN109857924A (en) | A kind of big data analysis monitor information processing system and method | |
Guo | Research on web data mining based on topic crawler | |
CN103440454A (en) | Search engine keyword-based active honeypot detection method | |
CN108009219A (en) | A kind of method for finding internet finance public sentiment regulatory target |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||