CN109948018B - Method and system for rapidly extracting Web structured data - Google Patents

Method and system for rapidly extracting Web structured data

Info

Publication number
CN109948018B
Authority
CN
China
Prior art keywords
target
website
data item
webpage
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910021939.9A
Other languages
Chinese (zh)
Other versions
CN109948018A (en)
Inventor
刘芳作
陈薇
王腾蛟
李强
邱镇
崔迎宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Peking University
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, State Grid Zhejiang Electric Power Co Ltd
Priority to CN201910021939.9A
Publication of CN109948018A
Application granted
Publication of CN109948018B

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method for rapidly extracting Web structured data, comprising the following steps: hierarchically traversing the URL links in the pages of a target website, and recording all URLs in the website and their corresponding URL jump sequences in a list; identifying the URL of the target webpage in the list, recording the corresponding URL jump sequences, and splicing the URLs in order of access from the target website to the target webpage to generate URL jump sequences for accessing the target webpage; finding the optimal URL jump sequence for accessing the target webpage using a greedy randomized adaptive search algorithm; accessing the target webpage according to the optimal URL jump sequence, and hierarchically traversing all data item tags of the target webpage to find all target data item tags; finding the shortest tag path among all the target data item tags; and crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path. The invention also provides a system for rapidly extracting Web structured data.

Description

Method and system for rapidly extracting Web structured data
Technical Field
The invention relates to the field of website and webpage structures and file access algorithms, and in particular to a method for extracting structured data from webpages using an adaptive algorithm and tag data matching. The method concerns webpage data crawling and can accurately and rapidly extract specified webpage data items.
Background
Existing webpage data acquisition falls into two main modes: one acquires the whole webpage and then extracts the data locally; the other extracts the target data directly from the page and stores it locally.
In the first mode, all target pages of the target website are accessed, and all information on each target page is stored once the page's response is received. After all page information has been saved to local storage, the target data items are filtered and extracted locally. This mode offers strong compatibility and stable performance, reflects an industrial-scale approach, and is the mainstream mode currently used by large companies such as 360 and JD Cloud (Jingdong Cloud). Its compatibility comes from the ease of distributing page access and acquisition: thousands of hosts can access the target webpages of the target website simultaneously. It is a brute-force approach that needs little URL path optimization and compensates with the number of distributed hosts. Because each host can request a different target page at the same time, page loading time is saved, and the mode can be applied to different target websites without modification. Its stability comes from the fact that all webpages containing the data items have been downloaded locally before any target data item is extracted, so the logic for the structure and fields of the data items can be handled locally: after workers mark the target data items locally, the local server performs unified computation to obtain the target data. The disadvantage of this mode is its high threshold: it requires large machine clusters, high-speed bandwidth, a large amount of local storage, and strong computing power.
It also wastes considerable resources: a large amount of webpage data irrelevant to the target data consumes bandwidth and traffic when downloaded, and a large amount of local storage must be provided to hold webpage information irrelevant to the extracted data.
In the second mode, data filtering is performed directly while the page is being accessed, and the target data is stored locally as soon as it is extracted, as in frameworks such as Scrapy. The server opens the target webpage only after entering the website being crawled, and extracts the target information from the page information of the loaded target webpage. This mode makes efficient use of network bandwidth and local storage, but its compatibility and stability are inferior to those of the first mode. Because extraction of the target data items is completed while the page loads, the extraction logic differs for webpages with different structures: pages with the same structure can share the logic, but for different page structures, for example on different websites, the extraction logic must be rebuilt. Because extraction depends on page loading, a failed page load also causes the extraction to fail, which makes extraction performance unstable. In addition, the page loading delay and the page data filtering time are both added to the extraction time of each data item, so extraction of a single item is slow.
In summary, the two modes suffer from wasted bandwidth and traffic, wasted local storage, low compatibility, and low stability, and neither has been combined with the other to remedy these problems.
Disclosure of Invention
To address the problem of webpage data extraction and overcome the shortcomings of existing methods, the invention provides a method and a system for rapidly extracting Web structured data.
To solve the above technical problems, the invention adopts the following technical solution:
A method for rapidly extracting Web structured data comprises the following steps:
1) hierarchically traversing the URL links in the pages of the target website to be accessed, and recording all URLs and URL jump sequences (i.e. the URL logic) in a list;
2) identifying the URL of the target webpage in the list;
3) determining, by matching, the position of the target webpage's URL in each URL jump sequence in the list;
4) accessing the target webpage multiple times via the URL jump sequences in the list, and recording the optimal URL jump sequence according to a greedy randomized adaptive search algorithm;
5) accessing the target webpage according to the optimal URL jump sequence, and hierarchically traversing all data item tags of the target webpage;
6) finding all target data item tags;
7) comparing the target data item tags to find the shortest path;
8) crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.
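Steps 1) to 3) can be illustrated with a minimal sketch. It assumes that "hierarchical" traversal means a level-by-level (breadth-first) walk, and it replaces real page fetches with a hypothetical in-memory link graph; the names `LINKS`, `traverse`, and `sequences_to` and all URLs are illustrative and do not come from the patent.

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetches (assumption):
# each URL maps to the URLs linked from that page.
LINKS = {
    "https://site.example/": ["https://site.example/a", "https://site.example/b"],
    "https://site.example/a": ["https://site.example/a/doctor"],
    "https://site.example/b": ["https://site.example/a/doctor", "https://site.example/b/x"],
    "https://site.example/a/doctor": [],
    "https://site.example/b/x": [],
}


def traverse(start, links, max_depth=5):
    """Walk the site level by level; return a list of (url, jump_sequence) records."""
    records = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        records.append((path[-1], path))       # record the URL and how it was reached
        if len(path) - 1 >= max_depth:         # stop expanding below the depth limit
            continue
        for nxt in links.get(path[-1], []):
            if nxt not in path:                # avoid cycles within one jump sequence
                queue.append(path + [nxt])
    return records


def sequences_to(target, records):
    """Pick out every recorded jump sequence that ends at the target page."""
    return [seq for url, seq in records if url == target]


records = traverse("https://site.example/", LINKS)
to_target = sequences_to("https://site.example/a/doctor", records)
```

Here `to_target` holds two candidate jump sequences, one through `/a` and one through `/b`; choosing between such candidates is the job of step 4).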
A system for rapidly extracting Web structured data comprises:
a target determining module, configured to confirm the attributes of the target data to be extracted, including confirming the URL of the target website, the URL of the target webpage, and the page tag of the target webpage;
a URL logic generation module, configured to derive the relationship of the target webpage within the target website and find the optimal URL jump sequence for accessing the target webpage;
a tag path generation module, configured to match the page tags of the target webpage and find the shortest path of the target data item tags;
a process webpage access module, configured to access the target webpage according to the optimal URL jump sequence;
and a thread data extraction module, configured to extract the target data on the target webpage according to the shortest path.
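As a hedged illustration of the target determining module: the embodiment mentions that the address of the target website and the address of the target webpage may be spliced into one input, with `https` as the dividing keyword. The sketch below parses such an input under that assumption. The `Target` record, the function name, and the sample tag value are hypothetical, and the simple split would misbehave if `https` also occurred inside a URL path or query.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Attributes confirmed before crawling (field names are illustrative)."""
    site_url: str
    page_url: str
    page_tag: str


def parse_spliced_input(spliced, page_tag, keyword="https"):
    """Split 'site URL + page URL' on the dividing keyword and restore each URL."""
    parts = [keyword + p for p in spliced.split(keyword) if p]
    if len(parts) != 2:
        raise ValueError("expected exactly two URLs, got %d" % len(parts))
    return Target(site_url=parts[0], page_url=parts[1], page_tag=page_tag)


target = parse_spliced_input(
    "https://site.example/https://site.example/a/doctor",
    page_tag="span",
)
```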
Given a determined target website, target webpage, and target data, the method analyzes the structure of the target website by tentatively accessing its pages, then derives the URL jump sequences to the target webpage by means of an adaptive algorithm. After entering the target webpage, it automatically matches the page data tags of the webpage against the data item tags of the target data, then searches for the shortest path of the target data item tags and extracts the target data. By optimizing both the URL jump sequence and the data extraction path, the method achieves a data extraction efficiency far higher than that of the two traditional modes, namely acquiring the target webpage first and then extracting the data locally, and directly extracting the target data of the target webpage and storing it locally.
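The adaptive search for a jump sequence can be sketched as follows. The patent does not spell out the algorithm's internals, so this is only one plausible reading, modelled on the greedy randomized adaptive search procedure (GRASP) from the optimization literature: each iteration builds a sequence by choosing randomly among the best-scoring next links, shortens it with a local search, and keeps the shortest result. The scoring rule, the `alpha` parameter, the link graph, and all function names are assumptions made for illustration.

```python
import random

# Hypothetical link graph (assumption): page -> pages it links to.
GRAPH = {
    "root": ["a", "b", "c"],
    "a": ["a1"],
    "a1": ["t"],
    "b": ["t"],
    "c": ["root", "a"],
    "t": [],
}


def score(url, target, links):
    """Greedy value, lower is better: 0 = target, 1 = links to target, 2 = other."""
    if url == target:
        return 0
    return 1 if target in links.get(url, []) else 2


def construct(start, target, links, alpha, rng, max_len=10):
    """Greedy randomized construction of one jump sequence (None on a dead end)."""
    path = [start]
    while path[-1] != target:
        cands = [u for u in links.get(path[-1], []) if u not in path]
        if not cands or len(path) >= max_len:
            return None
        scores = {u: score(u, target, links) for u in cands}
        lo, hi = min(scores.values()), max(scores.values())
        # Restricted candidate list: candidates close enough to the best score.
        rcl = [u for u in cands if scores[u] <= lo + alpha * (hi - lo)]
        path.append(rng.choice(rcl))
    return path


def shortcut(path, links):
    """Local search: drop intermediate pages when an earlier page links further ahead."""
    i = 0
    while i < len(path) - 2:
        for j in range(len(path) - 1, i + 1, -1):
            if path[j] in links.get(path[i], []):
                path = path[: i + 1] + path[j:]
                break
        i += 1
    return path


def grasp(start, target, links, iterations=20, alpha=0.5, seed=0):
    """Repeat construction + local search and keep the sequence with the fewest jumps."""
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        path = construct(start, target, links, alpha, rng)
        if path is None:
            continue
        path = shortcut(path, links)
        if best is None or len(path) < len(best):
            best = path
    return best


best_sequence = grasp("root", "t", GRAPH)
```

A smaller `alpha` makes the construction greedier and a larger one more random; the number of jumps in `best_sequence` (`len(best_sequence) - 1`) is the quantity the embodiment uses as its optimization criterion.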
Drawings
FIG. 1 is a block diagram of the system for rapidly extracting Web structured data in an embodiment.
FIG. 2 is a flowchart of the method for rapidly extracting Web structured data in an embodiment.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
FIG. 1 is a block diagram of the system for rapidly extracting Web structured data according to this embodiment, which comprises the following modules:
102: the target determining module, responsible for determining the relevant attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage, and the page tag of the target webpage to be crawled;
103: the URL logic generation module, responsible for analyzing the structure of the target website, deriving the relationship of the target webpage within the target website, and providing the optimal URL jump sequence so that the target webpage can be accessed quickly;
104: the tag path generation module, responsible for matching the page tags of the target webpage and finding the shortest path of the target data item tags so that the target data can be extracted quickly;
105: the process webpage access module, responsible for issuing an access request to the target webpage and loading the page once the process has obtained the optimal URL jump sequence;
106: the thread data extraction module, in which a thread, once the process has loaded the target webpage, extracts the target data on the target webpage according to the shortest path given by the tag path generation module.
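Modules 104 and 106 both depend on knowing where each data item sits in the page's tag hierarchy. A minimal sketch of recording that information with Python's standard `html.parser` is shown below; the sample page, the class name, and the choice to represent a tag path as a `/`-joined string are assumptions for illustration, not details from the patent.

```python
from html.parser import HTMLParser

# Hypothetical page content (assumption); a real crawler would load it over HTTP.
PAGE = """
<html><body>
<div class="card"><h2>Dr. Li</h2><p><span>Cardiology</span></p></div>
<div class="card"><h2>Dr. Wang</h2><p><span>Neurology</span></p></div>
</body></html>
"""

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without a closing tag


class TagPathCollector(HTMLParser):
    """Records (tag_path, text) for every non-empty text node on the page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def classify_by_depth(items):
    """Group recorded data items by the depth of their tag path."""
    groups = {}
    for path, text in items:
        groups.setdefault(path.count("/") + 1, []).append((path, text))
    return groups


collector = TagPathCollector()
collector.feed(PAGE)
collector.close()
by_depth = classify_by_depth(collector.items)
```

Each entry in `collector.items` pairs a data item with its tag path, which is the raw material from which a shortest tag path can later be chosen.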
FIG. 2 is a flowchart of the method for rapidly extracting Web structured data according to this embodiment, which comprises the following steps:
202: determining the URL of the target website and the URL of the target webpage according to a given format, and splicing and inputting the two URLs in order; the format can be customized by the user and is extensible, for example using https as the dividing keyword for direct splicing;
203: hierarchically traversing the URL links in the pages of the target website, and recording the URLs in the website and the visited URL jump sequences in a list;
204: identifying the URL of the target webpage in the list, and recording the corresponding URL jump sequences;
205: judging whether at least one URL jump sequence has been recorded for the identified URL of the target webpage; if so, proceeding to 206; otherwise, returning to 203 to continue traversing the target website and recording URL jump sequences;
206: in each recorded URL jump sequence of the target webpage, directly splicing the URLs into a character string in order of access from the target website to the target webpage, generating a URL jump sequence for accessing the target webpage;
207: using a greedy randomized adaptive search algorithm to select and combine the shortest URL jump sequence for accessing the target webpage, taking it as the optimal URL jump sequence from the target website to the target webpage, and recording it;
208: testing the generated URL jump sequence, using the minimum number of jumps as the criterion for verifying whether the sequence meets the optimization standard; if it does, proceeding to 209; otherwise, returning to 207 to continue generating optimized URL jump sequences;
209: issuing an access request to the target webpage and loading its page information;
210: classifying the page information of the target webpage by structure, that is, recording all accessible data items on the target webpage, recording the data item tag corresponding to each data item, and classifying the data item tags by the hierarchy into which they expand; the classified data items and their tag information are recorded as preprocessing for traversing the target data;
211: traversing all recorded data item information, searching for the target data items, and recording all tags corresponding to the target data, where the set of all data item tags owned by a data item constitutes its tag structure information;
212: counting the target data item tags among the recorded tags; if all target data item tags have been found, proceeding to 213; if any are still missing after the search, returning to 211 to continue traversing the tag data;
213: extracting the tag structure information from the recorded tag data and recording it;
214: classifying the recorded tag structure information again, that is, classifying it by the similarity of the hierarchies into which the data item tags expand;
215: from the classified tag structure information, selecting the tag structures to merge using the adaptive principle, merging identical paths, converting the tag structures into a tree format, and generating the shortest path;
216: comparing, testing, and verifying the generated shortest paths, using the shallowest depth (i.e. path length) of the tag structure information as the evaluation standard, which is applied by comparing the depths of the candidate shortest paths; if a path meets the test standard, proceeding to 217; otherwise, returning to 213 to continue extracting tag structure information;
217: with the optimal URL jump sequence and the shortest path obtained, restarting the data crawling process and accessing the series of target webpages from which data is to be extracted.
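Steps 213 to 216 can be illustrated with a small sketch, under the assumption that the recorded tag structure information for a target data item is a set of alternative tag paths written as `/`-joined strings. Identical prefixes are merged into a nested-dictionary tree, and the shallowest path that ends at a target is taken as the shortest path. The sample paths, the `#target` marker key, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical tag paths recorded for the target data item (assumption).
PATHS = [
    "html/body/div/p/span",
    "html/body/div/h2",
    "html/body/div/p/span",        # duplicate path, merged away in the tree
    "html/body/table/tr/td/span",
]


def build_tag_tree(paths):
    """Merge tag paths that share prefixes into one nested-dict tree."""
    tree = {}
    for path in paths:
        node = tree
        for tag in path.split("/"):
            node = node.setdefault(tag, {})
        node["#target"] = True     # marks the end of a target data item path
    return tree


def shortest_target_path(tree):
    """Breadth-first search for the shallowest node marked as a target."""
    queue = deque([([], tree)])
    while queue:
        prefix, node = queue.popleft()
        if node.get("#target"):
            return "/".join(prefix)
        for tag, child in node.items():
            if tag != "#target":
                queue.append((prefix + [tag], child))
    return None


tree = build_tag_tree(PATHS)
shortest = shortest_target_path(tree)
```

Because the search proceeds level by level, the first target it meets has the smallest depth, which matches the "shallowest depth" evaluation standard described in step 216.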
Taking the test website (the Good Doctor website) as an example:
TABLE 1 Comparison of the results of various webpage data extraction methods
[Table 1 appears as an image in the original publication.]
As can be seen from Table 1, compared with the conventional methods, the method of the invention requires fewer hosts and processes when extracting Web structured data, accesses fewer webpages and performs fewer data extractions, takes less crawling time, and uses storage space more efficiently, so it can extract Web structured data faster.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. A person skilled in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection shall be determined by the claims.

Claims (8)

1. A method for rapidly extracting Web structured data, comprising the following steps:
hierarchically traversing the URL links in the pages of a target website, and recording all URLs in the website and their corresponding URL jump sequences in a list;
identifying the URL of the target webpage in the list, recording the corresponding URL jump sequences, and splicing the URLs in order of access from the target website to the target webpage to generate URL jump sequences for accessing the target webpage;
finding the optimal URL jump sequence for accessing the target webpage using a greedy randomized adaptive search algorithm;
accessing the target webpage according to the optimal URL jump sequence, and hierarchically traversing all data item tags of the target webpage to find all target data item tags;
classifying all the found target data item tags by the similarity of the hierarchies into which they expand, merging identical paths using an adaptive principle, converting the target data item tags into a tree format, comparing the path lengths of all the target data item tags against an evaluation standard, and taking the target data item tag path with the shortest length that meets the standard as the shortest path;
and crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.
2. The method of claim 1, wherein, before the traversal is performed, the attributes of the Web structured data to be extracted are confirmed, including confirming the URL of the target website, the URL of the target webpage, and the page tag of the target webpage.
3. The method of claim 1, wherein, if no URL jump sequence of the target webpage has been recorded for the identified URL of the target webpage, traversal of the URL links in the pages of the target website continues.
4. The method of claim 1, wherein the minimum number of jumps is used to verify whether the URL jump sequence for accessing the target webpage meets the optimization standard, and if not, optimized URL jump sequences continue to be generated until the optimal URL jump sequence for accessing the target webpage is obtained.
5. The method of claim 1, wherein, when the target webpage is accessed, all accessible data items on the target webpage are recorded, the data item tag corresponding to each data item is recorded, the data item tags are classified by the hierarchy into which they expand, and the classified data items and data item tag information are recorded.
6. The method of claim 1, wherein the number of target data item tags found is counted, and if not all target data item tags have been found, traversal of the data item tags of the target webpage continues.
7. A system for rapidly extracting Web structured data, comprising:
a URL logic generation module, configured to derive the relationship of the target webpage within the target website and find the optimal URL jump sequence for accessing the target webpage;
a tag path generation module, configured to match the page tags of the target webpage, classify all the found target data item tags by the similarity of the hierarchies into which they expand, merge identical paths using an adaptive principle, convert the target data item tags into a tree format, compare the path lengths of all the target data item tags against an evaluation standard, and take the target data item tag path with the shortest length that meets the standard as the shortest path;
a process webpage access module, configured to access the target webpage according to the optimal URL jump sequence;
and a thread data extraction module, configured to extract the target data on the target webpage according to the shortest path.
8. The system of claim 7, further comprising a target determining module configured to confirm the attributes of the target data to be extracted, including confirming the URL of the target website, the URL of the target webpage, and the page tag of the target webpage.
CN201910021939.9A 2019-01-10 2019-01-10 Method and system for rapidly extracting Web structured data Active CN109948018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910021939.9A CN109948018B (en) 2019-01-10 2019-01-10 Method and system for rapidly extracting Web structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910021939.9A CN109948018B (en) 2019-01-10 2019-01-10 Method and system for rapidly extracting Web structured data

Publications (2)

Publication Number Publication Date
CN109948018A 2019-06-28
CN109948018B 2021-05-25

Family

ID=67007933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910021939.9A Active CN109948018B (en) 2019-01-10 2019-01-10 Method and system for rapidly extracting Web structured data

Country Status (1)

Country Link
CN (1) CN109948018B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274515A (en) * 2020-03-24 2020-06-12 湘潭大学 Webpage data extraction method based on html tag matching

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777281A (en) * 2016-12-29 2017-05-31 深圳市华傲数据技术有限公司 For improving web crawlers stability, the data processing method of availability and device
CN106775611A (en) * 2016-09-05 2017-05-31 中国人民财产保险股份有限公司 The implementation method of the self adaptation dynamic web page crawler system based on machine learning
CN108009202A (en) * 2017-11-01 2018-05-08 昆明理工大学 A kind of Web page classifying sequence dynamic reptile method based on viterbi algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2498762A (en) * 2012-01-27 2013-07-31 Qatar Foundation Computing user traffic at the website based on user actions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《一种基于链接和内容分析的自适应主题爬虫算法》;朱庆生等;《计算机与现代化》;20151021(第241期);77-89 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant