CN109948018B

CN109948018B - Method and system for rapidly extracting Web structured data

Info

Publication number: CN109948018B
Application number: CN201910021939.9A
Authority: CN
Inventors: 刘芳作; 陈薇; 王腾蛟; 李强; 邱镇; 崔迎宝
Original assignee: Peking University; State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; State Grid Zhejiang Electric Power Co Ltd
Current assignee: Peking University; State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; State Grid Zhejiang Electric Power Co Ltd
Priority date: 2019-01-10
Filing date: 2019-01-10
Publication date: 2021-05-25
Anticipated expiration: 2039-01-10
Also published as: CN109948018A

Abstract

The invention relates to a method for rapidly extracting Web structured data, comprising the following steps: traversing the URL links in the pages of a target website level by level, and recording all URLs and their corresponding URL jump sequences in a list; identifying the URL of the target webpage in the list, recording the corresponding URL jump sequences, and splicing the URLs in order from the target website to the target webpage to generate URL jump sequences for accessing the target webpage; finding the optimal URL jump sequence for accessing the target webpage using a greedy randomized adaptive search algorithm; accessing the target webpage according to the optimal URL jump sequence, and traversing all data item tags of the target webpage level by level to find all target data item tags; finding the shortest tag path among all the target data item tags; and crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path. The invention also provides a system for rapidly extracting Web structured data.

Description

Method and system for rapidly extracting Web structured data

Technical Field

The invention relates to the field of website page structures and file access algorithms, and in particular to a method for extracting structured data from webpages using an adaptive algorithm and tag matching. The method applies to webpage data crawling and can accurately and quickly extract specified webpage data items.

Background

Existing approaches to webpage data acquisition fall into two main modes: one acquires the whole webpage and then extracts data locally; the other extracts the target data directly from the page and stores it locally.

In the first mode, all target pages of the target website are accessed, and the full content of each page is stored once the page response is received. After all page content has been saved to local storage, the target data items are filtered and extracted locally. This mode offers strong compatibility and stable performance. It is an industrial-scale approach and the mainstream mode used by large companies such as 360 and JD Cloud. Its compatibility comes from the ease of distributing page access and acquisition: thousands of hosts can access the target pages of the target website simultaneously. It is a brute-force approach that needs little URL path optimization and compensates with the number of distributed hosts. Because each host can request different target pages at the same time, page loading time is saved, and the approach works on different target websites without modification. Its stability comes from the fact that every webpage containing a data item has been downloaded locally before the item is extracted, so the logic for the structure and fields of the data items can be processed locally. After staff annotate the target data items locally, the local server performs unified processing to obtain the target data. The drawback is a high barrier to entry: this mode requires large machine clusters, high bandwidth, a large amount of local storage, and strong computing power. It also wastes resources considerably: a large amount of webpage data unrelated to the target data consumes bandwidth and traffic when downloaded, and a large amount of local storage must be provided to hold it.

In the second mode, data is filtered while the page is being accessed, and the target data is stored locally once it has been extracted, as is done with script frameworks and the like. The target webpage is opened only after the server enters the website being crawled, and the target information is extracted from the page content as the target webpage loads. This mode makes efficient use of network bandwidth and local storage, but its compatibility and stability are inferior to those of the first mode. Because the target data items are extracted while the page loads, the extraction logic differs for webpages with different structures: pages with the same structure can be handled by the same logic, but for a different page structure, for example a different website, the extraction logic must be rebuilt. Because extraction depends on the page loading, a failed page load also means a failed extraction, so extraction performance is unstable. In addition, the page loading delay and the page filtering time are both added to the extraction time of each data item, so extracting a single item is slow.

In summary, the two existing modes suffer from wasted bandwidth and traffic, wasted local storage, low compatibility, low stability, and similar problems, and neither has been combined with the other to offset these weaknesses.

Disclosure of Invention

The invention addresses the problem of webpage data extraction, overcomes the shortcomings of the existing methods, and provides a method and a system for rapidly extracting Web structured data.

To solve the above technical problems, the invention adopts the following technical solution:

A method for rapidly extracting Web structured data comprises the following steps:

1) traversing the URL links in the pages of the target website to be accessed level by level, and recording all URLs and URL jump sequences (that is, the URL logic) in a list;

2) identifying the URL of the target webpage in the list;

3) determining, by matching, the position of the target webpage's URL in each URL jump sequence list;

4) accessing the target webpage multiple times via the URL jump sequences in the list, and recording the optimal URL jump sequence according to a greedy randomized adaptive search algorithm;

5) accessing the target webpage according to the optimal URL jump sequence, and traversing all data item tags of the target webpage level by level;

6) finding all target data item tags;

7) comparing the target data item tags to find the shortest tag path;

8) crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.
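Steps 1) to 3) can be sketched as a breadth-first traversal. The following is only a minimal illustration under stated assumptions, not the patented implementation: it assumes the link structure has already been fetched into an in-memory dictionary, and the names `record_jump_sequences`, `max_jumps`, `LINKS`, and the page labels are all hypothetical.

```python
from collections import deque


def record_jump_sequences(links, start_url, max_jumps=3):
    """Traverse the link graph level by level from start_url.

    Returns {url: [jump sequences]}, where each jump sequence is the list
    of URLs visited on the way from start_url to that url.
    """
    sequences = {start_url: [[start_url]]}
    queue = deque([[start_url]])
    while queue:
        path = queue.popleft()
        if len(path) > max_jumps:  # this path already uses max_jumps jumps
            continue
        for nxt in links.get(path[-1], []):
            if nxt in path:  # do not revisit a URL within one sequence
                continue
            new_path = path + [nxt]
            sequences.setdefault(nxt, []).append(new_path)
            queue.append(new_path)
    return sequences


# Hypothetical site structure: page -> links found on that page.
LINKS = {
    "home": ["dept", "search"],
    "dept": ["doctor"],
    "search": ["list"],
    "list": ["doctor"],
}

sequences = record_jump_sequences(LINKS, "home")
# Step 2) and 3): look up the target page and its recorded jump sequences.
target_sequences = sequences["doctor"]
```

Because the queue is processed in first-in, first-out order, sequences with fewer jumps are recorded before longer ones, which gives the later optimization step a list that is already ordered by length.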

A system for rapidly extracting Web structured data comprises:

a target determination module, configured to confirm the attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage, and the page tags of the target webpage;

a URL logic generation module, configured to derive the relationship of the target webpage within the target website and find the optimal URL jump sequence for accessing the target webpage;

a tag path generation module, configured to match the page tags of the target webpage and find the shortest tag path to the target data items;

a process webpage access module, configured to access the target webpage according to the optimal URL jump sequence;

and a thread data extraction module, configured to extract the target data from the target webpage according to the shortest path.

Given a determined target website, target webpage, and target data, the method analyzes the structure of the target website by tentatively accessing its pages, then derives the URL jump sequences within the target website through an adaptive algorithm. After entering the target webpage, it automatically matches the page data tags of the webpage against the data item tags of the target data, searches for the shortest path to the target data item tags, and extracts the target data. With the URL jump sequence and the data extraction path both optimized, the method can either fetch the target webpage first and extract data locally, or extract the target data directly and store it locally, and in either case its data extraction efficiency is far higher than that of the two traditional modes.
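The patent names a greedy randomized adaptive search algorithm for step 4) but does not specify its cost function, candidate list, or local search. The sketch below fills those in with assumptions purely for illustration: a one-step lookahead cost, a restricted candidate list controlled by `alpha`, and a shortcut-removal local search. All function names and the `LINKS` graph are hypothetical.

```python
import random


def _cost(links, url, target):
    """Assumed greedy cost: 0 if url is the target, 1 if it links to it, else 2."""
    if url == target:
        return 0
    return 1 if target in links.get(url, []) else 2


def construct(links, start, target, alpha, rng, max_jumps=10):
    """Greedy randomized construction of one jump sequence, or None on a dead end."""
    path = [start]
    while path[-1] != target and len(path) <= max_jumps:
        candidates = [u for u in links.get(path[-1], []) if u not in path]
        if not candidates:
            return None
        costs = {u: _cost(links, u, target) for u in candidates}
        lo, hi = min(costs.values()), max(costs.values())
        # Restricted candidate list: candidates within alpha of the best cost.
        rcl = [u for u in candidates if costs[u] <= lo + alpha * (hi - lo)]
        path.append(rng.choice(rcl))
    return path if path[-1] == target else None


def shorten(links, path):
    """Local search: cut intermediate pages whenever a direct link exists."""
    improved = True
    while improved:
        improved = False
        for i in range(len(path) - 2):
            for j in range(len(path) - 1, i + 1, -1):
                if path[j] in links.get(path[i], []):
                    path = path[: i + 1] + path[j:]
                    improved = True
                    break
            if improved:
                break
    return path


def grasp_jump_sequence(links, start, target, iterations=50, alpha=0.5, seed=0):
    """Repeat construction and local search; keep the sequence with fewest jumps."""
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        path = construct(links, start, target, alpha, rng)
        if path is None:
            continue
        path = shorten(links, path)
        if best is None or len(path) < len(best):
            best = path
    return best


# Hypothetical site structure: page -> links found on that page.
LINKS = {
    "home": ["dept", "search", "news"],
    "dept": ["doctor"],
    "search": ["list"],
    "list": ["doctor"],
    "news": [],
}
best_sequence = grasp_jump_sequence(LINKS, "home", "doctor")
```

Setting `alpha` to 0 makes the construction purely greedy, while 1 makes it purely random; the number of jumps is used as the objective because the description takes the minimum number of jumps as its optimization standard.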

Drawings

Fig. 1 is a block diagram of the Web structured data rapid extraction system in an embodiment.

Fig. 2 is a flowchart of a method for quickly extracting Web structured data in an embodiment.

Detailed Description

To make the above and other features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the figures.

Fig. 1 is a block diagram of the Web structured data rapid extraction system of this embodiment, which includes the following modules:

102: the target determination module determines the relevant attributes of the target data to be extracted, including the URL of the target website, the URL of the target webpage, and the page tags of the target webpage to be crawled;

103: the URL logic generation module analyzes the structure of the target website, derives the relationship of the target webpage within the target website, and provides the optimal URL jump sequence for accessing the target webpage, so that the target webpage can be reached quickly;

104: the tag path generation module matches the page tags of the target webpage and finds the shortest tag path to the target data items, so that the target data can be extracted quickly;

105: the process webpage access module issues an access request for the target webpage once the process has obtained the optimal URL jump sequence, and loads the page;

106: the thread data extraction module has a thread extract the target data from the target webpage, once the process has loaded it, according to the shortest path given by the tag path generation module.

Fig. 2 is a flowchart of the method for rapidly extracting Web structured data of this embodiment, which includes the following steps:

202: given a specific input format, which the user can customize and extend (for example, using https as the delimiter keyword for direct splicing), determining the URL of the target website and the URL of the target webpage according to that format, and splicing and inputting the two URLs in order;

203: traversing the URL links in the pages of the target website level by level, and recording the URLs in the website and the URL jump sequences visited in a list;

204: identifying the URL of the target webpage in the list, and recording the corresponding URL jump sequences;

205: judging whether at least one URL jump sequence has been recorded for the identified URL of the target webpage; if so, proceeding to 206; otherwise returning to 203 to continue traversing the target website and recording URL jump sequences;

206: in each recorded URL jump sequence of the target webpage, splicing the URLs directly into a character string in access order from the target website to the target webpage, to generate a URL jump sequence for accessing the target webpage;

207: using a greedy randomized adaptive search algorithm to select and combine the shortest URL jump sequence for accessing the target webpage, taking it as the optimal URL jump sequence from the target website to the target webpage, and recording it;

208: testing the generated URL jump sequence, using the minimum number of jumps as the criterion for verifying whether the sequence meets the optimization standard; if it does, proceeding to 209; otherwise returning to 207 to continue generating optimized URL jump sequences;

209: issuing an access request for the target webpage and loading its page information;

210: classifying the page information of the target webpage by structure, that is, recording all accessible data items on the target webpage, recording the data item tag corresponding to each data item, and classifying the data item tags by their expanded hierarchical structure; the classified data items and their tag information are recorded as preprocessing for traversing the target data;

211: traversing all recorded data item information to search for the target data items, and recording all tags corresponding to the target data, where the set of all data item tags belonging to a data item constitutes its tag structure information;

212: counting the target data item tags among the recorded tags; if all target data item tags have been found, proceeding to 213; if any are still missing after the search, returning to 211 to continue traversing the tag data;

213: extracting the tag structure information from the recorded tag data and recording it;

214: classifying the recorded tag structure information again, namely by the similarity of the hierarchical structures expanded from the data item tags;

215: from the classified tag structure information, selecting tag structures to merge using an adaptive principle, merging identical paths, converting the tag structures into a tree format, and generating the shortest path;

216: comparing, testing, and verifying the generated shortest path, using the shallowest depth (that is, the shortest path length) of the tag structure information as the evaluation criterion, which is applied by comparing depths among the candidate shortest paths; if the path meets the criterion, proceeding to 217; otherwise returning to 213 to continue extracting tag structure information;

217: starting the data crawling process again on the basis of the optimal URL jump sequence and the shortest path, and accessing the series of target webpages from which data is to be extracted.
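Steps 210 to 216 can be illustrated with a small tag-path collector. This is a hedged sketch rather than the patented procedure: it assumes target data items are identified by a CSS class, stores each item's tag path from the document root, merges identical paths by keeping them in a set, and picks the shallowest path per target. The class `TagPathCollector`, the function `shortest_tag_paths`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-element tag path of every element with a target class."""

    def __init__(self, target_classes):
        super().__init__()
        self.targets = set(target_classes)
        self.stack = []   # currently open tags, outermost first
        self.paths = {}   # target class -> set of tag paths (identical paths merge)

    def handle_starttag(self, tag, attrs):
        path = tuple(self.stack) + (tag,)
        for cls in (dict(attrs).get("class") or "").split():
            if cls in self.targets:
                self.paths.setdefault(cls, set()).add(path)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def shortest_tag_paths(html, target_classes):
    """Return {target class: shallowest tag path} for the given page."""
    collector = TagPathCollector(target_classes)
    collector.feed(html)
    return {
        cls: "/".join(min(paths, key=lambda p: (len(p), p)))
        for cls, paths in collector.paths.items()
    }


# Hypothetical page in which the same data items appear at different depths.
PAGE = """
<html><body>
  <h1 class="name">Dr. Example</h1>
  <div><table>
    <tr><td class="name">Dr. Example</td><td class="title">Chief physician</td></tr>
    <tr><td class="name">Dr. Example</td><td class="title">Chief physician</td></tr>
  </table></div>
</body></html>
"""
result = shortest_tag_paths(PAGE, ["name", "title"])
```

In this example the two table rows produce the same tag path, so they collapse into one entry, and the `name` item resolves to the shallower heading path instead of the table cell.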

Take a test website (the Good Doctor website) as an example:

Table 1. Comparison of results of various Web data extraction methods

As Table 1 shows, compared with the conventional methods, the method of the invention requires fewer hosts and processes when extracting Web structured data, accesses and extracts data from fewer webpages, takes less crawl time, and makes better use of storage space, and can therefore extract Web structured data faster.

The above embodiments are intended only to illustrate the technical solution of the invention and not to limit it. A person skilled in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection shall be determined by the claims.

Claims

1. A method for rapidly extracting Web structured data, comprising the following steps:

traversing the URL links in the pages of a target website level by level, and recording all URLs and their corresponding URL jump sequences in a list;

identifying the URL of the target webpage in the list, recording the corresponding URL jump sequences, and splicing the URLs in order from the target website to the target webpage to generate URL jump sequences for accessing the target webpage;

finding the optimal URL jump sequence for accessing the target webpage according to a greedy randomized adaptive search algorithm;

accessing the target webpage according to the optimal URL jump sequence, and traversing all data item tags of the target webpage level by level to find all target data item tags;

classifying all the found target data item tags by similarity of their expanded hierarchical structure, merging identical paths using an adaptive principle, converting the target data item tags into a tree format, comparing the path lengths of all target data item tags against an evaluation criterion, and taking the shortest target data item tag path that meets the criterion as the shortest path;

and crawling the target data of the target webpage according to the optimal URL jump sequence and the shortest path.

2. The method of claim 1, wherein before the traversal is performed, attribute confirmation is performed on the Web structured data to be extracted, including confirming the URL of the target website, the URL of the target webpage, and the page tags of the target webpage.

3. The method of claim 1, wherein if no URL jump sequence of the target webpage has been recorded for the identified URL of the target webpage, the traversal of the URL links in the target website pages is continued.

4. The method of claim 1, wherein the minimum number of jumps is used to verify whether the URL jump sequence for accessing the target webpage meets the optimization standard, and if it does not, optimized URL jump sequences continue to be generated until the optimal URL jump sequence for accessing the target webpage is obtained.

5. The method of claim 1, wherein when accessing the target web page, all accessible data items on the target web page are recorded, data item tags corresponding to each data item are recorded, the data item tags are classified according to an expanded hierarchical structure, and the classified data items and data item tag information are recorded.

6. The method of claim 1, wherein the number of target data item tags found is counted, and if not all target data item tags have been found, traversal of the data item tags of the target webpage continues.

7. A system for rapidly extracting Web structured data, comprising:

a tag path generation module, configured to match the page tags of the target webpage, classify all the found target data item tags by similarity of their expanded hierarchical structure, merge identical paths using an adaptive principle, convert the target data item tags into a tree format, compare the path lengths of all target data item tags against an evaluation criterion, and take the shortest target data item tag path that meets the criterion as the shortest path;

8. The system of claim 7, further comprising a target determination module for performing attribute confirmation on the target data to be extracted, including confirming the URL of the target website, the URL of the target webpage, and the page tags of the target webpage.