CN106777281B

CN106777281B - Data processing method and device for improving stability and usability of web crawler

Info

Publication number: CN106777281B
Application number: CN201611243842.5A
Authority: CN
Inventors: 张军; 贾西贝
Original assignee: Shenzhen Huaao Data Technology Co Ltd
Current assignee: Shenzhen Huaao Data Technology Co Ltd
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2020-07-17
Anticipated expiration: 2036-12-29
Also published as: CN106777281A

Abstract

The invention relates to a data processing method and device for improving the stability and usability of a web crawler. The method comprises the following steps: step S1, judging whether the current page has a local structural change according to pre-specified features; step S2, if no structural change has occurred, acquiring the structural layout of the current page and parsing the content of the current page according to that structural layout; step S3, adaptively mapping the service field names obtained by parsing according to a pre-configured mapping rule and storing them in a storage area. The method and device can automatically identify non-structural changes of a web page, adopt adaptive data extraction logic, and do not need frequent maintenance.

Description

Data processing method and device for improving stability and usability of web crawler

Technical Field

The invention relates to the technical field of data processing, in particular to a data processing method and device for improving stability and usability of a web crawler.

Background

With the popularization and development of the internet, various types of information such as e-commerce websites, portal websites, blogs, microblogs and the like are published on the internet, and people can collect mass information through the internet and analyze and count the information to acquire required information.

Existing methods use web crawler technology to acquire information. After binary content such as pictures and videos is removed, a web crawler generally acquires the text content of a web page, and a traditional crawler parses the information by using a regular expression, an XPath, or a position.

However, web pages change dynamically: for example, the position of a service field name or field value, the id of an HTML tag, or an XPath path may change at any time. This dynamic nature means that a web crawler needs frequent maintenance, so existing web crawlers have poor universality and high maintenance cost.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a data processing method and device for improving the stability and usability of a web crawler, which can automatically identify non-structural changes of a web page, adopt adaptive data extraction logic, and do not need frequent maintenance.

In a first aspect, the present invention provides a data processing method for improving stability and availability of a web crawler, including: step S1, judging whether the current page has local structural change according to the pre-designated characteristics; step S2, if no structural change occurs, acquiring the structural layout of the current page, and analyzing the content in the current page according to the structural layout of the current page; step S3, according to the pre-configured mapping rule, the service field name obtained by parsing is mapped adaptively and stored in the storage area.

The data processing method for improving the stability and the usability of the web crawler can automatically identify the non-structural change of the web page, adopts self-adaptive data extraction logic, does not need frequent maintenance, saves the cost, improves the stability of web data crawling, and has better universality.

Preferably, the step S1 includes: comparing the pre-specified features with the corresponding tags of the current page one by one, and if any feature is inconsistent with the corresponding tag of the current page, determining that the current page has a local structural change.

Preferably, the step S2 includes: acquiring an HTML file of the current page; extracting the content in the Table tag and the content in the div tag from the HTML file; acquiring the structural layout of the current page according to the content in the Table tag, and parsing the content according to that structural layout; and acquiring the structural layout of the current page according to the content in the div tag, and parsing the content according to that structural layout.

Preferably, the obtaining the structural layout of the current page according to the content in the Table tag and analyzing the content according to the structural layout of the current page includes: detecting a header portion in the Table tag; extracting multi-dimensional information of the header-removed part in the Table label; judging the structural layout according to the extracted multi-dimensional information; and acquiring service data according to the structural layout.

Preferably, the obtaining the structural layout of the current page according to the content in the div tag and analyzing the content according to the structural layout of the current page includes: acquiring the labels that match known service field names from the div tag, judging the structural layout according to the positions of the matched labels in the div tag, and acquiring service data according to the structural layout.

In a second aspect, the present invention provides a data processing apparatus for improving stability and availability of a web crawler, including: the structural change detection module is used for judging whether the current page has local structural change according to the characteristics specified in advance; the analysis module is used for acquiring the structural layout of the current page if structural change does not occur, and analyzing the content in the current page according to the structural layout of the current page; and the field self-adaptive adjusting module is used for carrying out self-adaptive mapping on the service field names acquired by analysis according to a preset mapping rule and storing the service field names into a storage area.

The data processing device for improving the stability and the usability of the web crawler can automatically identify the non-structural change of the web page, adopts self-adaptive data extraction logic, does not need frequent maintenance, saves the cost, improves the stability of web data crawling, and has better universality.

Preferably, the structural change detection module is specifically configured to: compare the pre-specified features with the corresponding tags of the current page one by one, and if any feature is inconsistent with the corresponding tag of the current page, determine that the current page has a local structural change.

Preferably, the parsing module is specifically configured to: obtain an HTML file of the current page; extract the content in the Table tag and the content in the div tag from the HTML file; obtain the structural layout of the current page according to the content in the Table tag, and parse the content according to that structural layout; and obtain the structural layout of the current page according to the content in the div tag, and parse the content according to that structural layout.

Preferably, in the parsing module, obtaining the structural layout of the current page according to the content in the Table tag, and parsing the content according to the structural layout of the current page includes: detecting a header portion in the Table tag; extracting multi-dimensional information of the header-removed part in the Table label; judging the structural layout according to the extracted multi-dimensional information; and acquiring service data according to the structural layout.

Preferably, in the parsing module, obtaining the structural layout of the current page according to the content in the div tag, and parsing the content according to the structural layout of the current page includes: acquiring the labels that match known service field names from the div tag, judging the structural layout according to the positions of the matched labels in the div tag, and acquiring service data according to the structural layout.

Drawings

FIG. 1 is a flow chart of a data processing method for improving the stability and availability of a web crawler according to an embodiment of the present invention;

FIG. 2 is a layout of a header section, a remark section, and a business data section in an exemplary table;

FIG. 3 is an example of a vertical multiple TL layout;

FIG. 4 is an example of a horizontal multiple TL layout;

FIG. 5 is an example of cut merging for a table of multiple TL layouts;

FIG. 6 is an example of cut merging for a table of multiple TL layouts;

FIG. 7 is an example of processing a table for a single TL (multi-level) layout;

FIG. 8 is a block diagram of a data processing apparatus for improving stability and usability of a web crawler according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.

A table in a web page is defined by the <table> tag in HTML. Each row of the table is defined by a <tr> tag; a <tr> must be inside a <table></table> pair and cannot be used on its own. Each row is divided into cells, and each cell is defined by a <td> tag, which must be nested between <tr> and </tr>. The <th></th> tag, which must likewise be nested in a <tr>, defines a header cell and contains the header information.

A table defined in this way is displayed in the web page as follows:

Name	Age
Zhang San	40

The div tag in HTML is used to define a partition or section (division/section) in the document. The <div> tag can divide the document into separate, distinct parts.

The embodiment provides a data processing method for improving stability and availability of a web crawler, as shown in fig. 1, including:

Step S1, judging whether the current page has a local structural change according to the pre-specified features.

A structural change means that the structural layout of the page has changed: for example, a certain tag is missing, an attribute of a certain tag has changed, or the number of rows or columns of a table has changed.

Step S2, if no structural change occurs, obtaining the structural layout of the current page, and analyzing the content in the current page according to the structural layout of the current page.

Step S3, according to the pre-configured mapping rule, the service field name obtained by parsing is mapped adaptively and stored in the storage area.

The service field name refers to the title of each item of service data, such as "law execution institute" and "execution case number" in FIG. 2. Adaptive mapping means that the service field names obtained by parsing are replaced with preset standard fields, so that the field names of the extracted data are unified, which facilitates subsequent data management and statistics. For example, "enterprise name" and "organization name" are automatically mapped to "company name" in the storage layer.
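The adaptive mapping of step S3 can be illustrated with a minimal Python sketch. The rule table `FIELD_MAPPING` and the function `map_field_names` are hypothetical names introduced only for illustration; the patent does not specify how the mapping rule is stored.

```python
# Hypothetical mapping rule: field names as seen on pages -> standard storage fields.
FIELD_MAPPING = {
    "enterprise name": "company name",
    "organization name": "company name",
}


def map_field_names(record, mapping=FIELD_MAPPING):
    """Return a copy of `record` whose keys are replaced by standard field names.

    Field names that have no rule are kept unchanged.
    """
    return {mapping.get(name, name): value for name, value in record.items()}


mapped = map_field_names({"enterprise name": "Acme", "address": "Shenzhen"})
```

Keeping the rule in a plain dictionary means a new page variant only needs a new entry, not a change to the extraction logic.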

The data processing method for improving the stability and the usability of the web crawler can automatically identify the non-structural change of the web page, adopts the self-adaptive data extraction logic, does not need frequent maintenance, saves the cost, improves the stability of web data crawling, and has better universality.

Step S1 specifically includes: comparing the pre-specified features with the corresponding tags of the current page one by one, and if any feature is inconsistent with the corresponding tag of the current page, determining that the current page has a local structural change.

HTML may include various types of tags, such as Table tags and div tags, and different tags call for different extraction methods. In order to handle hybrid HTML, step S2 specifically includes:

Step S21, acquiring the HTML file of the current page.

Step S22, extracting the content in the Table tag and the content in the div tag from the HTML file.
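A minimal sketch of the one-by-one comparison in step S1, using Python's standard `html.parser` module. It assumes that each pre-specified feature is a pair of tag name and required attributes; this representation and the names `TagCollector` and `has_structural_change` are illustrative assumptions, not part of the patent.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect (tag, attributes) for every start tag in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def has_structural_change(html, features):
    """Return True if any expected feature has no matching tag in the page.

    `features` is a list of (tag_name, required_attributes) pairs (assumed format).
    """
    collector = TagCollector()
    collector.feed(html)
    for tag, required in features:
        found = any(
            seen_tag == tag
            and all(seen_attrs.get(key) == value for key, value in required.items())
            for seen_tag, seen_attrs in collector.tags
        )
        if not found:
            return True
    return False


features = [("table", {"id": "result"}), ("div", {"class": "pager"})]
page_ok = '<table id="result"><tr><td>1</td></tr></table><div class="pager"></div>'
page_changed = '<table id="list"><tr><td>1</td></tr></table><div class="pager"></div>'
```

Here `page_changed` differs from `page_ok` only in the id of the table, which is enough for the check to report a structural change.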

Step S23, obtaining the structural layout of the current page according to the content in the Table tag, and analyzing the content according to the structural layout of the current page.

Step S24, obtaining the structural layout of the current page according to the content in the div tag, and parsing the content according to the structural layout of the current page.

Tables on a web page are built with the HTML Table tag, and most of this information is semi-structured data: although it is displayed regularly on the page, the underlying tags and data are irregular or even chaotic, so the header part gets mixed with the service data and the service data cannot be extracted quickly and accurately. To address this, step S23 specifically includes:

Step S231, detecting the header part in the Table tag.

As shown in fig. 2, the header portion of the table is usually a large merge cell, which may be one or more rows, and the table may further include a remark portion, where the structure of the remark portion is similar to that of the header portion, and the remaining portion is the service data to be extracted except the header portion and the remark portion of the table. When the remark section is present in the table, the remark section needs to be detected in step S1 in the same manner as the header section.

Step S232, extracting the multi-dimensional information of the header-removed part in the Table label.

The multi-dimensional information includes direct content, th/td distribution, class attribute distribution, background-color attribute distribution, and so on. The direct content is the content displayed directly in the table in the web page, that is, the text content inside the <table> tag, such as "name", "age", "Zhang San" and "40". The th/td distribution refers to where the th and td tags are located in the table. The class attribute specifies the class name of the element in a cell, and the class attribute distribution refers to where each class attribute value is located in the table. The background-color attribute specifies the background color of a cell, and the background-color attribute distribution refers to where each background-color value is located in the table.

Step S233, determining a structural layout according to the extracted multidimensional information.

Common table layouts are divided into horizontal single TL, horizontal multiple TL, vertical single TL, vertical multiple TL, and multi-table combination. Here TL (Title Line) is the column header (or data header part); it may physically span multiple rows but is logically one area, and it holds the title of each item of service data. For example, the first row of the service data part in FIG. 2 is a TL. A TL can be horizontal or vertical: FIG. 3 shows a vertical multiple TL layout, and FIG. 4 shows a horizontal multiple TL layout.

Step S234, acquiring the service data according to the structure layout.

Step S23 provides a method for adaptively extracting structured information from an HTML Table tag. The method first detects the header part in the Table tag and excludes content that does not belong to the service data part, which prevents useless data from being mixed in. It then extracts multi-dimensional information from the header-removed part of the Table tag and comprehensively judges the structural layout of the table from that information. Because the information in the Table tag reflects the table layout, the new table layout can be obtained by analyzing that information no matter how the table in the web page changes.

The header part and the remark part are generally in the first or second row of the table, and each is a merged cell. A specific implementation of step S231 therefore includes: detecting, row by row, whether each row in the Table tag is a merged cell; if so, the row belongs to the header part and the next row is examined; if not, the service data begins at this row and header detection stops. For example, the code for the header and remark parts generally has the following form:

<tr><td colspan='5'>2016 demographic table</td></tr>

The code contains only one <td> tag, and colspan='5' indicates that it is a merged cell. By detecting <td> and colspan, the header part and the remark part can be identified.

In the prior art, when filtering useless data (such as a header part and a remark part), the positions of the useless data need to be known in advance, and then the positions are specified in a program so as to skip the previous rows of the useless data. The method of the embodiment has universality, and no matter how many lines of the title part and the remark part exist in the table, the lines of the title part and the remark part can be accurately and efficiently detected, so that the business data can be accurately extracted.
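The row-by-row header detection of step S231 can be sketched as follows. The sketch assumes the table has already been parsed into rows of `(text, colspan)` pairs; that representation and the function name `count_header_rows` are assumptions made for illustration.

```python
def count_header_rows(rows):
    """Count the leading rows that consist of a single merged cell.

    `rows` is a list of rows; each row is a list of (text, colspan) cells.
    Scanning stops at the first row that is not a single merged cell,
    which is taken as the start of the service data.
    """
    count = 0
    for row in rows:
        is_merged_row = len(row) == 1 and row[0][1] > 1
        if not is_merged_row:
            break
        count += 1
    return count


rows = [
    [("2016 demographic table", 5)],   # header part
    [("unit: person", 5)],             # remark part
    [("name", 1), ("age", 1), ("sex", 1), ("city", 1), ("job", 1)],
    [("Zhang San", 1), ("40", 1), ("male", 1), ("Shenzhen", 1), ("clerk", 1)],
]
header_rows = count_header_rows(rows)
service_rows = rows[header_rows:]
```

Because the loop stops at the first ordinary row, the number of header and remark rows does not have to be known in advance.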

In the process of extracting data, besides directly acquiring the information of conventional cells, merged cells need special processing so that the extracted data conforms to the storage format and is easier to process later. A preferred implementation of step S232 therefore includes: extracting the multi-dimensional information of the header-removed part of the Table tag (the remark part, if present, is removed as well), splitting the merged cells in the extracted information, storing the information of each dimension in its own two-dimensional array, and specially marking the split cells.

Merged cells are divided into horizontal merging (colspan), vertical merging (rowspan) and mixed merging (colspan + rowspan). For example, the direct content extracted from <td colspan='5' bgcolor="#F7FBFE">ABC</td> is:

ABC

{←}

Here the special mark "{←}" is specific to direct-content extraction and indicates that the content of the cell is the same as that of the cell to its left. It provides flexibility for TL processing and final content output; the other dimensions are extracted without special marks.

Extracting 'background-color attribute distribution' as follows:

#F7FBFE

For the case where a single row contains multiple horizontal merges (colspan), coordinate translation also needs attention. For example, the direct content extracted from <td colspan='2'>ABC</td><td colspan='2'>DEF</td> is:

ABC

{←}

DEF

{←}

Similar methods are also used for data extraction for longitudinal merging (rowspan) and mixed merging (colspan + rowspan).
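Splitting horizontally merged cells into a two-dimensional array of direct content can be sketched as below. Only colspan is handled; the input format of `(text, colspan)` pairs and the function names are assumptions for illustration, and rowspan or mixed merging would need additional bookkeeping that is not shown.

```python
LEFT_MARK = "{←}"  # special mark: same content as the cell to the left


def expand_row(cells):
    """Expand one row of (text, colspan) cells into a flat list of strings."""
    expanded = []
    for text, colspan in cells:
        expanded.append(text)
        # Each extra column covered by the merged cell gets the special mark,
        # which also shifts the coordinates of the cells that follow.
        expanded.extend([LEFT_MARK] * (colspan - 1))
    return expanded


def expand_table(rows):
    """Expand every row so the result is a two-dimensional array of strings."""
    return [expand_row(row) for row in rows]


grid = expand_table([
    [("ABC", 2), ("DEF", 2)],
    [("1", 1), ("2", 1), ("3", 1), ("4", 1)],
])
```

After expansion every row has one entry per physical column, so cells can be addressed by (row, column) coordinates regardless of merging.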

Only if the table layout is known, the business data can be accurately extracted, and the table is converted into the structured data according to the table layout. The judgment of the table layout in step S233 includes the following operations:

(1) Rows and columns that are not TL are excluded according to the extracted direct content.

The exclusion is judged from the data type, length and keywords of the direct content in a TL, under the following conditions: the length of the field name in each TL cell cannot exceed a threshold (for example, 50); the number of field names in a TL cannot exceed a threshold (for example, 1000); a field name cannot be a pure numeric string; and common field names contain keywords such as 'Name', 'Address', 'type' and 'remark'. A keyword library is obtained from statistics over common tables, and each row or column is checked for keywords from that library.

The table layout judgment based on direct content is therefore carried out by checking the extracted direct content row by row and column by column: if the data type of a direct content item is a pure numeric string, its row or column is not a TL; if the field length of a direct content item exceeds the first threshold, its row or column is not a TL; and if several direct content items in a row or column contain given keywords, that row or column is a TL.

When the keyword-based judgment is used, at least two keywords must appear before the row or column is identified as a TL, to ensure the reliability of the judgment.
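The exclusion and keyword rules of operation (1) can be sketched as two small checks. The thresholds are the example values from the text; the keyword set is a tiny hypothetical stand-in for the keyword library, and both function names are illustrative.

```python
MAX_NAME_LENGTH = 50     # example threshold from the text
MAX_FIELD_COUNT = 1000   # example threshold from the text
KEYWORDS = {"name", "address", "type", "remark"}  # hypothetical keyword library


def could_be_tl(cells):
    """Return False if the row or column is ruled out as a TL by its direct content."""
    names = [cell.strip() for cell in cells if cell.strip()]
    if not names or len(names) > MAX_FIELD_COUNT:
        return False
    for name in names:
        # A field name may not be too long or a pure numeric string.
        if len(name) > MAX_NAME_LENGTH or name.isdigit():
            return False
    return True


def is_tl_by_keywords(cells, min_hits=2):
    """Return True if at least `min_hits` cells contain a known keyword."""
    hits = sum(1 for cell in cells if any(key in cell.lower() for key in KEYWORDS))
    return hits >= min_hits
```

For instance, a row such as `["Name", "Address", "Age"]` passes both checks, while `["1", "Zhang San", "40"]` is excluded because it contains pure numeric strings.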

(2) And judging the table layout according to the extracted background-color attribute distribution.

When a table is displayed, to make it easier for a user to read, the background color of the TL may differ from the background color of the data, or the odd and even data rows may use alternating background colors. The background-color attribute distribution can therefore be used to determine which rows or columns may be a TL, and in turn whether the table layout is horizontal or vertical.

(3) And judging the table layout according to the extracted class attribute distribution.

Cells with the same class attribute are typically homogeneous cells. If the class attributes of the cells in each row are the same, the table layout is horizontal; if the class attributes of the cells in each column are the same, the table layout is vertical. Whether the layout is horizontal or vertical can therefore be judged from the class attribute distribution.

(4) And judging the table layout according to whether the data types of the direct contents in the same row or the same column are the same.

Apart from the TL part, the cells under each TL field name should have the same data type as long as they are not empty (the method distinguishes only three types: 'pure numeric string', 'date-time string' and 'string with no obvious features'). For example, the table in FIG. 2 has a horizontal layout in which, except for the TL in the first row, the cells of each column have the same data type: the column under the field name "serial number" contains pure numeric strings, the column under "execution court" contains strings with no obvious features, and the column under "execution case" contains strings with no obvious features. In short, the data type within each column is uniform outside the TL row.

Based on this characteristic, the data types within each row are checked: if they are the same in every row of the table (that is, all cells in a row are 'pure numeric strings', or all are 'date-time strings', or all are 'strings with no obvious features'), the table has a vertical layout. Likewise, the data types within each column are checked: if they are the same in every column of the table, the table has a horizontal layout.

To keep empty cells from influencing the detection result, cells with empty content are excluded from the detection range when rows and columns are checked.

The service data part of a table is generally large, and checking every row and column would reduce efficiency, so short-circuit judgment can be adopted: as soon as the result for a new row or column rules out a certain layout, the remaining checks for that layout are skipped.
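A minimal sketch of operation (4), including the empty-cell rule and the short-circuit judgment. It assumes the TL has already been removed, that the data is rectangular, and that date-time strings look like `YYYY-MM-DD` with an optional time; the regular expression and all names are illustrative assumptions.

```python
import re

# Assumed date-time format: 2016-12-29 or 2016-12-29 08:30[:15]
DATE_RE = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}( \d{1,2}:\d{2}(:\d{2})?)?$")


def cell_type(text):
    """Classify a cell as 'numeric', 'datetime' or 'plain' (no obvious features)."""
    text = text.strip()
    if text.isdigit():
        return "numeric"
    if DATE_RE.match(text):
        return "datetime"
    return "plain"


def lines_are_homogeneous(lines):
    """True if, within every line, all non-empty cells share one data type."""
    for line in lines:
        types = {cell_type(cell) for cell in line if cell.strip()}
        if len(types) > 1:
            return False  # short circuit: this layout is already ruled out
    return True


def layout_by_types(data_rows):
    """Guess the layout from the data part of a table (TL already removed)."""
    columns = [list(column) for column in zip(*data_rows)]
    if lines_are_homogeneous(columns):
        return "horizontal"
    if lines_are_homogeneous(data_rows):
        return "vertical"
    return "unknown"
```

When both rows and columns are homogeneous (for example, every cell is plain text), this sketch reports "horizontal"; in practice the other dimensions would be needed to break the tie.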

(5) The table layout is judged according to the th/td distribution.

The number of cells in a TL is less than or equal to the number of cells in other rows, and the numbers of cells in non-TL rows should be uniform. The number of cells in every row and column is counted according to the th/td distribution; a row or column with noticeably fewer cells may be a TL, and the table layout follows from whether that TL is horizontal or vertical.

th is generally used to define a title and corresponds to field names such as 'name' and 'age'. The form of th can also differ between horizontal and vertical layouts: for example, <th colspan='3'>achievement list</th> in a horizontal layout and <th rowspan='3'>achievement list</th> in a vertical layout.

td can be used to define a general cell and also can be used to define a title.

When th tags and td tags exist at the same time, the table layout is judged according to the th distribution. However, many tables do not specify th; in that case the table layout is judged from the td distribution.
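One simple way to read a layout from the th distribution is sketched below. It assumes the tag names have been stored in a rectangular two-dimensional array, and it uses the heuristic that a first row made entirely of th suggests a horizontal layout while a first column made entirely of th suggests a vertical one. This heuristic is an illustrative assumption; the text does not spell out the exact rule.

```python
def layout_by_th(tag_grid):
    """Guess the layout from a 2D list of 'th' / 'td' tag names, one per cell."""
    if not tag_grid or not tag_grid[0]:
        return "unknown"
    first_row_all_th = all(tag == "th" for tag in tag_grid[0])
    first_col_all_th = all(row[0] == "th" for row in tag_grid)
    if first_row_all_th and not first_col_all_th:
        return "horizontal"
    if first_col_all_th and not first_row_all_th:
        return "vertical"
    # No th at all, or th in both directions: fall back to other dimensions.
    return "unknown"
```

A result of "unknown" corresponds to the case where th is not specified and the td distribution has to be used instead.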

In addition, the method of this embodiment can identify the case of multiple TLs in a table, which improves the reliability of the extracted data.

When the table layout is vertical, the table formed by the direct content also needs to be converted to a horizontal layout.
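If the conversion from a vertical to a horizontal layout is understood as a transposition of the direct-content array (an assumption; the text does not name the operation), it can be sketched in a few lines. The function name `to_horizontal` is hypothetical.

```python
def to_horizontal(vertical_table):
    """Transpose a rectangular 2D array so that a TL column becomes a TL row."""
    return [list(row) for row in zip(*vertical_table)]


vertical = [
    ["name", "Zhang San"],
    ["age", "40"],
]
horizontal = to_horizontal(vertical)
```

After the conversion the field names sit in the first row, so the same downstream extraction can be used for both layouts.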

TL is divided into two types, single-level TL and multi-level TL, which are collectively referred to as TL unless otherwise specified. The table in FIG. 2 has only one TL, and it is a single-level TL. The table in FIG. 7 has only one TL, and it is a multi-level TL (composed of multiple rows with an upper-lower hierarchical relationship). A multi-level TL needs to be merged into a single row of field names for output. As shown in FIG. 7, the TL in the original table is divided into two parts: the left part spans multiple rows (multi-level) and the right part is a single row. The first level of the multi-level part is a merged cell with the field name 'basic information', and the second level contains 'name', 'age' and 'gender'. The final output is a single-level TL with the structure "basic information_name", "basic information_age", "basic information_gender", "other field A" and "other field B".

When the table layout has multiple TLs, the table formed by the direct content needs to be cut and merged into a single-TL layout to meet the format requirement of structured data. The cut-and-merge operation includes comparing the direct contents of the TLs: for TLs with the same content, only one TL row is kept, as shown in FIG. 5; TLs with different contents are spliced into one TL row, as shown in FIG. 6.

Finally, for the merged cells, the special mark can be corrected according to the service requirement. For example:
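Merging a two-level TL into a single row of field names can be sketched as follows. The sketch assumes the first level has already been expanded with the "{←}" mark and that cells without a second level are represented by empty strings; both conventions, and the function name `flatten_tl`, are assumptions for illustration.

```python
LEFT_MARK = "{←}"  # special mark: same content as the cell to the left


def flatten_tl(level1, level2, sep="_"):
    """Join a two-level TL into one row of names such as 'parent_child'."""
    names = []
    parent = ""
    for top, sub in zip(level1, level2):
        if top != LEFT_MARK:
            parent = top          # a new first-level name starts here
        names.append(parent + sep + sub if sub else parent)
    return names


level1 = ["basic information", LEFT_MARK, LEFT_MARK, "other field A", "other field B"]
level2 = ["name", "age", "gender", "", ""]
single_tl = flatten_tl(level1, level2)
```

The resulting list matches the single-level TL structure described for FIG. 7.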

ABC

{←}

It can be adjusted to the following format:

ABC

The above method for adaptively extracting structured information from an HTML Table tag handles a single table. When a web page contains several Table tags (several tables), the method is simply applied repeatedly to extract the table corresponding to each Table tag, and the extraction results are then merged according to a preset rule.

For data extraction from a div layout, step S24 specifically includes: acquiring the labels that match known service field names from the div tag, judging the structural layout according to the positions of the labels in the div tag, and acquiring service data according to the structural layout.

A label is a field name in a div tag, such as "name", "age" and "sex" in Example 1 and Example 2, both of which are tables laid out with div tags. The labels are extracted from the div tags according to the known field names. In Example 1, the extracted labels all lie in the same column, so the structural layout of the table can be determined to be a left-right key-value layout (vertical layout). In Example 2, the extracted labels all lie in the same row, so the structural layout of the table can be determined to be an upper-lower layout (horizontal layout).

Example 1

<div><div>name</div><div>Zhang San</div></div>

<div><div>age</div><div>18</div></div>

<div><div>sex</div><div>male</div></div>

Example 2

<div><div>name</div><div>age</div><div>sex</div></div>

<div><div>Zhang San</div><div>18</div><div>male</div></div>
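The position-based judgment of step S24 can be sketched as below. The sketch assumes the nested div tags have already been parsed into rows of text (each outer div is a row, each inner div a cell); that representation and the names `div_layout` and `extract_records` are assumptions for illustration.

```python
def div_layout(rows, known_labels):
    """Judge the layout from where the known field names sit in the div grid."""
    positions = [
        (r, c)
        for r, row in enumerate(rows)
        for c, text in enumerate(row)
        if text in known_labels
    ]
    if len(positions) < 2:
        return "unknown"
    row_indexes = {r for r, _ in positions}
    col_indexes = {c for _, c in positions}
    if len(col_indexes) == 1 and len(row_indexes) > 1:
        return "vertical"     # labels in one column: left-right key-value layout
    if len(row_indexes) == 1 and len(col_indexes) > 1:
        return "horizontal"   # labels in one row: upper-lower layout
    return "unknown"


def extract_records(rows, layout):
    """Turn the div grid into a list of {field name: value} records."""
    if layout == "vertical":
        return [{row[0]: row[1] for row in rows}]
    if layout == "horizontal":
        header, *data = rows
        return [dict(zip(header, values)) for values in data]
    return []


labels = {"name", "age", "sex"}
example1 = [["name", "Zhang San"], ["age", "18"], ["sex", "male"]]
example2 = [["name", "age", "sex"], ["Zhang San", "18", "male"]]
```

Both examples yield the same record once the layout has been recognized, which is the point of judging the layout before extracting the data.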

Based on the same inventive concept as the data processing method for improving the stability and the availability of the web crawler, the present embodiment further provides a data processing apparatus for improving the stability and the availability of the web crawler, as shown in fig. 8, including: the structural change detection module is used for judging whether the current page has local structural change according to the characteristics specified in advance; the analysis module is used for acquiring the structural layout of the current page if structural change does not occur, and analyzing the content in the current page according to the structural layout of the current page; and the field self-adaptive adjusting module is used for carrying out self-adaptive mapping on the service field names acquired by analysis according to a preset mapping rule and storing the service field names into a storage area.

Further, the structural change detection module is specifically configured to: compare the pre-specified features with the corresponding tags of the current page one by one, and if any feature is inconsistent with the corresponding tag of the current page, determine that the current page has a local structural change.

Further, the parsing module is specifically configured to: obtain an HTML file of the current page; extract the content in the Table tag and the content in the div tag from the HTML file; obtain the structural layout of the current page according to the content in the Table tag, and parse the content according to that structural layout; and obtain the structural layout of the current page according to the content in the div tag, and parse the content according to that structural layout.

Further, in the parsing module, obtaining the structural layout of the current page according to the content in the Table tag, and parsing the content according to the structural layout of the current page includes: detecting a header portion in a Table tag; extracting multi-dimensional information of the header-removed part in the Table label; judging the structural layout according to the extracted multidimensional information; and acquiring the service data according to the structural layout.

Further, in the parsing module, obtaining the structural layout of the current page according to the content in the div tag, and parsing the content according to the structural layout of the current page includes: acquiring the labels that match known service field names from the div tag, judging the structural layout according to the positions of the matched labels in the div tag, and acquiring service data according to the structural layout.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims

1. A data processing method for improving web crawler stability and availability, comprising:

step S1, judging whether the current page has local structural change according to the pre-designated characteristics;

step S2, if no structural change occurs, acquiring the structural layout of the current page, and analyzing the content in the current page according to the structural layout of the current page;

step S3, according to the preset mapping rule, adaptively mapping the service field names obtained by parsing and storing them in the storage area;

the self-adaptive mapping is to replace the service field names obtained by analysis with preset standard fields so as to uniformly extract the service field names of the data;

the step S2 includes:

acquiring an HTML file of the current page;

extracting the content in a Table tag and the content in a div tag from the HTML file;

acquiring the structural layout of the current page according to the content in the Table tag, and analyzing the content according to the structural layout of the current page;

acquiring the structural layout of the current page according to the content in the div tag, and analyzing the content according to the structural layout of the current page;

the obtaining the structural layout of the current page according to the content in the Table tag and analyzing the content according to the structural layout of the current page includes:

detecting a header portion in the Table tag;

extracting multi-dimensional information from the part of the Table tag other than the header;

judging the structural layout according to the extracted multi-dimensional information;

and acquiring service data according to the structural layout.

2. The method according to claim 1, wherein the step S1 includes: comparing the characteristics specified in advance with the corresponding tags of the current page one by one, and if the characteristics are inconsistent with the corresponding tags of the current page, determining that the current page has local structural change.

3. The method of claim 1, wherein the obtaining the structural layout of the current page according to the content in the div tag and parsing the content according to the structural layout of the current page comprises: acquiring, from the div tag, the tags that match known service field names; judging the structural layout according to the positions of the matched tags in the div tag; and acquiring service data according to the structural layout.

4. A data processing apparatus for improving web crawler stability and availability, comprising:

the structural change detection module is used for judging whether the current page has local structural change according to the characteristics specified in advance;

the analysis module is used for acquiring the structural layout of the current page if structural change does not occur, and analyzing the content in the current page according to the structural layout of the current page;

the field self-adaptive adjusting module is used for carrying out self-adaptive mapping on the service field names acquired through analysis according to a preset mapping rule and storing the service field names into a storage area;

the analysis module is specifically configured to:

acquiring an HTML file of the current page;

in the analyzing module, acquiring the structural layout of the current page according to the content in the Table tag, and analyzing the content according to the structural layout of the current page includes:

detecting a header portion in the Table tag;

and acquiring service data according to the structural layout.

5. The apparatus of claim 4, wherein the structural change detection module is specifically configured to: compare the characteristics specified in advance with the corresponding tags of the current page one by one, and if the characteristics are inconsistent with the corresponding tags of the current page, determine that the current page has local structural change.

6. The apparatus of claim 4, wherein, in the parsing module, obtaining the structural layout of the current page according to the content in the div tag and parsing the content according to the structural layout of the current page comprises: acquiring, from the div tag, the tags that match known service field names; judging the structural layout according to the positions of the matched tags in the div tag; and acquiring service data according to the structural layout.