CN106777281B - Data processing method and device for improving stability and usability of web crawler - Google Patents

Data processing method and device for improving stability and usability of web crawler

Info

Publication number
CN106777281B
Authority
CN
China
Prior art keywords
current page
content
structural layout
structural
layout
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611243842.5A
Other languages
Chinese (zh)
Other versions
CN106777281A (en)
Inventor
张军
贾西贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaao Data Technology Co Ltd
Original Assignee
Shenzhen Huaao Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huaao Data Technology Co Ltd filed Critical Shenzhen Huaao Data Technology Co Ltd
Priority to CN201611243842.5A priority Critical patent/CN106777281B/en
Publication of CN106777281A publication Critical patent/CN106777281A/en
Application granted granted Critical
Publication of CN106777281B publication Critical patent/CN106777281B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data processing method and device for improving stability and usability of a web crawler. The method provided by the invention comprises the following steps: step S1, judging whether the current page has local structural change according to the pre-designated characteristics; step S2, if no structural change occurs, acquiring the structural layout of the current page, and analyzing the content in the current page according to the structural layout of the current page; step S3, according to the pre-configured mapping rule, the service field name obtained by parsing is mapped adaptively and stored in the storage area. The data processing method and the data processing device for improving the stability and the usability of the web crawler can automatically identify the non-structural change of the web page, adopt the self-adaptive data extraction logic and do not need frequent maintenance.

Description

Data processing method and device for improving stability and usability of web crawler
Technical Field
The invention relates to the technical field of data processing, in particular to a data processing method and device for improving stability and usability of a web crawler.
Background
With the popularization and development of the internet, various types of information such as e-commerce websites, portal websites, blogs, microblogs and the like are published on the internet, and people can collect mass information through the internet and analyze and count the information to acquire required information.
In the existing method, web crawler technology is used to acquire the information. Binary content such as pictures and videos is discarded, so a web crawler generally acquires the text content of a web page, and a traditional crawler parses that content using regular expressions, XPath, or fixed positions.
The problem is that web pages change dynamically: the position of a service field name or field value, the id of an HTML tag, or an XPath path may change at any time. Because pages change in this way, a web crawler needs frequent maintenance, so existing web crawlers have poor universality and high maintenance cost.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a data processing method and device for improving the stability and usability of a web crawler, which can automatically identify non-structural changes of a web page, adopt adaptive data extraction logic, and do not need frequent maintenance.
In a first aspect, the present invention provides a data processing method for improving stability and availability of a web crawler, including: step S1, judging whether the current page has local structural change according to the pre-designated characteristics; step S2, if no structural change occurs, acquiring the structural layout of the current page, and analyzing the content in the current page according to the structural layout of the current page; step S3, according to the pre-configured mapping rule, the service field name obtained by parsing is mapped adaptively and stored in the storage area.
The data processing method for improving the stability and the usability of the web crawler can automatically identify the non-structural change of the web page, adopts self-adaptive data extraction logic, does not need frequent maintenance, saves the cost, improves the stability of web data crawling, and has better universality.
Preferably, step S1 includes: comparing the pre-specified features one by one with the corresponding tags of the current page, and if they are inconsistent, determining that the current page has a local structural change.
Preferably, step S2 includes: acquiring the HTML file of the current page; extracting the content in the Table tag and the content in the div tag from the HTML file; acquiring the structural layout of the current page according to the content in the Table tag and parsing that content according to the structural layout; and acquiring the structural layout of the current page according to the content in the div tag and parsing that content according to the structural layout.
Preferably, acquiring the structural layout of the current page according to the content in the Table tag and parsing the content according to that structural layout includes: detecting the header portion in the Table tag; extracting multi-dimensional information from the part of the Table tag that remains after the header portion is removed; judging the structural layout according to the extracted multi-dimensional information; and acquiring service data according to the structural layout.
Preferably, acquiring the structural layout of the current page according to the content in the div tag and parsing the content according to that structural layout includes: acquiring, from the div tag, the labels that match known service field names; judging the structural layout according to the positions of the matched labels in the div tag; and acquiring service data according to the structural layout.
In a second aspect, the present invention provides a data processing apparatus for improving stability and availability of a web crawler, including: the structural change detection module is used for judging whether the current page has local structural change according to the characteristics specified in advance; the analysis module is used for acquiring the structural layout of the current page if structural change does not occur, and analyzing the content in the current page according to the structural layout of the current page; and the field self-adaptive adjusting module is used for carrying out self-adaptive mapping on the service field names acquired by analysis according to a preset mapping rule and storing the service field names into a storage area.
The data processing device for improving the stability and the usability of the web crawler can automatically identify the non-structural change of the web page, adopts self-adaptive data extraction logic, does not need frequent maintenance, saves the cost, improves the stability of web data crawling, and has better universality.
Preferably, the structural change detection module is specifically configured to: compare the pre-specified features one by one with the corresponding tags of the current page, and if they are inconsistent, determine that the current page has a local structural change.
Preferably, the parsing module is specifically configured to: acquire the HTML file of the current page; extract the content in the Table tag and the content in the div tag from the HTML file; acquire the structural layout of the current page according to the content in the Table tag and parse that content according to the structural layout; and acquire the structural layout of the current page according to the content in the div tag and parse that content according to the structural layout.
Preferably, in the parsing module, acquiring the structural layout of the current page according to the content in the Table tag and parsing the content according to that structural layout includes: detecting the header portion in the Table tag; extracting multi-dimensional information from the part of the Table tag that remains after the header portion is removed; judging the structural layout according to the extracted multi-dimensional information; and acquiring service data according to the structural layout.
Preferably, in the parsing module, acquiring the structural layout of the current page according to the content in the div tag and parsing the content according to that structural layout includes: acquiring, from the div tag, the labels that match known service field names; judging the structural layout according to the positions of the matched labels in the div tag; and acquiring service data according to the structural layout.
Drawings
FIG. 1 is a flow chart of a data processing method for improving the stability and availability of a web crawler according to an embodiment of the present invention;
FIG. 2 is a layout of a header section, a remark section, and a business data section in an exemplary table;
FIG. 3 is an example of a vertical multiple-TL layout;
FIG. 4 is an example of a horizontal multiple-TL layout;
FIG. 5 is an example of cutting and merging a table with a multiple-TL layout;
FIG. 6 is an example of cutting and merging a table with a multiple-TL layout;
FIG. 7 is an example of processing a table with a single (multi-level) TL layout;
FIG. 8 is a block diagram of a data processing apparatus for improving stability and usability of a web crawler according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
A table in a web page is defined by the <table> tag in HTML. Each row of the table is defined by a <tr> tag, which must be inside a <table></table> pair and cannot be used on its own. Each row is divided into cells; each cell is defined by a <td> tag, which must be nested inside <tr></tr>. The <th></th> tag, which must also be nested inside <tr>, defines a header cell and contains header information.
[Figure BDA0001196708360000041: HTML source code of an example table]
The table that this code displays in the web page is as follows:
Name        Age
Zhang San   40
The div tag in HTML defines a division or section of a document. The <div> tag can divide the document into separate, distinct parts.
The embodiment provides a data processing method for improving stability and availability of a web crawler, as shown in fig. 1, including:
Step S1, judging whether the current page has a local structural change according to the pre-specified features.
A structural change means that the structural layout of the page has changed; for example, a tag has disappeared, an attribute of a tag has changed, or the number of rows or columns of a table has changed.
Step S2, if no structural change occurs, obtaining the structural layout of the current page, and analyzing the content in the current page according to the structural layout of the current page.
Step S3, according to the pre-configured mapping rule, the service field name obtained by parsing is mapped adaptively and stored in the storage area.
The service field name is the title of each item of service data, such as "execution court" and "execution case number" in fig. 2. Adaptive mapping means replacing the service field names obtained by parsing with preset standard fields, so that field names are unified across the extracted data, which facilitates subsequent data management and statistics. For example, "enterprise name" and "organization name" are both automatically mapped to "company name" in the storage layer.
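The adaptive mapping can be illustrated with a minimal sketch. It assumes the pre-configured mapping rule is a plain dictionary from source field names to standard field names; the patent does not specify a representation, so `FIELD_MAP` and `map_fields` are hypothetical names.

```python
# Hypothetical mapping rule: source field name -> standard storage field name.
FIELD_MAP = {
    "enterprise name": "company name",
    "organization name": "company name",
}


def map_fields(record, field_map=FIELD_MAP):
    """Rename known field names to their standard form; keep unknown ones."""
    return {field_map.get(name, name): value for name, value in record.items()}


record = {"enterprise name": "ACME", "address": "Shenzhen"}
print(map_fields(record))  # {'company name': 'ACME', 'address': 'Shenzhen'}
```

Field names that have no rule pass through unchanged here; a real system might instead log them for review.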
The data processing method for improving the stability and the usability of the web crawler can automatically identify the non-structural change of the web page, adopts the self-adaptive data extraction logic, does not need frequent maintenance, saves the cost, improves the stability of web data crawling, and has better universality.
Step S1 specifically includes: comparing the pre-specified features one by one with the corresponding tags of the current page, and if they are inconsistent, determining that the current page has a local structural change.
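A minimal sketch of this comparison follows. It assumes the pre-specified features are stored as (tag, attribute, expected value) triples, which is an illustrative choice and not something the patent prescribes; `TagCollector` and `has_structural_change` are hypothetical names built on Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects (tag, attribute-dict) pairs for every start tag in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def has_structural_change(html, features):
    """Return True if any expected (tag, attr, value) feature is missing.

    `features` uses a hypothetical format: triples recorded when the
    crawler was configured for this page.
    """
    collector = TagCollector()
    collector.feed(html)
    for tag, attr, value in features:
        found = any(t == tag and a.get(attr) == value for t, a in collector.tags)
        if not found:
            return True  # one feature no longer matches: local structural change
    return False


features = [("table", "id", "result"), ("div", "class", "summary")]
page_ok = '<div class="summary"></div><table id="result"></table>'
page_changed = '<div class="summary"></div><table id="results-v2"></table>'
print(has_structural_change(page_ok, features))       # False
print(has_structural_change(page_changed, features))  # True
```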
HTML may include various types of tags, such as the Table tag and the div tag, and different tags require different extraction methods. To handle HTML that mixes them, step S2 specifically includes:
Step S21, acquiring the HTML file of the current page.
Step S22, extracting the content in the Table tag and the content in the div tag from the HTML file.
Step S23, obtaining the structural layout of the current page according to the content in the Table tag, and analyzing the content according to the structural layout of the current page.
Step S24, obtaining the structural layout of the current page according to the content in the div tag, and parsing the content according to the structural layout of the current page.
Tables on a web page are written with the HTML Table tag, and most of this information is semi-structured data. Although the table looks regular when displayed on the page, the underlying tags and data are irregular or even chaotic, so the title part is mixed with the business data and the business data cannot be extracted quickly and accurately. To address this, step S23 specifically includes:
Step S231, detecting the header portion in the Table tag.
As shown in fig. 2, the header portion of a table is usually a large merged cell occupying one or more rows. The table may also include a remark portion whose structure is similar to that of the header portion. Everything other than the header portion and the remark portion is the service data to be extracted. When a remark portion is present in the table, it needs to be detected in step S231 in the same manner as the header portion.
Step S232, extracting the multi-dimensional information of the part of the Table tag that remains after the header portion is removed.
The multi-dimensional information includes: direct content, th/td distribution, class attribute distribution, background-color attribute distribution, and so on. The direct content is the content displayed directly in the table in the web page, that is, the text content inside the <table> tag, such as "name", "age", "Zhang San", "40". The th/td distribution is the distribution of the positions of the th and td tags in the table. The class attribute specifies the class name of the element in a cell, and the class attribute distribution is the distribution of the positions of class attributes in the table. The background-color attribute specifies the background color of a cell, and the background-color attribute distribution is the distribution of the positions of background-color attributes in the table.
Step S233, determining a structural layout according to the extracted multidimensional information.
Common table layouts are horizontal single TL, horizontal multiple TL, vertical single TL, vertical multiple TL, and multi-table combinations. TL (Title Line) is the column header (or data header) part, which may physically span several rows but is logically one area, and holds the title of each item of service data; the first row of the service data part in fig. 2 is a TL. A TL can be horizontal or vertical: fig. 3 shows a vertical multiple-TL layout, and fig. 4 shows a horizontal multiple-TL layout.
Step S234, acquiring the service data according to the structure layout.
Step S23 provides a method for adaptively extracting structured information from an HTML Table tag. The method first detects the header portion in the Table tag and excludes content that does not belong to the business data part, so that useless data is not mixed in. It then extracts multi-dimensional information from the remaining part of the Table tag and judges the structural layout of the table comprehensively from that information. Because the information in the Table tag reflects the table layout, the new layout can be obtained by analyzing that information no matter how the table in the web page changes.
The title part and the remark part are generally in the first or second row of the table and each is a merged cell, so a specific implementation of step S231 includes: detecting, row by row, whether each row in the Table tag is a merged cell; if so, the row belongs to the title part and the next row is examined; if not, the business data begins at this row and detection of the header portion stops. For example, the code for the title and remark parts generally has the following form:
<tr><td colspan='5'>2016 demographic table</td></tr>
This code contains only one <td> tag, and colspan='5' indicates that it is a merged cell. By detecting <td> and colspan, the title part and the remark part can be identified.
In the prior art, when filtering useless data (such as a header part and a remark part), the positions of the useless data need to be known in advance, and then the positions are specified in a program so as to skip the previous rows of the useless data. The method of the embodiment has universality, and no matter how many lines of the title part and the remark part exist in the table, the lines of the title part and the remark part can be accurately and efficiently detected, so that the business data can be accurately extracted.
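The row-by-row header detection of step S231 can be sketched as follows. This is an illustration under simplifying assumptions: the table is flat (no nested tables), the title and remark rows lead the table, and a "merged cell row" means a row with exactly one cell whose colspan is greater than 1. `RowParser` and `count_header_rows` are hypothetical names.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Parses a flat (non-nested) table into rows of [text, colspan] cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            colspan = int(dict(attrs).get("colspan") or 1)
            self.rows[-1].append(["", colspan])
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1][0] += data.strip()


def count_header_rows(rows):
    """Count leading rows that consist of one merged cell (title or remark)."""
    count = 0
    for row in rows:
        if len(row) == 1 and row[0][1] > 1:
            count += 1
        else:
            break  # first ordinary row: business data starts here
    return count


html = """
<table>
  <tr><td colspan='5'>2016 demographic table</td></tr>
  <tr><td colspan='5'>Remark: unit is persons</td></tr>
  <tr><td>No.</td><td>Name</td><td>Age</td><td>Sex</td><td>City</td></tr>
  <tr><td>1</td><td>Zhang San</td><td>40</td><td>M</td><td>Shenzhen</td></tr>
</table>
"""
parser = RowParser()
parser.feed(html)
n = count_header_rows(parser.rows)
print(n)                     # 2
print(parser.rows[n][1][0])  # Name
```

Because the loop stops at the first ordinary row, no row count has to be configured in advance, which is the point the paragraph above makes against the prior art.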
During data extraction, besides directly acquiring the information of ordinary cells, merged cells need special processing so that the extracted data meets the storage format and is easy to process later. Therefore, a preferred implementation of step S232 includes: extracting the multi-dimensional information of the part of the Table tag that remains after the header portion (and the remark portion, if present) is removed; splitting the merged cells in the extracted information; storing the information of each dimension as a separate two-dimensional array; and specially marking the cells produced by splitting.
Merged cells are divided into horizontal merges (colspan), vertical merges (rowspan), and mixed merges (colspan + rowspan). For example, extracting the direct content from <td colspan='5' bgcolor='#F7FBFE'>ABC</td> gives:
ABC {←} {←} {←} {←}
Here the special mark "{←}" is used only when extracting direct content and indicates that the content of the cell is the same as that of the cell to its left. This leaves flexibility for TL processing and for the final content output. The other dimensions need no special mark.
Extracting the background-color attribute distribution gives:
#F7FBFE #F7FBFE #F7FBFE #F7FBFE #F7FBFE
When a single row contains several horizontal merges (colspan), the column coordinates of the later cells shift and must be translated accordingly. For example, extracting the direct content from <td colspan='2'>ABC</td><td colspan='3'>DEF</td> gives:
ABC {←} DEF {←} {←}
Similar methods are also used for data extraction for longitudinal merging (rowspan) and mixed merging (colspan + rowspan).
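The splitting of merged cells, including the coordinate translation, can be sketched as below. The input format (each cell as a `(text, colspan, rowspan)` tuple) and the function name `expand` are assumptions made for illustration. The "{←}" mark comes from the text above; the "{↑}" mark for vertical merges is hypothetical, since the source does not name one.

```python
LEFT = "{←}"  # same content as the cell to the left (mark used in the text)
UP = "{↑}"    # hypothetical mark for vertical merges; not named in the source


def expand(rows):
    """Expand merged cells into a rectangular grid of direct content.

    `rows` is a list of rows; each cell is a (text, colspan, rowspan) tuple.
    """
    grid = {}  # (row, col) -> text or mark
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:  # skip slots already filled by a rowspan above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    if dr == 0 and dc == 0:
                        mark = text
                    elif dc > 0:
                        mark = LEFT
                    else:
                        mark = UP
                    grid[(r + dr, c + dc)] = mark
            c += colspan
    height = max(r for r, _ in grid) + 1
    width = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(width)] for r in range(height)]


rows = [
    [("ABC", 2, 1), ("DEF", 3, 1)],
    [("G", 1, 2), ("h", 1, 1), ("i", 1, 1), ("j", 1, 1), ("k", 1, 1)],
    [("l", 1, 1), ("m", 1, 1), ("n", 1, 1), ("o", 1, 1)],
]
for line in expand(rows):
    print(" ".join(line))
# ABC {←} DEF {←} {←}
# G h i j k
# {↑} l m n o
```

Tracking the column cursor `c` separately from the cell index is what handles the coordinate shift: "DEF" is the second cell in its row but lands in the third column.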
Only if the table layout is known, the business data can be accurately extracted, and the table is converted into the structured data according to the table layout. The judgment of the table layout in step S233 includes the following operations:
(1) Rows and columns that are not TL are excluded according to the extracted direct content.
The exclusion judgment is based on the data type, length, and keywords of the direct content, under the following conditions: the length of the field name in each TL cell cannot exceed a threshold (for example, 50); the number of field names in a TL cannot exceed a threshold (for example, 1000); a field name cannot be a pure numeric string; and common field names contain keywords such as "name", "address", "type", and "remark". A keyword library is built from statistics over common tables, and each row or column is checked for keywords from that library.
The table layout judgment based on direct content therefore examines the extracted direct content row by row and column by column: if the data type of the direct content is a numeric string, its row or column is not a TL; if the field length of the direct content exceeds a first threshold, its row or column is not a TL; and if several items of direct content in a row or column contain given keywords, that row or column is a TL.
When the keyword-based judgment is used, at least two keywords must appear before the row or column is identified as a TL, to ensure the reliability of the judgment.
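These rules can be combined into a single check, sketched below. The thresholds are the example values from the text; the four-word `KEYWORDS` set is a tiny stand-in for the statistically built keyword library, and `is_title_line` is a hypothetical name.

```python
MAX_NAME_LEN = 50   # per-cell length threshold (example value from the text)
MAX_FIELDS = 1000   # field-count threshold (example value from the text)
KEYWORDS = {"name", "address", "type", "remark"}  # stand-in keyword library


def is_title_line(cells, min_hits=2):
    """Apply the exclusion rules, then require at least `min_hits` keyword hits."""
    values = [c.strip() for c in cells if c.strip()]
    if not values or len(values) > MAX_FIELDS:
        return False
    if any(v.isdigit() or len(v) > MAX_NAME_LEN for v in values):
        return False  # numeric or over-long content cannot be a field name
    hits = sum(1 for v in values if any(k in v.lower() for k in KEYWORDS))
    return hits >= min_hits


print(is_title_line(["No.", "Name", "Address", "Remark"]))  # True
print(is_title_line(["1", "Zhang San", "Shenzhen", ""]))    # False
```

This sketch treats the keyword test as mandatory, which is stricter than the text requires; in practice the result would be weighed together with the other dimensions described below.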
(2) The table layout is judged according to the extracted background-color attribute distribution.
When a table is displayed, to make it easier to read, the background color of the TL may differ from that of the data, or the odd and even data rows may use alternating background colors. The background-color attribute distribution can therefore be used to determine which rows or columns may be TL, and thus whether the table layout is horizontal or vertical.
(3) The table layout is judged according to the extracted class attribute distribution.
Cells with the same class attribute are typically cells of the same kind. If the cells of every row share the same class attribute, the table layout is horizontal; if the cells of every column share the same class attribute, the table layout is vertical. Whether the layout is horizontal or vertical can therefore be judged from the class attribute distribution.
(4) The table layout is judged according to whether the data types of the direct content in the same row or the same column are the same.
Apart from the TL itself, the cells under each TL field name should have the same data type as long as they are not empty. (The method distinguishes only three types: 'pure numeric string', 'date-time string', and 'string with no obvious features'.) For example, the table in fig. 2 has a horizontal layout in which, apart from the TL in the first row, the cells in each column have the same data type: the column under the field name "serial number" holds pure numeric strings, the column under "execution court" holds strings with no obvious features, and the column under "execution case number" holds strings with no obvious features.
Based on this property, each row is checked for a uniform data type; if every row of the table is uniform (that is, all cells in a given row are pure numeric strings, or all are date-time strings, or all are strings with no obvious features), the table has a vertical layout. Likewise, each column is checked; if every column of the table is uniform, the table has a horizontal layout.
To keep empty cells from affecting the detection result, cells with empty content are excluded when rows and columns are checked.
The business data part of a table is generally large, and checking every row and column would reduce efficiency, so short-circuit judgment can be used: as soon as a newly checked row rules out a layout, the remaining checks for that layout are skipped.
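The data-type judgment with short-circuiting can be sketched as follows. The date pattern in `DATE_RE` (ISO-like `YYYY-MM-DD` with an optional time) is an assumption, as the patent does not say which date formats are recognized; `cell_type`, `uniform`, and `guess_layout` are hypothetical names.

```python
import re

# Assumed date-time format for illustration: 2016-01-05 or 2016-01-05 10:30[:00]
DATE_RE = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}([ T]\d{1,2}:\d{2}(:\d{2})?)?$")


def cell_type(text):
    """Classify a cell into the three coarse types named in the text."""
    text = text.strip()
    if text.isdigit():
        return "numeric"
    if DATE_RE.match(text):
        return "datetime"
    return "plain"


def uniform(lines):
    """True if every line holds a single type; stops at the first mismatch."""
    for line in lines:
        types = {cell_type(c) for c in line if c.strip()}  # empty cells skipped
        if len(types) > 1:
            return False  # short circuit: this layout is already ruled out
    return True


def guess_layout(data):
    """`data` is the grid of direct content with the TL already removed."""
    if uniform(zip(*data)):
        return "horizontal"  # each column holds one field
    if uniform(data):
        return "vertical"    # each row holds one field
    return "unknown"


data = [
    ["1", "Court A", "2016-01-05"],
    ["2", "Court B", ""],
    ["3", "Court C", "2016-03-09"],
]
transposed = [list(col) for col in zip(*data)]
print(guess_layout(data))        # horizontal
print(guess_layout(transposed))  # vertical
```

A table whose cells are all of one type satisfies both tests, so this signal alone cannot decide such cases; the sketch returns "horizontal" first only as an arbitrary tie-break.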
(5) The table layout is judged according to the th/td distribution.
The number of cells in a TL is less than or equal to the number of cells in other rows, and the non-TL rows should have a uniform number of cells. The number of cells in every row and column is counted from the th/td distribution; a row or column with clearly fewer cells may be a TL, and the table layout is obtained from the number of TLs found and whether they are horizontal or vertical.
th is generally used to define a title and corresponds to field names such as 'name' and 'age'. The use of th also differs between horizontal and vertical layouts: for example, a horizontal layout uses <th colspan='3'>achievement list</th>, and a vertical layout uses <th rowspan='3'>achievement list</th>.
td can define an ordinary cell and can also define a title.
When th tags and td tags are both present, the table layout is judged from the th distribution. Many tables do not specify th, however, in which case the table layout is judged from the td distribution.
In addition, the method of this embodiment can identify tables that contain multiple TLs, which improves the reliability of the extracted data.
When the table layout is vertical, the table formed by the direct content also needs to be transposed into a horizontal layout.
TL is divided into single-level TL and multi-level TL, both referred to simply as TL unless otherwise specified. The table in fig. 2 has only one TL, and it is single-level. The table in fig. 7 has only one TL, and it is multi-level, that is, composed of several rows with parent-child relationships; these rows need to be combined into a single row of field names for output. In fig. 7 the TL of the original table has two parts: the left part spans several rows (multi-level) and the right part is a single row. The first level of the multi-level part is a merged cell with the field name 'basic information', and the second level holds 'name', 'age', and 'gender'. The final output is a single-level TL with the structure "basic information_name", "basic information_age", "basic information_gender", "other field A", "other field B".
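The flattening of a multi-level TL into one row of field names can be sketched as below. It assumes the TL rows have already been expanded into a rectangular grid in which "{←}" marks a horizontally merged slot and "{↑}" marks a vertically merged slot; the "{↑}" mark and the name `flatten_title_lines` are hypothetical.

```python
LEFT, UP = "{←}", "{↑}"  # "{↑}" is a hypothetical mark for vertical merges


def flatten_title_lines(levels):
    """Join multi-level TL rows into one row of field names, joined by '_'."""
    width = len(levels[0])
    names = []
    for col in range(width):
        parts = []
        for row in levels:
            text, c = row[col], col
            while text == LEFT and c > 0:  # resolve a horizontal merge mark
                c -= 1
                text = row[c]
            if text and text not in (LEFT, UP) and text not in parts:
                parts.append(text)
        names.append("_".join(parts))
    return names


levels = [
    ["basic information", LEFT, LEFT, "other field A", "other field B"],
    ["name", "age", "gender", UP, UP],
]
print(flatten_title_lines(levels))
# ['basic information_name', 'basic information_age',
#  'basic information_gender', 'other field A', 'other field B']
```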
When the table layout has multiple TLs, the table formed by the direct content needs to be cut and merged into a single-TL layout to meet the format requirements of structured data. The cut-and-merge operation compares the direct content of the TLs: TLs with the same content are reduced to a single TL row, as shown in fig. 5, and TLs with different content are spliced into one TL row, as shown in fig. 6.
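One possible reading of the cut-and-merge operation is sketched below. Since figs. 5 and 6 are not reproduced here, the details are assumptions: the table is taken to be already cut into (title line, data rows) blocks, blocks with an identical TL are stacked under one copy of it, and blocks with different TLs are spliced side by side. `merge_blocks` is a hypothetical name.

```python
def merge_blocks(blocks):
    """Merge (title_line, data_rows) blocks into a single-TL table.

    Blocks with an identical title line are stacked under one copy of it;
    blocks with different title lines are spliced side by side.
    """
    merged = []  # entries of [title_line, data_rows] with distinct title lines
    for title, rows in blocks:
        for entry in merged:
            if entry[0] == list(title):
                entry[1].extend(list(r) for r in rows)  # same TL: keep one copy
                break
        else:
            merged.append([list(title), [list(r) for r in rows]])
    title_line = [name for title, _ in merged for name in title]
    height = max(len(rows) for _, rows in merged)
    data = []
    for i in range(height):
        line = []
        for title, rows in merged:
            # pad with empty cells when one block has fewer data rows
            line.extend(rows[i] if i < len(rows) else [""] * len(title))
        data.append(line)
    return title_line, data


same = [(["name", "age"], [["Zhang San", "40"]]),
        (["name", "age"], [["Li Si", "35"]])]
different = [(["name", "age"], [["Zhang San", "40"]]),
             (["city", "sex"], [["Shenzhen", "M"]])]
print(merge_blocks(same))
# (['name', 'age'], [['Zhang San', '40'], ['Li Si', '35']])
print(merge_blocks(different))
# (['name', 'age', 'city', 'sex'], [['Zhang San', '40', 'Shenzhen', 'M']])
```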
Finally, for the merged cells, the special mark can be corrected according to the service requirement. For example
ABC {←} {←} {←} {←}
can be adjusted to the following format:
ABC ABC ABC ABC ABC
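A minimal sketch of this adjustment, assuming a row is a list of strings and that the business requirement is to copy the merged value into every split cell; `fill_left` is a hypothetical helper name.

```python
def fill_left(row, mark="{←}"):
    """Replace each special mark with the content of the cell to its left."""
    out = []
    for cell in row:
        out.append(out[-1] if cell == mark and out else cell)
    return out


print(fill_left(["ABC", "{←}", "{←}", "{←}", "{←}"]))
# ['ABC', 'ABC', 'ABC', 'ABC', 'ABC']
```

A different requirement, such as leaving the split cells blank, would only change the value appended for a marked cell.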
The above method for adaptively extracting structured information from an HTML Table tag handles a single table. When a web page contains several Table tags (several tables), the method is simply applied repeatedly: the table corresponding to each Table tag is extracted, and the extraction results are then combined according to a preset rule.
For data extraction from a div layout, step S24 specifically includes: acquiring, from the div tag, the labels that match known service field names; judging the structural layout according to the positions of those labels in the div tag; and acquiring service data according to the structural layout.
A label is a field name inside a div tag, such as "name", "age", and "sex" in Example 1 and Example 2 below, both of which are tables laid out with div. The labels are extracted from the div tag according to the known field names. In Example 1, each extracted label sits at the left of its own row with the value to its right, so the structural layout can be determined to be a left-right key-value layout (vertical layout). In Example 2, the extracted labels all sit in one row, so the structural layout can be determined to be an upper-lower layout (horizontal layout).
Example 1
<div><div>name</div><div>Zhang San</div></div>
<div><div>age</div><div>18</div></div>
<div><div>sex</div><div>male</div></div>
Example 2
<div><div>name</div><div>age</div><div>sex</div></div>
<div><div>Zhang San</div><div>18</div><div>male</div></div>
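The div layout judgment can be sketched for the two examples above. The sketch assumes exactly two levels of nesting (an outer div per row and an inner div per cell) and an exact match against a set of known field names; `DivRows` and `div_layout` are hypothetical names built on Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class DivRows(HTMLParser):
    """Collects the text of inner <div> cells, grouped by outer <div> row."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            if self.depth == 1:
                self.rows.append([])
            elif self.depth == 2:
                self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 2:
            self.rows[-1][-1] += data.strip()


def div_layout(html, known_fields):
    """Judge the layout from where the known field names (labels) appear."""
    parser = DivRows()
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    if rows and all(r[0] in known_fields for r in rows):
        return "vertical"    # label at the left of each row, value to its right
    if any(all(c in known_fields for c in r) for r in rows):
        return "horizontal"  # one row holds all labels, values sit below
    return "unknown"


known = {"name", "age", "sex"}
example1 = ("<div><div>name</div><div>Zhang San</div></div>"
            "<div><div>age</div><div>18</div></div>"
            "<div><div>sex</div><div>male</div></div>")
example2 = ("<div><div>name</div><div>age</div><div>sex</div></div>"
            "<div><div>Zhang San</div><div>18</div><div>male</div></div>")
print(div_layout(example1, known))  # vertical
print(div_layout(example2, known))  # horizontal
```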
Based on the same inventive concept as the above data processing method for improving the stability and the availability of the web crawler, the present embodiment further provides a data processing apparatus for improving the stability and the availability of the web crawler, as shown in FIG. 8, including: a structural change detection module, used for judging whether the current page has a local structural change according to pre-specified features; an analysis module, used for acquiring the structural layout of the current page if no structural change has occurred, and analyzing the content in the current page according to the structural layout of the current page; and a field self-adaptive adjusting module, used for adaptively mapping the service field names acquired by analysis according to a preset mapping rule and storing them into a storage area.
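The work of the field self-adaptive adjusting module can be illustrated with a minimal Python sketch, in which the preset mapping rule is a lookup table from observed field names to standard field names, as the claims describe the mapping. The table contents, the name normalization (trimming and lower-casing) and the function name map_fields are assumptions for illustration only.

```python
# Hypothetical preset mapping rule: observed field name -> standard field name.
STANDARD_FIELDS = {
    "full name": "name",
    "name": "name",
    "gender": "sex",
    "sex": "sex",
    "age": "age",
}


def map_fields(record, mapping=STANDARD_FIELDS):
    """Replace parsed service field names with the preset standard fields.

    Field names without a mapping entry are kept unchanged.
    """
    return {mapping.get(key.strip().lower(), key): value
            for key, value in record.items()}
```

The mapped record can then be written to whatever storage area the crawler uses; storage is left out of the sketch.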
The data processing method for improving the stability and the usability of the web crawler can automatically identify non-structural changes of a web page and adopts self-adaptive data extraction logic, so that it does not need frequent maintenance, saves cost, improves the stability of web data crawling, and has better universality.
Further, the structural change detection module is specifically configured to: compare the pre-specified features with the corresponding tags of the current page one by one, and, if any feature is inconsistent with the corresponding tag of the current page, determine that the current page has a local structural change.
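The one-by-one comparison can be sketched in Python as follows. The form of a pre-specified feature is not defined in this passage, so the sketch assumes, for illustration only, that each feature pairs an element id with the tag name expected to carry it; the names IdTagIndex and has_local_structural_change are hypothetical.

```python
from html.parser import HTMLParser


class IdTagIndex(HTMLParser):
    """Record which tag carries each id attribute in the current page."""

    def __init__(self):
        super().__init__()
        self.tag_by_id = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id is not None:
            self.tag_by_id[element_id] = tag


def has_local_structural_change(html, features):
    """Compare each feature with the corresponding tag of the page.

    features: dict mapping an element id to the expected tag name (assumed form).
    Returns True as soon as any feature is missing or inconsistent.
    """
    index = IdTagIndex()
    index.feed(html)
    index.close()
    return any(index.tag_by_id.get(element_id) != expected_tag
               for element_id, expected_tag in features.items())
```

A page for which the function returns False would proceed to layout acquisition and parsing; other feature types, such as tag paths or class names, could be compared in the same way.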
Further, the parsing module is specifically configured to: obtain the HTML file of the current page; extract the content in the Table tags and the content in the div tags from the HTML file; obtain the structural layout of the current page according to the content in the Table tags and parse the content according to the structural layout of the current page; and obtain the structural layout of the current page according to the content in the div tags and parse the content according to the structural layout of the current page.
Further, in the parsing module, obtaining the structural layout of the current page according to the content in the Table tag, and parsing the content according to the structural layout of the current page includes: detecting a header portion in a Table tag; extracting multi-dimensional information of the header-removed part in the Table label; judging the structural layout according to the extracted multidimensional information; and acquiring the service data according to the structural layout.
Further, in the parsing module, obtaining the structural layout of the current page according to the content in the div tag, and parsing the content according to the structural layout of the current page includes: acquiring the labels matched with the known service field names from the div tag, judging the structural layout according to the positions of the matched labels in the div tag, and acquiring service data according to the structural layout.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (6)

1. A data processing method for improving web crawler stability and availability, comprising:
step S1, judging whether the current page has local structural change according to the pre-designated characteristics;
step S2, if no structural change occurs, acquiring the structural layout of the current page, and analyzing the content in the current page according to the structural layout of the current page;
step S3, according to the preset mapping rule, the business field name obtained by analysis is mapped in a self-adapting way and stored in the storage area;
the self-adaptive mapping is to replace the service field names obtained by analysis with preset standard fields so as to uniformly extract the service field names of the data;
the step S2 includes:
acquiring an HTML file of the current page;
extracting the content in a Table label and the content in a div label from the HTML file;
acquiring the structural layout of the current page according to the content in the Table tag, and analyzing the content according to the structural layout of the current page;
acquiring the structural layout of the current page according to the content in the div tag, and analyzing the content according to the structural layout of the current page;
the obtaining the structural layout of the current page according to the content in the Table tag and analyzing the content according to the structural layout of the current page includes:
detecting a header portion in the Table tag;
extracting multi-dimensional information of the header-removed part in the Table label;
judging the structural layout according to the extracted multi-dimensional information;
and acquiring service data according to the structural layout.
2. The method according to claim 1, wherein the step S1 includes: comparing the characteristics specified in advance with the corresponding labels of the current page one by one, and if the characteristics are inconsistent with the corresponding labels of the current page, determining that the current page has local structural change.
3. The method of claim 1, wherein the obtaining the structural layout of the current page according to the content in the div tag and parsing the content according to the structural layout of the current page comprises: acquiring the label matched with the known service field name from the div label, judging the structural layout according to the position of the matched label in the div label, and acquiring service data according to the structural layout.
4. A data processing apparatus for improving web crawler stability, availability, comprising:
the structural change detection module is used for judging whether the current page has local structural change according to the characteristics specified in advance;
the analysis module is used for acquiring the structural layout of the current page if structural change does not occur, and analyzing the content in the current page according to the structural layout of the current page;
the field self-adaptive adjusting module is used for carrying out self-adaptive mapping on the service field names acquired through analysis according to a preset mapping rule and storing the service field names into a storage area;
the self-adaptive mapping is to replace the service field names obtained by analysis with preset standard fields so as to uniformly extract the service field names of the data;
the analysis module is specifically configured to:
acquiring an HTML file of the current page;
extracting the content in a Table label and the content in a div label from the HTML file;
acquiring the structural layout of the current page according to the content in the Table tag, and analyzing the content according to the structural layout of the current page;
acquiring the structural layout of the current page according to the content in the div tag, and analyzing the content according to the structural layout of the current page;
in the analyzing module, acquiring the structural layout of the current page according to the content in the Table tag, and analyzing the content according to the structural layout of the current page includes:
detecting a header portion in the Table tag;
extracting multi-dimensional information of the header-removed part in the Table label;
judging the structural layout according to the extracted multi-dimensional information;
and acquiring service data according to the structural layout.
5. The apparatus of claim 4, wherein the structural change detection module is specifically configured to: compare the characteristics specified in advance with the corresponding labels of the current page one by one, and if the characteristics are inconsistent with the corresponding labels of the current page, determine that the current page has local structural change.
6. The apparatus of claim 4, wherein, in the parsing module, obtaining the structural layout of the current page according to the content in the div tag and parsing the content according to the structural layout of the current page includes: acquiring the label matched with the known service field name from the div label, judging the structural layout according to the position of the matched label in the div label, and acquiring service data according to the structural layout.
CN201611243842.5A 2016-12-29 2016-12-29 Data processing method and device for improving stability and usability of web crawler Active CN106777281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611243842.5A CN106777281B (en) 2016-12-29 2016-12-29 Data processing method and device for improving stability and usability of web crawler

Publications (2)

Publication Number Publication Date
CN106777281A CN106777281A (en) 2017-05-31
CN106777281B CN106777281B (en) 2020-07-17

Family

ID=58928579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611243842.5A Active CN106777281B (en) 2016-12-29 2016-12-29 Data processing method and device for improving stability and usability of web crawler

Country Status (1)

Country Link
CN (1) CN106777281B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463669B (en) * 2017-08-03 2020-05-05 深圳市华傲数据技术有限公司 Method and device for analyzing webpage data crawled by crawler
CN108647279A (en) * 2018-05-03 2018-10-12 山东浪潮通软信息科技有限公司 Sheet disposal method, apparatus, medium and storage control based on field multiplexing
CN109657125A (en) * 2018-12-14 2019-04-19 平安城市建设科技(深圳)有限公司 Data processing method, device, equipment and storage medium based on web crawlers
CN109948018B (en) * 2019-01-10 2021-05-25 北京大学 Method and system for rapidly extracting Web structured data

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6938170B1 (en) * 2000-07-17 2005-08-30 International Business Machines Corporation System and method for preventing automated crawler access to web-based data sources using a dynamic data transcoding scheme
CN101576891A (en) * 2008-05-05 2009-11-11 北京瑞佳晨科技有限公司 Method for analyzing web page form object nodes
CN102254009B (en) * 2011-07-15 2013-05-01 福建星网锐捷通讯股份有限公司 Method for extracting data of webpage table
CN103198069A (en) * 2012-01-06 2013-07-10 株式会社理光 Method and device for extracting relational table
CN104598462B (en) * 2013-10-30 2018-08-07 深圳市国信互联科技有限公司 Extract the method and device of structural data
CN103942335B (en) * 2014-05-07 2017-04-26 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN104767757B (en) * 2015-04-17 2018-01-23 国家电网公司 Various dimensions safety monitoring method and system based on WEB service
CN105975395A (en) * 2016-05-30 2016-09-28 深圳市华傲数据技术有限公司 Website state reconnaissance method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong

Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: 518000 units J and K, 12 / F, block B, building 7, Baoneng Science Park, Qinghu Industrial Zone, Qingxiang Road, Longhua New District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.
