CN106777281A

CN106777281A - Data processing method and device for improving the stability and availability of web crawlers

Info

Publication number: CN106777281A
Application number: CN201611243842.5A
Authority: CN
Inventors: 张军; 贾西贝
Original assignee: Shenzhen Huaao Data Technology Co Ltd
Current assignee: Shenzhen Huaao Data Technology Co Ltd
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2017-05-31
Anticipated expiration: 2036-12-29
Also published as: CN106777281B

Abstract

The present invention relates to a data processing method and device for improving the stability and availability of web crawlers. The method provided by the present invention includes: step S1, judging, according to pre-specified features, whether a local structural change has occurred in the current page; step S2, if no structural change has occurred, obtaining the structural layout of the current page and parsing the content of the current page according to that structural layout; step S3, adaptively mapping, according to pre-configured mapping rules, the business field names obtained by parsing, and storing them in a storage area. The data processing method and device provided by the present invention can automatically identify non-structural changes in web pages and use adaptive data extraction logic, so frequent maintenance is not required.

Description

Data processing method and device for improving the stability and availability of web crawlers

Technical field

The present invention relates to the technical field of data processing, and in particular to a data processing method and device for improving the stability and availability of web crawlers.

Background art

With the popularization and development of the Internet, all kinds of information from e-commerce websites, portals, blogs, microblogs, and the like is published online, and people can collect massive amounts of information through the Internet and analyze and aggregate it to obtain the information they need.

The existing approach is to obtain information with web crawler technology. Setting aside binary content such as pictures and videos, what a web crawler typically obtains is the text content of web pages, and traditional crawlers parse this information using regular expressions, XPath, or positions.

The problem is that web pages change dynamically: for example, the position of a business field name or field value, the id of an HTML tag, or an XPath path may change at any time. The dynamic nature of web pages means that web crawlers require frequent maintenance; existing web crawlers therefore have poor generality and very high maintenance costs.

Summary of the invention

In view of the defects in the prior art, the present invention provides a data processing method and device for improving the stability and availability of web crawlers, which can automatically identify non-structural changes in web pages and use adaptive data extraction logic, without frequent maintenance.

In a first aspect, the present invention provides a data processing method for improving the stability and availability of web crawlers, including: step S1, judging, according to pre-specified features, whether a local structural change has occurred in the current page; step S2, if no structural change has occurred, obtaining the structural layout of the current page and parsing the content of the current page according to that structural layout; step S3, adaptively mapping, according to pre-configured mapping rules, the business field names obtained by parsing, and storing them in a storage area.

The data processing method provided by the present invention can automatically identify non-structural changes in web pages and use adaptive data extraction logic, without frequent maintenance. This saves cost, improves the stability of web data crawling, and gives better generality.

Preferably, step S1 includes: comparing the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, the current page is considered to have undergone a local structural change.

Preferably, step S2 includes: obtaining the HTML file of the current page; extracting the content in table tags and the content in div tags from the HTML file; obtaining the structural layout of the current page from the content in the table tags, and parsing the content according to the structural layout of the current page; obtaining the structural layout of the current page from the content in the div tags, and parsing the content according to the structural layout of the current page.

Preferably, obtaining the structural layout of the current page from the content in the table tags and parsing the content according to the structural layout of the current page includes: detecting the title part in the table tag; extracting the multi-dimensional information in the table tag other than the title part; judging the structural layout according to the extracted multi-dimensional information; and obtaining the business data according to the structural layout.

Preferably, obtaining the structural layout of the current page from the content in the div tags and parsing the content according to the structural layout of the current page includes: obtaining, from the div tags, the labels that match known business field names; judging the structural layout according to the positions of the matched labels within the div tags; and obtaining the business data according to the structural layout.

In a second aspect, the present invention provides a data processing device for improving the stability and availability of web crawlers, including: a structural change detection module, configured to judge, according to pre-specified features, whether a local structural change has occurred in the current page; a parsing module, configured to, if no structural change has occurred, obtain the structural layout of the current page and parse the content of the current page according to that structural layout; and a field adaptive adjustment module, configured to adaptively map, according to pre-configured mapping rules, the business field names obtained by parsing, and to store them in a storage area.

The data processing device provided by the present invention can automatically identify non-structural changes in web pages and use adaptive data extraction logic, without frequent maintenance. This saves cost, improves the stability of web data crawling, and gives better generality.

Preferably, the structural change detection module is specifically configured to: compare the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, the current page is considered to have undergone a local structural change.

Preferably, the parsing module is specifically configured to: obtain the HTML file of the current page; extract the content in table tags and the content in div tags from the HTML file; obtain the structural layout of the current page from the content in the table tags, and parse the content according to the structural layout of the current page; obtain the structural layout of the current page from the content in the div tags, and parse the content according to the structural layout of the current page.

Preferably, in the parsing module, obtaining the structural layout of the current page from the content in the table tags and parsing the content according to the structural layout of the current page includes: detecting the title part in the table tag; extracting the multi-dimensional information in the table tag other than the title part; judging the structural layout according to the extracted multi-dimensional information; and obtaining the business data according to the structural layout.

Preferably, in the parsing module, obtaining the structural layout of the current page from the content in the div tags and parsing the content according to the structural layout of the current page includes: obtaining, from the div tags, the labels that match known business field names; judging the structural layout according to the positions of the matched labels within the div tags; and obtaining the business data according to the structural layout.

Brief description of the drawings

Fig. 1 is a flowchart of the data processing method for improving the stability and availability of web crawlers provided by an embodiment of the present invention;

Fig. 2 shows the layout of the title part, remarks part, and business data part in an example table;

Fig. 3 is an example of a vertical multi-TL layout;

Fig. 4 is an example of a horizontal multi-TL layout;

Fig. 5 is an example of cutting and merging a table with a multi-TL layout;

Fig. 6 is another example of cutting and merging a table with a multi-TL layout;

Fig. 7 is an example of processing a table with a single-TL (multi-level) layout;

Fig. 8 is a structural block diagram of the data processing device for improving the stability and availability of web crawlers provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the technical solutions of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments are only used to illustrate the technical solutions of the present invention more clearly; they are therefore only examples and cannot be used to limit the scope of protection of the present invention.

It should be noted that, unless otherwise specified, the technical or scientific terms used in this application have the ordinary meanings understood by those skilled in the art to which the present invention belongs.

A table in a web page is defined by the HTML <table> tag. The rows of a table are defined by <tr> tags; a <tr> must be inside a <table></table> and cannot be used alone. Each row is divided into cells, each defined by a <td> tag, which must be nested in <tr></tr>. Like <td>, <th> must also be nested in <tr>; <th>...</th> defines a header cell, which contains table header information. Usage is, for example, as follows:

<table><tr><th>Name</th><th>Age</th></tr><tr><td>Zhang San</td><td>40</td></tr></table>

The above code is displayed in a web page as the following table:

Name	Age
Zhang San	40

The div tag in HTML defines a division or section of a document. The <div> tag divides a document into independent, distinct parts. It can serve as a strict organizational tool, and no formatting is associated with it.

This embodiment provides a data processing method for improving the stability and availability of web crawlers which, as shown in Fig. 1, includes:

Step S1: judging, according to pre-specified features, whether a local structural change has occurred in the current page.

Here, the pre-specified features are the structural layout of the web page, which appears in the HTML as the type, position, attributes, and so on of tags. A structural change means that the structural layout of the page has changed, for example: a tag has disappeared, an attribute of a tag has changed, or the number of rows or columns of a table has changed.

Step S2: if no structural change has occurred, obtaining the structural layout of the current page and parsing the content of the current page according to that structural layout.

Step S3: adaptively mapping, according to pre-configured mapping rules, the business field names obtained by parsing, and storing them in a storage area.

Here, a business field name is the title of an item of business data, such as "executing court" and "execution case number" in Fig. 2. Adaptive mapping means replacing the business field names obtained by parsing with predefined standard fields, so as to unify the business field names of the extracted data and facilitate subsequent management and statistics. For example, "enterprise name" and "organization name" are automatically mapped to the "company name" field of the storage area.
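The adaptive mapping of step S3 can be pictured as a lookup table from parsed field names to standard field names. The following is a minimal sketch only; the dictionary `FIELD_MAPPING`, the function `map_field_names`, and the sample record are hypothetical and are not taken from the patent.

```python
# Hypothetical pre-configured mapping rules: parsed field name -> standard field name.
FIELD_MAPPING = {
    "enterprise name": "company name",
    "organization name": "company name",
}


def map_field_names(record, mapping=FIELD_MAPPING):
    """Return a copy of `record` whose keys are replaced by standard field names.

    Field names without a mapping rule are kept unchanged.
    """
    return {mapping.get(name, name): value for name, value in record.items()}


if __name__ == "__main__":
    parsed = {"organization name": "Example Co.", "type": "limited"}
    print(map_field_names(parsed))
```

Keeping the rules in data rather than in code means a renamed field on the page only requires a new entry in the mapping, not a change to the crawler.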

The data processing method provided by this embodiment can automatically identify non-structural changes in web pages and use adaptive data extraction logic, without frequent maintenance. This saves cost, improves the stability of web data crawling, and gives better generality.

Step S1 specifically includes: comparing the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, the current page is considered to have undergone a local structural change.
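The one-by-one comparison of step S1 might look like the sketch below, which uses Python's standard `html.parser`. The feature format (an ordered list of `(tag, attributes)` pairs) and the names `TagCollector` and `has_structural_change` are assumptions made for illustration; the patent does not specify how the features are stored.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect (tag, attributes) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def has_structural_change(html, expected):
    """Compare the page's tags one by one with the pre-specified features.

    `expected` is a hypothetical feature format: a list of (tag, attrs) pairs.
    Any missing tag, extra tag, or changed attribute counts as a change.
    """
    collector = TagCollector()
    collector.feed(html)
    if len(collector.tags) != len(expected):
        return True
    return any(actual != wanted for actual, wanted in zip(collector.tags, expected))


if __name__ == "__main__":
    features = [("table", {"id": "t1"}), ("tr", {}), ("td", {})]
    print(has_structural_change('<table id="t1"><tr><td>x</td></tr></table>', features))
```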

An HTML document may contain several types of tags, such as table tags and div tags. Different tags require different extraction methods; to handle HTML of mixed types, step S2 specifically includes:

Step S21: obtaining the HTML file of the current page.

Step S22: extracting the content in table tags and the content in div tags from the HTML file.

Step S23: obtaining the structural layout of the current page from the content in the table tags, and parsing the content according to the structural layout of the current page.

Step S24: obtaining the structural layout of the current page from the content in the div tags, and parsing the content according to the structural layout of the current page.

Tables on web pages are authored with HTML table tags, and most of this information is semi-structured data: although it looks fairly regular on the page, the underlying tags and data are irregular or even chaotic, so the title part is mixed with the business data and the business data cannot be extracted quickly and accurately. To extract the data in web page tables automatically, quickly, and accurately, step S23 specifically includes:

Step S231: detecting the title part in the table tag.

As shown in Fig. 2, the title part of a table is usually one large merged cell, which may span one or more rows. A table may also contain a remarks part, whose structure is similar to that of the title part. Once the title part and remarks part are removed, what remains is the business data to be extracted. When the table has a remarks part, the remarks part must also be detected in step S231, in the same way as the title part.

Step S232: extracting the multi-dimensional information in the table tag other than the title part.

The multi-dimensional information includes: direct content, th/td distribution, class attribute distribution, background-color attribute distribution, and so on. Direct content is the content displayed directly in the table on the web page, that is, the text content inside the <table> tag, such as "Name", "Age", "Zhang San", and "40". The th/td distribution is the distribution of th and td tags within the table. The class attribute specifies the class name of the element in a cell, and the class attribute distribution is the distribution of class attributes within the table. The background-color attribute defines the background color of a cell, and the background-color attribute distribution is the distribution of background-color attributes within the table.

Step S233: judging the structural layout according to the extracted multi-dimensional information.

Common table layouts are divided into horizontal single-TL, horizontal multi-TL, vertical single-TL, vertical multi-TL, and multi-table combinations. A TL (TitleLine) is the header row (or data header part); physically it may span several rows, but logically it is one region. It gives the title of each item of business data; for example, the first row of the business data part in Fig. 2 is a TL. A TL may be horizontal or vertical: Fig. 3 shows a vertical multi-TL layout and Fig. 4 a horizontal multi-TL layout.

Step S234: obtaining the business data according to the structural layout.

Step S23 provides a method for adaptively extracting structured information from HTML table tags. It first detects the title part in the table tag and excludes content that does not belong to the business data part, which prevents useless data from being mixed in. It then extracts the multi-dimensional information in the table tag other than the title part and judges the structural layout of the table from this information as a whole. Because the information in the table tag reflects the table layout, the new layout can be obtained by analyzing that information however the table in the web page has changed. The method of step S23 therefore does not need to know the table layout in advance, and no program has to be rewritten for HTML tables of different structures. This addresses the lack of generality of existing table extraction algorithms and improves the reliability of the extracted data; it is particularly effective for identifying and extracting large volumes of semi-structured data.

The title part and the remarks part are generally in the first or second row of the table, and each is a merged cell. The concrete implementation of step S231 therefore includes: checking, row by row, whether each row in the table tag is a single merged cell; if so, the row belongs to the title part and the next row is checked; if not, the business data begins at this row and detection of the title part stops. For example, the code of a title part or remarks part usually has the following form:

<tr><td colspan='5'>2016 personnel information statistics table</td></tr>

The above code contains only one <td> tag, and colspan='5' shows that it is a merged cell. The title part and the remarks part can therefore be identified by detecting <td> together with colspan.
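The row-by-row check of step S231 can be sketched as follows. The row representation (a list of cell dictionaries with an optional `colspan` key) and the function name `split_title_rows` are assumptions for illustration; the sketch only scans leading rows and would need a second pass for a remarks part at the bottom of a table.

```python
def split_title_rows(rows):
    """Split a table into (title_rows, remaining_rows).

    A row is treated as title/remarks when it consists of exactly one cell
    whose colspan is greater than 1, i.e. a single merged cell.
    Scanning stops at the first row that does not match.
    """
    count = 0
    for row in rows:
        if len(row) == 1 and int(row[0].get("colspan", 1)) > 1:
            count += 1
        else:
            break
    return rows[:count], rows[count:]


if __name__ == "__main__":
    table = [
        [{"text": "2016 personnel information statistics table", "colspan": 5}],
        [{"text": "Name"}, {"text": "Age"}],
        [{"text": "Zhang San"}, {"text": "40"}],
    ]
    title, body = split_title_rows(table)
    print(len(title), len(body))
```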

In the prior art, filtering out useless data (such as the title part and remarks part) requires knowing its position in advance and then hard-coding in the program how many leading rows to skip. The method of this embodiment is more general: however many rows the title part and remarks part occupy, they can be detected accurately and efficiently, which ensures that the business data is extracted accurately.

During extraction, besides directly reading the information of ordinary cells, merged cells need special handling so that the extracted data matches the storage format and is convenient for subsequent processing. A preferred implementation of step S232 therefore includes: extracting the multi-dimensional information in the table tag other than the title part (and the remarks part, if any); splitting the merged cells in the extracted information; storing the information of each dimension in a separate two-dimensional array; and applying a special marker to the split cells.

Merged cells are further divided into horizontal merges (colspan), vertical merges (rowspan), and mixed merges (colspan + rowspan). For example, after the direct content of <td colspan='5' bgcolor="#F7FBFE">ABC</td> is extracted:

ABC

{←}

Here, the special marker "{←}" is specific to the extraction of direct content and indicates that the content of the cell is the same as that of the cell to its left. It provides flexibility for processing the TL and for the final content output; the extraction of the other kinds of data does not require a special marker.

The extracted 'background-color attribute distribution' is:

#F7FBFE

When a single row contains several horizontal merges (colspan), the shifting of column coordinates must also be handled. For example, <td colspan='2'>ABC</td><td colspan='3'>DEF</td> is extracted as:

ABC

{←}

DEF

{←}

Data extraction for vertical merges (rowspan) and mixed merges (colspan + rowspan) uses a similar method.
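Splitting horizontally merged cells while keeping column coordinates aligned can be sketched as below. The input format (a list of `(text, colspan)` pairs) and the function name `expand_row` are hypothetical, and emitting one "{←}" marker for every additional column a cell spans is an assumption of this sketch; vertical merges are not handled here.

```python
MARK = "{←}"  # marker meaning "same content as the cell to the left"


def expand_row(cells):
    """Expand one table row into one entry per physical column.

    `cells` is a list of (text, colspan) pairs. The first column of a merged
    cell keeps the text; every further column it spans receives MARK, so the
    cells that follow land on their true column index.
    """
    out = []
    for text, colspan in cells:
        out.append(text)
        out.extend([MARK] * (colspan - 1))
    return out


if __name__ == "__main__":
    print(expand_row([("ABC", 2), ("DEF", 3)]))
```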

Only when the table layout is known can the business data be extracted accurately and the table be converted into structured data according to the layout. Judging the table layout in step S233 includes the following operations:

(1) According to the extracted direct content, exclude the rows and columns that are not TLs.

An exclusion judgment is made according to the data type, length, and keywords of the direct content in a TL. The basis for the judgment includes: the length of the field name in each cell of a TL cannot exceed a threshold (for example, 50); the number of field names in a TL cannot exceed a threshold (for example, 1000); a field name cannot be a purely numeric string; and common field names contain keywords such as "title", "Name", "location", "Address", "address", "type", and "remarks". A keyword library is built from statistics over common tables, and each row or column is checked for keywords from the library.

The concrete steps for judging the table layout based on direct content are therefore: examine the extracted direct content row by row and column by column; if the data type of a direct content item is a numeric string, the row or column containing it is not a TL; if the field length of a direct content item exceeds the first threshold, the row or column containing it is not a TL; if several direct content items in a row or column contain given keywords, the row or column is a TL.

When the keyword-based judgment is used, at least two keywords must appear before the row or column is determined to be a TL, to ensure that the judgment is reliable.
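The exclusion rules of operation (1) can be combined into a single classifier for one row or column. This is a sketch under stated assumptions: the keyword set is only the handful of examples named in the text (lower-cased), the thresholds are the example values 50 and 1000, and the function name `classify_line` and its three return values are invented for illustration.

```python
KEYWORDS = {"title", "name", "location", "address", "type", "remarks"}  # illustrative
MAX_FIELD_LEN = 50   # example threshold for the length of one field name
MAX_FIELDS = 1000    # example threshold for the number of field names


def classify_line(cells):
    """Classify one row or column of direct content.

    Returns 'not_tl' when an exclusion rule fires, 'tl' when at least two
    cells contain a keyword, and 'unknown' otherwise. Empty cells are ignored.
    """
    texts = [c.strip() for c in cells if c and c.strip()]
    if len(texts) > MAX_FIELDS:
        return "not_tl"
    for text in texts:
        if text.isdigit() or len(text) > MAX_FIELD_LEN:
            return "not_tl"
    hits = sum(1 for text in texts if any(k in text.lower() for k in KEYWORDS))
    return "tl" if hits >= 2 else "unknown"


if __name__ == "__main__":
    print(classify_line(["Name", "Address", "Age"]))
    print(classify_line(["1", "Zhang San", "40"]))
```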

(2) Judge the table layout according to the extracted background-color attribute distribution.

When a table is displayed, the background color of the TL usually differs from that of the data to make reading easier, or the odd and even data rows use alternating background colors. The background-color attribute distribution can therefore be used to judge which row or column is likely to be a TL, and hence whether the table layout is horizontal or vertical.
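One possible reading of operation (2) is sketched below: if the first row has a single color that appears nowhere else, the TL is taken to be horizontal, and likewise for the first column. The 2D color array, the function name `layout_from_colors`, and the restriction to the first row and first column are all simplifying assumptions of this sketch.

```python
def layout_from_colors(colors):
    """Guess the layout from a 2D list of background-color values.

    Returns 'horizontal' when the first row has one color not used in any
    other row, 'vertical' when the first column has one color not used in
    any other column, and 'unknown' otherwise.
    """
    first_row = set(colors[0])
    rest_rows = {c for row in colors[1:] for c in row}
    first_col = {row[0] for row in colors}
    rest_cols = {c for row in colors for c in row[1:]}
    if len(first_row) == 1 and not (first_row & rest_rows):
        return "horizontal"
    if len(first_col) == 1 and not (first_col & rest_cols):
        return "vertical"
    return "unknown"


if __name__ == "__main__":
    print(layout_from_colors([["#eee", "#eee"], ["#fff", "#fff"], ["#fff", "#fff"]]))
```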

(3) Judge the table layout according to the extracted class attribute distribution.

Cells with the same class attribute are usually cells of the same kind. If, in every row, the cells all have the same class attribute, the table layout is horizontal; if, in every column, the cells all have the same class attribute, the table layout is vertical. Whether the layout is horizontal or vertical can therefore be judged from the class attribute distribution.
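The class-based rule of operation (3) can be written directly over a 2D array of class names. The function name `layout_from_classes` and the 'unknown' result for ambiguous tables (for example, every cell sharing one class) are assumptions of this sketch, which also assumes a rectangular array.

```python
def layout_from_classes(classes):
    """Guess the layout from a rectangular 2D list of class attribute values.

    Every row uniform (and columns not) -> 'horizontal';
    every column uniform (and rows not) -> 'vertical'; otherwise 'unknown'.
    """
    rows_uniform = all(len(set(row)) == 1 for row in classes)
    cols_uniform = all(len(set(col)) == 1 for col in zip(*classes))
    if rows_uniform and not cols_uniform:
        return "horizontal"
    if cols_uniform and not rows_uniform:
        return "vertical"
    return "unknown"


if __name__ == "__main__":
    print(layout_from_classes([["head", "head"], ["data", "data"]]))
```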

(4) Judge the table layout according to whether the data types of the direct content within the same row or the same column are identical.

After the TL is removed from the business data of a table, the data in the cells under each TL field name, as long as they are not null, should all be of the same type (this method distinguishes only 'purely numeric strings', 'date-time strings', and 'strings without obvious features'). For example, the table in Fig. 2 has a horizontal layout. In its business data part, apart from the TL in the first row, the cells in each column all have the same data type: the column under the field name "serial number" contains only purely numeric strings, the column under "executing court" contains only 'strings without obvious features', and the column under "execution case number" contains only 'strings without obvious features'. In short, apart from the TL row, the data type within each column is the same.

Based on this property, check whether the data types within each row are identical: if this holds for every row of the table (that is, all cells in the same row are 'purely numeric strings', or all are 'date-time strings', or all are 'strings without obvious features'), the table has a vertical layout. Then check whether the data types within each column are identical: if this holds for every column of the table, the table has a horizontal layout.

Some cells may contain null values. To prevent these cells from affecting the detection result, empty cells are excluded from the detection range when rows and columns are checked.

The business data part of a table usually contains a large amount of data, and checking all rows and columns would reduce the efficiency of the judgment. A short-circuit judgment can therefore be used: as soon as the result for a new row rules out a certain layout, the judgment can be terminated.
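The type-consistency check of operation (4), including the skipping of empty cells and the short-circuit exit, can be sketched as follows. The date pattern `DATE_RE`, the type labels, and the function names are assumptions; the patent names the three type categories but does not say how they are recognized.

```python
import re

# Hypothetical pattern for date-time strings such as 2016-12-29 or 2016/12/29 10:30.
DATE_RE = re.compile(r"^\d{4}[-/.]\d{1,2}[-/.]\d{1,2}([ T]\d{1,2}:\d{2}(:\d{2})?)?$")


def cell_type(text):
    """Classify a non-empty cell as 'numeric', 'datetime', or 'plain'."""
    text = text.strip()
    if text.isdigit():
        return "numeric"
    if DATE_RE.match(text):
        return "datetime"
    return "plain"


def uniform(lines):
    """True if, in every line, all non-empty cells share one type."""
    for line in lines:
        types = {cell_type(c) for c in line if c and c.strip()}
        if len(types) > 1:
            return False  # short circuit: one counter-example is enough
    return True


def layout_from_types(body):
    """`body` is the 2D direct content with the TL already removed."""
    cols_ok = uniform(zip(*body))
    rows_ok = uniform(body)
    if cols_ok and not rows_ok:
        return "horizontal"
    if rows_ok and not cols_ok:
        return "vertical"
    return "unknown"


if __name__ == "__main__":
    print(layout_from_types([["1", "Court A", "2016-01-05"], ["2", "", "2016-02-10"]]))
```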

(5) Judge the table layout according to the th/td distribution.

The number of cells in a TL is less than or equal to the number of cells in other rows, and the number of cells in non-TL rows should be relatively uniform. The number of cells in every row and column is counted from the th/td distribution; a row or column with noticeably fewer cells may be a TL, and the table layout can be obtained from whether the TL is horizontal or vertical and from the number of TLs.

th is generally used to define headers, which correspond to field names such as 'name' and 'age'. The layout of th may likewise be horizontal or vertical; for example, a horizontal layout is <th colspan='3'>List of results</th>, and a vertical layout is <th rowspan='3'>List of results</th>.

td can be used to define ordinary cells and can also be used to define headers.

When both th tags and td tags are present, the table layout is judged from the th distribution. Many tables, however, do not specify th; in that case the table layout is judged from the td distribution.
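When th tags are present, one simple way to read their distribution is to look for rows or columns made up entirely of th cells. The sketch below assumes a rectangular 2D array of tag names (merged cells already split) and an invented function name `layout_from_th`; the cell-count heuristic for tables without th is not covered.

```python
def layout_from_th(tags):
    """Guess the layout and number of TLs from a 2D list of 'th'/'td' names.

    Rows consisting only of th -> horizontal TLs;
    columns consisting only of th -> vertical TLs.
    Returns (layout, number_of_TLs); ('unknown', 0) when th gives no answer.
    """
    th_rows = [i for i, row in enumerate(tags) if all(t == "th" for t in row)]
    th_cols = [j for j, col in enumerate(zip(*tags)) if all(t == "th" for t in col)]
    if th_rows and not th_cols:
        return ("horizontal", len(th_rows))
    if th_cols and not th_rows:
        return ("vertical", len(th_cols))
    return ("unknown", 0)


if __name__ == "__main__":
    print(layout_from_th([["th", "th"], ["td", "td"], ["td", "td"]]))
```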

The above methods for judging the table layout can be combined in various ways as actually required, which improves the accuracy of the judgment. In addition, the method of this embodiment can identify tables with multiple TLs, which improves the reliability of the extracted data.

When the table layout is vertical, the table formed by the direct content must also be transposed into a horizontal layout.
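Transposing the two-dimensional array of direct content is a one-line operation in Python. The function name `transpose` is illustrative, and the sketch assumes a rectangular array.

```python
def transpose(table):
    """Turn a vertical-layout table (field names in the first column)
    into a horizontal-layout table (field names in the first row)."""
    return [list(column) for column in zip(*table)]


if __name__ == "__main__":
    vertical = [["Name", "Zhang San"], ["Age", "40"]]
    print(transpose(vertical))
```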

TLs are further divided into single-level TLs and multi-level TLs, but unless otherwise stated both are referred to simply as TL. In Fig. 2 there is only one TL and it is single-level. In Fig. 7 there is only one TL and it is multi-level (made up of several rows with a parent-child relationship); the field names in these rows must be merged to output a single row of field names. In Fig. 7 the TL of the original table is divided into two parts: the left part has multiple rows (multi-level) and the right part has a single row. The first level of the multi-level part is a merged cell with the field name 'basic information', and the second level contains the fields 'name', 'age', and 'sex'. The final output is a single-level TL whose structure is "basic information_name", "basic information_age", "basic information_sex", "other field A", "other field B".
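Flattening a two-level TL into one row of field names can be sketched as below. The representation is an assumption: the first-level row has its merged cell already split with "{←}" markers, and cells of the second-level row that are covered by a vertically merged cell are given as empty strings. The function name `flatten_tl` is hypothetical.

```python
MARK = "{←}"  # marker meaning "same content as the cell to the left"


def flatten_tl(level1, level2, sep="_"):
    """Merge a two-level TL into a single row of field names.

    Each second-level name is prefixed with its first-level parent;
    a column without a second-level name keeps the first-level name.
    """
    out = []
    parent = ""
    for top, sub in zip(level1, level2):
        if top != MARK:
            parent = top
        out.append(parent + sep + sub if sub else parent)
    return out


if __name__ == "__main__":
    top = ["basic information", MARK, MARK, "other field A", "other field B"]
    second = ["name", "age", "sex", "", ""]
    print(flatten_tl(top, second))
```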

When the table layout is multi-TL, a cut-and-merge operation must also be applied to the table formed by the direct content to convert it into a single-TL layout, so as to meet the format requirements of structured data. The cut-and-merge operation includes: comparing the direct content of the multiple TLs; for TLs with identical content, keeping only one TL row, as shown in Fig. 5; for TLs with different content, splicing them into a single TL row, as shown in Fig. 6.
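The merge half of the cut-and-merge operation might be sketched as follows, assuming the table has already been cut into blocks, each consisting of a TL and its data rows. The block representation and the function name `merge_blocks` are hypothetical, and splicing rows of different blocks side by side by position is only one plausible interpretation of Fig. 6.

```python
def merge_blocks(blocks):
    """Merge several (tl, rows) blocks into a single-TL table.

    Identical TLs: keep one TL and stack the data rows.
    Different TLs: splice the TLs into one row and join the data rows
    of the blocks side by side, position by position.
    """
    first_tl = blocks[0][0]
    if all(tl == first_tl for tl, _ in blocks):
        rows = [row for _, block_rows in blocks for row in block_rows]
        return first_tl, rows
    tl = [name for block_tl, _ in blocks for name in block_tl]
    rows = [sum(parts, []) for parts in zip(*(block_rows for _, block_rows in blocks))]
    return tl, rows


if __name__ == "__main__":
    same = [(["name", "age"], [["A", "1"]]), (["name", "age"], [["B", "2"]])]
    print(merge_blocks(same))
```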

Finally, for merged cells, the special markers can be corrected according to business needs. For example:

ABC

{←}

can be adjusted to the following form:

ABC
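Correcting the special markers can be sketched as a small post-processing step. Both correction modes shown here (blanking the marker, or copying the value from the left) and the function name `resolve_marks` are assumptions; the text only states that the markers are adjusted according to business needs.

```python
MARK = "{←}"  # marker meaning "same content as the cell to the left"


def resolve_marks(row, mode="blank"):
    """Replace MARK cells in one row.

    mode='blank' turns each marker into an empty string;
    mode='copy' repeats the nearest value to the left.
    """
    out = []
    for cell in row:
        if cell == MARK:
            out.append("" if mode == "blank" else (out[-1] if out else ""))
        else:
            out.append(cell)
    return out


if __name__ == "__main__":
    print(resolve_marks(["ABC", MARK]))
    print(resolve_marks(["ABC", MARK, MARK], mode="copy"))
```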

The above method for adaptively extracting structured information from HTML table tags is an extraction method for a single table. When a web page contains several table tags (several tables), the method only needs to be applied repeatedly to extract the table corresponding to each table tag, and the extraction results are then merged according to predetermined rules.

For data extraction from div layouts, step S24 specifically includes: obtaining, from the div tags, the labels that match known business field names; judging the structural layout according to the positions of the labels within the div tags; and obtaining the business data according to the structural layout.

The known business field names may be given in advance or obtained from statistics over previously parsed data. A label is a field name inside a div tag, such as "name", "age", and "sex" in Example 1. Example 1 and Example 2 are tables in div layout. For instance, the three field names "name", "age", and "sex" are extracted from the div tags according to the known business field names. In Example 1, the value of an extracted label is in the tag to its right, so the structural layout of the table is determined to be a left-right key-value layout (vertical layout). In Example 2, the extracted labels are all in one row of tags, so the structural layout of the table is determined to be a top-bottom layout (horizontal layout).

Example one

<div><div>Name</div><div>Zhang San</div></div>

Example two

<div><div>Zhang San</div><div>18</div><div>Male</div></div>
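The div-layout judgment of step S24 can be sketched with Python's standard `html.parser`. The sketch assumes a two-level nesting in which each outer div is a row and each inner div is a cell; the set `KNOWN_FIELDS`, the class `DivRows`, and the function `div_layout` are hypothetical. The label row used in the horizontal case of the usage example is also an assumption, since Example 2 above shows only a row of values.

```python
from html.parser import HTMLParser

KNOWN_FIELDS = {"name", "age", "sex"}  # hypothetical known business field names


class DivRows(HTMLParser):
    """Group the text of inner <div> cells by their outer <div> row."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            if self.depth == 1:
                self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 2 and data.strip():
            self.rows[-1].append(data.strip())


def div_layout(html):
    """Return 'vertical' (left-right key-value), 'horizontal' (top-bottom), or 'unknown'."""
    parser = DivRows()
    parser.feed(html)
    for row in parser.rows:
        labels = [text for text in row if text.lower() in KNOWN_FIELDS]
        if labels and len(labels) == len(row) and len(row) > 1:
            return "horizontal"  # a row made only of labels; values follow in later rows
        if labels and len(labels) < len(row):
            return "vertical"    # a label followed by its value in the same row
    return "unknown"


if __name__ == "__main__":
    print(div_layout("<div><div>Name</div><div>Zhang San</div></div>"))
```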

Based on the above-mentioned data processing method identical inventive concept for improving web crawlers stability, availability, The present embodiment additionally provides a kind of data processing equipment for improving web crawlers stability, availability, as shown in figure 8, bag Include：Structural variation detection module, for according to the feature specified in advance, judging whether current page there occurs local structure Change；Parsing module, if being changed for non-recurring structure, obtains the topology layout of current page, according to current page Content in topology layout parsing current page；Field self-adaptative adjustment module, according to the mapping ruler being pre-configured with, to passing through Parse the service fields name for obtaining and do self organizing maps, and store to memory block.

Further, it is structural variation detection module specifically for：Feature and the current page specified in advance are compared one by one Corresponding label, if inconsistent, then it is assumed that current page there occurs local structure change.

Further, parsing module specifically for：Obtain the html file of current page；Extracted from html file The content in content and div tag in Table labels；The structure cloth of the content obtaining current page in Table labels Office, the topology layout parsing content according to current page；The topology layout of the content obtaining current page in div tag, Topology layout parsing content according to current page.

Further, in parsing module, the topology layout of the content obtaining current page in Table labels, according to The topology layout parsing content of current page, including：Title division in detection Table labels；Except mark in extraction Table labels Inscribe the various dimensions information of part；Various dimensions information according to extracting judges topology layout；Business datum is obtained according to topology layout.

Further, in the parsing module, obtaining the structural layout of the current page from the content in the div tags and parsing the content according to that layout includes: obtaining, from the div tags, the tags that match known business field names; judging the structural layout according to the positions of the matched tags within the div tags; and obtaining the business data according to the structural layout.
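The div branch can be sketched in a similar way. The sketch assumes two possible layouts: labels interleaved with their values, or all labels grouped before all values. It judges the layout from the positions of the text nodes that match known business field names. The set KNOWN_FIELDS, the two layout names, and the names DivTextReader and parse_divs are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical set of known business field names.
KNOWN_FIELDS = {"Name", "Age", "Gender"}


class DivTextReader(HTMLParser):
    """Collect non-empty text found inside div tags, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.texts.append(text)


def parse_divs(html, known_fields=KNOWN_FIELDS):
    """Judge the layout from where known field names sit, then pair values."""
    reader = DivTextReader()
    reader.feed(html)
    reader.close()
    texts = reader.texts
    positions = [i for i, text in enumerate(texts) if text in known_fields]
    n = len(positions)
    if n == 0 or len(texts) < 2 * n:
        return "unknown", {}
    if positions == list(range(0, 2 * n, 2)):
        # label, value, label, value, ...
        return "interleaved", {texts[i]: texts[i + 1] for i in positions}
    if positions == list(range(n)):
        # all labels first, then all values in the same order
        return "labels_first", {texts[i]: texts[i + n] for i in positions}
    return "unknown", {}


interleaved = ("<div><div>Name</div><div>Zhang San</div>"
               "<div>Age</div><div>18</div></div>")
grouped = ("<div><div><div>Name</div><div>Age</div></div>"
           "<div><div>Zhang San</div><div>18</div></div></div>")
```

A page with no recognizable field names is reported as "unknown" rather than guessed at, so that such pages can be handled separately.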

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the scope of the claims and the specification of the present invention.

Claims

1. A data processing method for improving the stability and availability of a web crawler, characterized by comprising:

Step S1: judging, according to pre-specified features, whether a local structural change has occurred in the current page;

Step S2: if a structural change has occurred, obtaining the structural layout of the current page, and parsing the content of the current page according to the structural layout of the current page;

Step S3: adaptively mapping the business field names obtained by parsing according to pre-configured mapping rules, and storing the result in a storage area.

2. The method according to claim 1, characterized in that step S1 comprises: comparing the pre-specified features one by one with the corresponding tags of the current page and, if they are inconsistent, determining that a local structural change has occurred in the current page.

3. The method according to claim 1, characterized in that step S2 comprises:

obtaining the HTML file of the current page;

extracting the content in table tags and the content in div tags from the HTML file;

obtaining the structural layout of the current page from the content in the table tags, and parsing the content according to the structural layout of the current page;

obtaining the structural layout of the current page from the content in the div tags, and parsing the content according to the structural layout of the current page.

4. The method according to claim 3, characterized in that obtaining the structural layout of the current page from the content in the table tags, and parsing the content according to the structural layout of the current page, comprises:

detecting the header portion in the table tags;

extracting multi-dimensional information from the table tags other than the header portion;

judging the structural layout according to the extracted multi-dimensional information;

obtaining the business data according to the structural layout.

5. The method according to claim 3, characterized in that obtaining the structural layout of the current page from the content in the div tags, and parsing the content according to the structural layout of the current page, comprises: obtaining, from the div tags, the tags that match known business field names; judging the structural layout according to the positions of the matched tags within the div tags; and obtaining the business data according to the structural layout.

6. A data processing device for improving the stability and availability of a web crawler, characterized by comprising:

a structural change detection module, configured to judge, according to pre-specified features, whether a local structural change has occurred in the current page;

a parsing module, configured to, if a structural change has occurred, obtain the structural layout of the current page, and parse the content of the current page according to the structural layout of the current page;

a field adaptive adjustment module, configured to adaptively map the business field names obtained by parsing according to pre-configured mapping rules, and to store the result in a storage area.

7. The device according to claim 5, characterized in that the structural change detection module is specifically configured to: compare the pre-specified features one by one with the corresponding tags of the current page and, if they are inconsistent, determine that a local structural change has occurred in the current page.

8. The device according to claim 5, characterized in that the parsing module is specifically configured to:

obtain the HTML file of the current page;

9. The device according to claim 8, characterized in that, in the parsing module, obtaining the structural layout of the current page from the content in the table tags, and parsing the content according to the structural layout of the current page, comprises:

detecting the header portion in the table tags;

obtaining the business data according to the structural layout.

10. The device according to claim 8, characterized in that, in the parsing module, obtaining the structural layout of the current page from the content in the div tags, and parsing the content according to the structural layout of the current page, comprises: obtaining, from the div tags, the tags that match known business field names; judging the structural layout according to the positions of the matched tags within the div tags; and obtaining the business data according to the structural layout.