CN106777281A - Data processing method and device for improving the stability and availability of web crawlers - Google Patents

Info

Publication number
CN106777281A
Authority
CN
China
Prior art keywords
current page
content
topology layout
layout
parsing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611243842.5A
Other languages
Chinese (zh)
Other versions
CN106777281B (en
Inventor
张军
贾西贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaao Data Technology Co Ltd
Original Assignee
Shenzhen Huaao Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huaao Data Technology Co Ltd
Priority to CN201611243842.5A
Publication of CN106777281A
Application granted
Publication of CN106777281B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The present invention relates to a data processing method and device for improving the stability and availability of web crawlers. The method provided by the present invention comprises: step S1, judging, according to pre-specified features, whether a local structural change has occurred in the current page; step S2, if no structural change has occurred, obtaining the structural layout of the current page and parsing the content of the current page according to that structural layout; step S3, according to pre-configured mapping rules, adaptively mapping the business field names obtained by parsing, and storing them in a storage area. The data processing method and device provided by the present invention can automatically identify non-structural changes in web pages and use adaptive data extraction logic, so that frequent maintenance is not required.

Description

Data processing method and device for improving the stability and availability of web crawlers
Technical field
The present invention relates to the technical field of data processing, and in particular to a data processing method and device for improving the stability and availability of web crawlers.
Background art
With the popularization and development of the Internet, many kinds of information, from e-commerce websites, portals, blogs, microblogs and the like, are published online, and people can collect massive amounts of information from the Internet and analyze and count it to obtain the information they need.
The existing approach is to obtain information with web crawler technology. Setting aside binary content such as pictures and videos, what a web crawler generally obtains is the text content of web pages, and a traditional crawler parses that information using regular expressions, XPath, or positions.
The problem, however, is that web pages change dynamically: for example, the position of a business field name or field value, the id of an HTML tag, or an XPath path may change at any time. This dynamic nature of web pages means that web crawlers need frequent maintenance; existing web crawlers therefore have poor generality and a very high maintenance cost.
Summary of the invention
In view of the defects in the prior art, the data processing method and device for improving the stability and availability of web crawlers provided by the present invention can automatically identify non-structural changes in web pages and use adaptive data extraction logic, so that frequent maintenance is not required.
In a first aspect, the present invention provides a data processing method for improving the stability and availability of web crawlers, comprising: step S1, judging, according to pre-specified features, whether a local structural change has occurred in the current page; step S2, if no structural change has occurred, obtaining the structural layout of the current page and parsing the content of the current page according to that structural layout; step S3, according to pre-configured mapping rules, adaptively mapping the business field names obtained by parsing, and storing them in a storage area.
The data processing method for improving the stability and availability of web crawlers provided by the present invention can automatically identify non-structural changes in web pages and uses adaptive data extraction logic. It requires no frequent maintenance, which saves cost, and at the same time it improves the stability of web data crawling and has better generality.
Preferably, step S1 comprises: comparing the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
Preferably, step S2 comprises: obtaining the HTML file of the current page; extracting from the HTML file the content in Table tags and the content in div tags; obtaining the structural layout of the current page from the content in the Table tags, and parsing the content according to that structural layout; obtaining the structural layout of the current page from the content in the div tags, and parsing the content according to that structural layout.
Preferably, obtaining the structural layout of the current page from the content in the Table tags and parsing the content according to that structural layout comprises: detecting the title section in the Table tag; extracting the multi-dimensional information in the Table tag other than the title section; judging the structural layout from the extracted multi-dimensional information; and obtaining the business data according to the structural layout.
Preferably, obtaining the structural layout of the current page from the content in the div tags and parsing the content according to that structural layout comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout from the positions of the matched labels within the div tags, and obtaining the business data according to the structural layout.
In a second aspect, the present invention provides a data processing device for improving the stability and availability of web crawlers, comprising: a structural change detection module, configured to judge, according to pre-specified features, whether a local structural change has occurred in the current page; a parsing module, configured to obtain the structural layout of the current page if no structural change has occurred, and to parse the content of the current page according to that structural layout; and a field adaptive adjustment module, configured to adaptively map, according to pre-configured mapping rules, the business field names obtained by parsing, and to store them in a storage area.
The data processing device for improving the stability and availability of web crawlers provided by the present invention can automatically identify non-structural changes in web pages and uses adaptive data extraction logic. It requires no frequent maintenance, which saves cost, and at the same time it improves the stability of web data crawling and has better generality.
Preferably, the structural change detection module is specifically configured to: compare the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
Preferably, the parsing module is specifically configured to: obtain the HTML file of the current page; extract from the HTML file the content in Table tags and the content in div tags; obtain the structural layout of the current page from the content in the Table tags, and parse the content according to that structural layout; obtain the structural layout of the current page from the content in the div tags, and parse the content according to that structural layout.
Preferably, in the parsing module, obtaining the structural layout of the current page from the content in the Table tags and parsing the content according to that structural layout comprises: detecting the title section in the Table tag; extracting the multi-dimensional information in the Table tag other than the title section; judging the structural layout from the extracted multi-dimensional information; and obtaining the business data according to the structural layout.
Preferably, in the parsing module, obtaining the structural layout of the current page from the content in the div tags and parsing the content according to that structural layout comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout from the positions of the matched labels within the div tags, and obtaining the business data according to the structural layout.
Brief description of the drawings
Fig. 1 is a flow chart of the data processing method for improving the stability and availability of web crawlers provided by an embodiment of the present invention;
Fig. 2 shows the layout of the title section, remarks section, and business data section in an example table;
Fig. 3 is an example of a vertical multi-TL layout;
Fig. 4 is an example of a horizontal multi-TL layout;
Fig. 5 is an example of cutting and merging a table with a multi-TL layout;
Fig. 6 is an example of cutting and merging a table with a multi-TL layout;
Fig. 7 is an example of processing a table with a single-TL (multi-level) layout;
Fig. 8 is a structural block diagram of the data processing device for improving the stability and availability of web crawlers provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the technical solution of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the present invention more clearly; they are examples only and cannot be used to limit the scope of protection of the present invention.
It should be noted that, unless otherwise indicated, the technical or scientific terms used in this application have the ordinary meaning understood by a person of ordinary skill in the art to which the present invention belongs.
A table in a web page is defined by the HTML <table> tag. A table row is defined by the <tr> tag; <tr> must be inside a <table></table> and cannot be used on its own. Each row is divided into several cells, and each cell is defined by a <td> tag, which must be nested inside <tr></tr>. Like <td>, <th> must also be nested inside <tr>; <th>...</th> defines a header cell, which contains the table's header information. Usage is as follows:
The table that the above code displays in a web page is as follows:
Name Age
Zhang San 40
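The markup that produces this table is not reproduced in the text. As a minimal sketch, assuming the table was written with one <th> header row and one <td> data row, the following Python reads such markup back into rows of cell text using the standard library's html.parser. The EXAMPLE_HTML string, the TableRows class, and the parse_rows helper are illustrative names, not part of the patent.

```python
from html.parser import HTMLParser

# Assumed markup: the original listing is not reproduced in the text.
EXAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Zhang San</td><td>40</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell texts
        self._in_cell = False   # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        # Text between tags outside a cell (whitespace) is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(html):
    parser = TableRows()
    parser.feed(html)
    return parser.rows
```

Calling parse_rows(EXAMPLE_HTML) yields one list per table row, which is the form the later extraction steps work on.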
The div tag in HTML defines a division or section of a document. The <div> tag divides a document into independent, distinct parts. It can be used as a strict organizational tool, and no formatting is associated with it.
This embodiment provides a data processing method for improving the stability and availability of web crawlers which, as shown in Fig. 1, comprises:
Step S1, judging, according to pre-specified features, whether a local structural change has occurred in the current page.
Here, the pre-specified features refer to the structural layout of the web page, which appears in the HTML as the type, position, attributes, and so on of tags. A structural change means that the structural layout of the page has changed, for example: a tag has disappeared, an attribute of a tag has changed, or the number of rows or columns of a Table has changed.
Step S2, if no structural change has occurred, obtaining the structural layout of the current page and parsing the content of the current page according to that structural layout.
Step S3, according to pre-configured mapping rules, adaptively mapping the business field names obtained by parsing, and storing them in a storage area.
Here, a business field name is the title of an item of business data, such as "executing court" and "execution case number" in Fig. 2. Adaptive mapping means replacing the business field names obtained by parsing with predefined standard fields, so that the business field names of the extracted data are unified, which makes later management and statistics easier. For example, "enterprise name" and "organization name" are automatically mapped to "company name" in the storage area.
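The mapping in step S3 can be pictured as a lookup table from parsed names to standard names. The sketch below is one minimal way to do this, assuming the pre-configured rules are a plain dictionary; FIELD_MAPPING and map_field_names are hypothetical names, and the patent does not specify how the rules are stored.

```python
# Hypothetical pre-configured mapping rules: parsed name -> standard name.
FIELD_MAPPING = {
    "enterprise name": "company name",
    "organization name": "company name",
}


def map_field_names(record, mapping=FIELD_MAPPING):
    """Return a copy of record whose keys are replaced by standard names.

    Names without a mapping rule are kept unchanged.
    """
    return {mapping.get(name, name): value for name, value in record.items()}
```

A record parsed as {"enterprise name": "ACME", "address": "Shenzhen"} would then be stored under the keys "company name" and "address".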
The data processing method for improving the stability and availability of web crawlers provided by this embodiment can automatically identify non-structural changes in web pages and uses adaptive data extraction logic. It requires no frequent maintenance, which saves cost, and at the same time it improves the stability of web data crawling and has better generality.
Step S1 specifically comprises: comparing the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
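The one-by-one comparison of step S1 can be sketched as follows. The format of a feature is an assumption made here for illustration: each feature is a triple of the tag's position in document order, its tag name, and the attributes it is expected to carry. TagCollector and has_structural_change are hypothetical names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag with its attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def has_structural_change(html, features):
    """Compare each (position, tag, attrs) feature with the page's tags.

    Returns True as soon as one feature does not match the page.
    """
    collector = TagCollector()
    collector.feed(html)
    for position, tag, attrs in features:
        if position >= len(collector.tags):
            return True                      # the expected tag has disappeared
        found_tag, found_attrs = collector.tags[position]
        if found_tag != tag:
            return True                      # the tag type has changed
        if any(found_attrs.get(k) != v for k, v in attrs.items()):
            return True                      # an attribute has changed
    return False
```

Checking table row or column counts, which the text also mentions, would need an additional feature type and is left out of this sketch.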
An HTML document may contain several types of tags, such as Table tags and div tags. Different tags call for different extraction methods; to handle HTML of mixed types, step S2 specifically comprises:
Step S21, obtaining the HTML file of the current page.
Step S22, extracting from the HTML file the content in Table tags and the content in div tags.
Step S23, obtaining the structural layout of the current page from the content in the Table tags, and parsing the content according to that structural layout.
Step S24, obtaining the structural layout of the current page from the content in the div tags, and parsing the content according to that structural layout.
Tables on web pages are written with HTML Table tags, and most of this information is semi-structured data. Although it looks fairly regular on the rendered page, the underlying tags and data are irregular or even very chaotic, so that the title section is mixed together with the business data and the business data cannot be extracted quickly and accurately. To extract the data in web page tables automatically, quickly, and accurately, step S23 specifically comprises:
Step S231, detecting the title section in the Table tag.
As shown in Fig. 2, the title section of a table is usually one large merged cell, which may span one row or several rows. A table may also contain a remarks section, whose structure is similar to that of the title section. Once the title section and the remarks section are removed, what remains is the business data to be extracted. When a table contains a remarks section, the remarks section also needs to be detected in this step, in the same way as the title section.
Step S232, extracting the multi-dimensional information in the Table tag other than the title section.
Here, the multi-dimensional information includes: the direct content, the th/td distribution, the class attribute distribution, the background-color attribute distribution, and so on. The direct content is the content displayed directly in the table on the web page, that is, the text content inside the <table> tag, such as "name", "age", "Zhang San", "40". The th/td distribution is the distribution of th and td tags over the positions of the table. The class attribute specifies the class name of the element in a cell, and the class attribute distribution is the distribution of class attributes over the positions of the table. The background-color attribute defines the background color of a cell, and the background-color attribute distribution is the distribution of background-color attributes over the positions of the table.
Step S233, judging the structural layout from the extracted multi-dimensional information.
Here, common table layouts are divided into horizontal single-TL, horizontal multi-TL, vertical single-TL, vertical multi-TL, and multi-table combination. A TL (TitleLine) is the header line (or data header section); physically it may span several rows, but logically it is one region. It gives the title of each item of business data; for example, the first row of the business data section in Fig. 2 is a TL. A TL may be horizontal or vertical: Fig. 3 shows a vertical multi-TL layout and Fig. 4 a horizontal multi-TL layout.
Step S234, obtaining the business data according to the structural layout.
Step S23 provides a method for adaptively extracting the structured information in an HTML Table tag. It first detects the title section in the Table tag and excludes content that does not belong to the business data, which prevents useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title section and judges the structural layout of the table from that information as a whole. Because the information in the Table tag reflects the table layout, the new table layout can be obtained by analyzing that information no matter how the table in the web page has changed. The method provided by step S23 therefore does not need to know the table layout in advance and does not require a program to be rewritten for HTML Tables of different structures. This solves the lack of generality of existing Table extraction algorithms and improves the reliability of the extracted data, and it is especially effective for recognizing and extracting large volumes of semi-structured data.
The title section and remarks section are generally in the first or second row of the table, and each is a single merged cell. A specific implementation of step S231 therefore comprises: checking, row by row, whether each row in the Table tag is a single merged cell; if so, the row belongs to the title section and the next row is checked; if not, business data begins at this row and detection of the title section stops. For example, the code of a title section or remarks section usually has the following form:
<tr><td colspan='5'>2016 personnel information statistics table</td></tr>
The above code contains only one <td> tag, and colspan='5' shows that it is a merged cell; the title section and remarks section can be recognized by checking for <td> together with colspan.
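The row-by-row check of step S231 can be sketched as below. The input format is an assumption: each row is a list of (text, colspan) pairs already read from the <td>/<th> tags. count_title_rows is a hypothetical name.

```python
def count_title_rows(rows):
    """Count the leading rows that consist of one merged cell.

    rows: list of rows; each row is a list of (text, colspan) tuples.
    Scanning stops at the first row that is not a single merged cell,
    because business data is taken to begin there.
    """
    count = 0
    for row in rows:
        if len(row) == 1 and row[0][1] > 1:
            count += 1      # a single cell spanning several columns
        else:
            break           # business data starts at this row
    return count
```

Because the scan counts however many leading rows qualify, it does not need to be told in advance how many title or remarks rows a table has.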
In the prior art, filtering out useless data (such as the title section and remarks section) requires knowing the position of the useless data in advance and then hard-coding that position in the program so that the first few rows are skipped. The method of this embodiment is more general: no matter how many rows the title section and remarks section of a table occupy, it detects them accurately and efficiently, which ensures that the business data is extracted correctly.
During data extraction, besides directly obtaining the information of standard cells, merged cells need special handling so that the extracted data fits the storage format and is convenient for subsequent processing. A preferred implementation of step S232 therefore comprises: extracting the multi-dimensional information in the Table tag other than the title section (and other than the remarks section, if there is one); after splitting the merged cells in the extracted information, storing the information of each dimension in a separate two-dimensional array, and specially marking the cells produced by splitting.
Merged cells are further divided into horizontal merges (colspan), vertical merges (rowspan), and mixed merges (colspan + rowspan). For example, extracting the direct content of <td colspan='5' bgcolor="#F7FBFE">ABC</td> gives:
ABC {←} {←} {←} {←}
The special mark "{←}" is specific to the extracted direct content. It indicates that the content of the cell is the same as that of the cell to its left, which gives flexibility when processing TLs and producing the final output. Extraction of the other kinds of data needs no special mark.
Extracting the 'background-color attribute distribution' gives:
#F7FBFE #F7FBFE #F7FBFE #F7FBFE #F7FBFE
When a single row contains several horizontal merges (colspan), attention must also be paid to coordinate translation. For example, <td colspan='2'>ABC</td><td colspan='3'>DEF</td> is extracted as:
ABC {←} DEF {←} {←}
A similar method is used to extract data for vertical merges (rowspan) and mixed merges (colspan + rowspan).
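The splitting of horizontally merged cells can be sketched as follows, covering colspan only. The input format, a list of (text, colspan, bgcolor) triples per row, and the name expand_row are assumptions made for illustration; the "{←}" mark is the one described in the text.

```python
MARK = "{←}"  # "same content as the cell to the left"


def expand_row(cells):
    """Split horizontally merged cells of one row.

    cells: list of (text, colspan, bgcolor) tuples.
    Returns two lists of equal length: the direct content, where every
    extra column of a merged cell holds MARK, and the background colors,
    where the color is simply repeated.
    """
    contents, colors = [], []
    for text, colspan, bgcolor in cells:
        contents.append(text)
        contents.extend([MARK] * (colspan - 1))
        colors.extend([bgcolor] * colspan)
    return contents, colors
```

Appending to the end of the output list is what takes care of the coordinate translation: a cell that follows a merge lands at the column after the merge's last expanded position.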
Business data can be extracted accurately, and converted into structured data according to the table layout, only if the table layout is known. Judging the table layout in step S233 involves the following operations:
(1) According to the extracted direct content, exclude rows and columns that are not TLs.
An exclusionary judgment is made based on the data type, length, and keywords of the direct content. The bases for the judgment are: the length of the field name in each TL cell cannot exceed a threshold (such as 50); the number of field names in a TL cannot exceed a threshold (such as 1000); a field name cannot be a purely numeric string; and common field names contain keywords such as "title", "Name", "location", "Address", "address", "type", and "remarks". A keyword library is obtained from statistics over common tables, and each row or column is checked for keywords from the library.
Therefore, the specific steps for judging the table layout based on direct content are: examine the extracted direct content row by row and column by column; if the data type of a direct content item is a numeric string, the row or column containing it is not a TL; if the field length of a direct content item exceeds a first threshold, the row or column containing it is not a TL; if several direct content items in a row or column contain given keywords, that row or column is a TL.
When the keyword-based judgment is used, at least two keywords must appear before the row or column is identified as a TL, so that the judgment is reliable.
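The content-based rules in operation (1) can be sketched as a single predicate over one row or column. The keyword set here is a small stand-in for the statistically derived keyword library, and is_title_line and its parameter names are hypothetical; the threshold of 50 and the two-keyword minimum come from the text.

```python
# Stand-in for the keyword library obtained from table statistics.
KEYWORDS = {"name", "title", "address", "location", "type", "remarks"}
MAX_FIELD_NAME_LENGTH = 50


def is_title_line(cells, keywords=KEYWORDS, max_len=MAX_FIELD_NAME_LENGTH):
    """Decide whether one row or column of direct content is a TL.

    Returns True only when the line passes both exclusion rules and
    at least two of its cells contain a keyword.
    """
    texts = [c.strip() for c in cells if c.strip()]
    if any(t.isdigit() for t in texts):
        return False            # a field name is never a pure number
    if any(len(t) > max_len for t in texts):
        return False            # a field name is never this long
    hits = sum(1 for t in texts if any(k in t.lower() for k in keywords))
    return hits >= 2
```

A line that passes the exclusion rules but has fewer than two keyword hits is reported as not confirmed, so the other judgment operations would have to decide it.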
(2) According to the extracted background-color attribute distribution, judge the table layout.
When a table is displayed, the background color of the TL often differs from the background color of the data so that the table is easier to read, or the odd and even data rows use alternating background colors. The background-color attribute distribution can therefore be used to judge which rows or columns may be TLs, and from that whether the table layout is horizontal or vertical.
(3) According to the extracted class attribute distribution, judge the table layout.
Cells with the same class attribute are usually cells of the same kind. If the cells within every row share the same class attribute, the table layout is horizontal; if the cells within every column share the same class attribute, the table layout is vertical. The class attribute distribution can therefore be used to judge whether the layout is horizontal or vertical.
(4) According to whether the data types of the direct content in the same row or the same column are identical, judge the table layout.
In the business data of a table with the TL section removed, the data in the cells under each TL field name, as long as it is not null, should all be of the same type (this method can only distinguish 'purely numeric strings', 'date-time strings', and 'strings without obvious features'). For example, the table in Fig. 2 has a horizontal layout; in its business data section, apart from the TL in the first row, the cells in each column all have the same data type: the column under the field name "serial number" consists entirely of purely numeric strings, the column under "executing court" entirely of 'strings without obvious features', and the column under "execution case number" entirely of 'strings without obvious features'. In short, apart from the TL row, the data type within each column is the same.
Based on this property, check whether the data types within each row are identical: if they are identical in every row of the table (that is, all cells in the same row are 'purely numeric strings', or all are 'date-time strings', or all are 'strings without obvious features'), the table has a vertical layout. Check whether the data types within each column are identical: if they are identical in every column of the table (that is, all cells in the same column are 'purely numeric strings', or all are 'date-time strings', or all are 'strings without obvious features'), the table has a horizontal layout.
Some cells may be null. To keep these cells from affecting the result, cells with empty content are excluded from the check when rows and columns are examined.
The business data section of a table usually contains a large amount of data, and checking every row and column would make the judgment less efficient. Short-circuit evaluation can therefore be used: as soon as the result for a new row rules out a certain layout, the check stops.
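Operation (4) can be sketched as below, assuming the TL has already been removed and the remaining grid is rectangular. The regular expressions used to recognize numbers and date-times are illustrative choices, not taken from the patent, and classify, uniform, and layout_by_type are hypothetical names.

```python
import re

NUMBER = re.compile(r"\d+(\.\d+)?")
DATETIME = re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}( \d{1,2}:\d{2}(:\d{2})?)?")


def classify(text):
    """Return 'number', 'datetime', 'text', or None for an empty cell."""
    text = text.strip()
    if not text:
        return None
    if NUMBER.fullmatch(text):
        return "number"
    if DATETIME.fullmatch(text):
        return "datetime"
    return "text"


def uniform(lines):
    """True if every line holds a single data type, ignoring empty cells."""
    for line in lines:
        kinds = {classify(cell) for cell in line} - {None}
        if len(kinds) > 1:
            return False        # short circuit: one mixed line is enough
    return True


def layout_by_type(data_rows):
    """Judge the layout of business data whose TL was already removed."""
    columns = list(zip(*data_rows))
    rows_ok, cols_ok = uniform(data_rows), uniform(columns)
    if cols_ok and not rows_ok:
        return "horizontal"     # each column has one type
    if rows_ok and not cols_ok:
        return "vertical"       # each row has one type
    return "undetermined"
```

When every cell is plain text, both checks pass and the sketch reports "undetermined"; in that case one of the other judgment operations would be needed.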
(5) According to the th/td distribution, judge the table layout.
The number of cells in a TL is less than or equal to the number of cells in the other rows, and the number of cells in non-TL rows should be fairly uniform. The number of cells in every row and column is counted from the th/td distribution; a row or column with noticeably fewer cells may be a TL, and the table layout can be obtained from whether the TLs are horizontal or vertical and from the number of TLs.
th is generally used to define titles, which correspond to field names such as 'name' and 'age'. The layout of th can likewise be horizontal or vertical: for example, a horizontal layout is <th colspan='3'>List of results</th>, and a vertical layout is <th rowspan='3'>List of results</th>.
td can be used to define ordinary cells, and it can also be used to define titles.
When a table has both th tags and td tags, the table layout is judged from the th distribution. Many tables, however, do not specify th; in that case the table layout is judged from the td distribution.
The above methods of judging the table layout can be combined in various ways according to actual needs, which improves the accuracy of the judgment. In addition, the method of this embodiment can recognize tables that contain multiple TLs, which improves the reliability of the extracted data.
When the table layout is vertical, the table formed by the direct content also needs to be transposed into a horizontal layout.
TLs are further divided into single-level TLs and multi-level TLs; unless stated otherwise, both are simply called TL. The table in Fig. 2 has only one TL, and it is a single-level TL. The table in Fig. 7 has only one TL, and it is a multi-level TL (it consists of several rows with a parent-child relationship); the field names in these rows need to be merged to produce a single row of field names. As shown in Fig. 7, the TL of the original table has two parts: the left part has several rows (multi-level) and the right part has a single row. The first level of the multi-level part is a merged cell with the field name 'basic information', and the second level consists of the fields 'name', 'age', and 'sex'. The final output is a single-level TL with the structure "basic information_name", "basic information_age", "basic information_sex", "other field A", "other field B".
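Merging a multi-level TL into one row of field names can be sketched as follows. Two assumptions are made about the input: horizontally merged parent cells have already been expanded so that the parent name is repeated above each child column, and a position covered by a vertically merged cell is given as None. flatten_title_lines is a hypothetical name.

```python
def flatten_title_lines(levels):
    """Join the levels of a multi-level TL into single field names.

    levels: list of TL rows from the top level down, all of equal length.
    None marks a position covered by a vertically merged cell (assumed).
    """
    names = []
    for column in zip(*levels):
        parts = [part for part in column if part]   # skip None and ""
        names.append("_".join(parts))
    return names
```

Joining with an underscore mirrors the "basic information_name" form given in the text.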
When the table layout is multi-TL, the table formed by the direct content also needs a cut-and-merge operation that converts it into a single-TL layout, to meet the format requirements of structured data. The cut-and-merge operation comprises: comparing the direct content of the TLs; where TLs have identical content, keeping only one TL row, as shown in Fig. 5; where TLs have different content, splicing them into a single TL row, as shown in Fig. 6.
Finally, for merged cells, the special marks can be corrected according to business needs. For example
ABC {←} {←} {←} {←}
Can be adjusted to following form:
ABC ABC ABC ABC ABC
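This correction amounts to replacing each mark with the resolved content of the cell to its left. A minimal sketch, with resolve_markers as a hypothetical name:

```python
def resolve_markers(row, mark="{←}"):
    """Replace every mark with the content of the cell to its left."""
    resolved = []
    for cell in row:
        if cell == mark and resolved:
            resolved.append(resolved[-1])   # copy the already resolved neighbor
        else:
            resolved.append(cell)
    return resolved
```

Because each mark copies the already resolved neighbor, a run of marks after one value all receive that value.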
The above method of adaptively extracting the structured information in an HTML Table tag applies to a single table. When a web page contains several Table tags (several tables), the same method is simply applied repeatedly to extract the table corresponding to each Table tag, and the extraction results are then merged according to predetermined rules.
For data extraction from div layouts, step S24 specifically comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout from the positions of the labels within the div tags, and obtaining the business data according to the structural layout.
The known business field names may be given in advance or obtained from statistics over previously parsed data. A label here is a field name inside a div tag, such as "name", "age", and "sex" in Example one. Example one and Example two are tables laid out with div tags. For instance, the three field names "name", "age", and "sex" are extracted from the div tags according to the known business field names. In Example one, each extracted label is followed by its value in the tag to its right, so the structural layout of the table can be determined to be a left-right key-value layout (vertical layout). In Example two, the extracted labels all lie in one row of tags, so the structural layout of the table can be determined to be a top-bottom layout (horizontal layout).
Example one
<div><div>Name</div><div>Zhang San</div></div>
<div><div>Age</div><div>18</div></div>
<div><div>Sex</div><div>Man</div></div>
Example two
<div><div>Name</div><div>Age</div><div>Sex</div></div>
<div><div>Zhang San</div><div>18</div><div>Man</div></div>
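The layout judgment for the two examples can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not the patented implementation: each outer div is treated as a row and each inner div as a cell, the set `KNOWN_FIELDS` stands in for the known business field names, and the names `DivRows` and `extract_div_records` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple

KNOWN_FIELDS = {"Name", "Age", "Sex"}  # assumed known business field names


class DivRows(HTMLParser):
    """Collect the text of each inner <div>, grouped by its outer <div>."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.rows: List[List[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.depth += 1
        if self.depth == 1:
            self.rows.append([])      # an outer div starts a new row
        elif self.depth == 2:
            self.rows[-1].append("")  # an inner div starts a new cell

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 2:
            self.rows[-1][-1] += data.strip()


def extract_div_records(html: str, known_fields=KNOWN_FIELDS) -> Tuple[str, List[Dict[str, str]]]:
    """Return the judged layout and the business data extracted under it."""
    parser = DivRows()
    parser.feed(html)
    rows = [row for row in parser.rows if row]
    if not rows:
        return "unknown", []
    if all(cell in known_fields for cell in rows[0]):
        # All labels share the first row: top-bottom layout.
        return "top-bottom", [dict(zip(rows[0], row)) for row in rows[1:]]
    if all(len(row) >= 2 and row[0] in known_fields for row in rows):
        # One label per row, value beside it: left-right key-value layout.
        return "left-right", [{row[0]: row[1] for row in rows}]
    return "unknown", []
```

Under these assumptions, both examples above yield the same record, which is the point of the adaptive step: the extracted business data does not depend on which of the two layouts the page uses.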
Based on the same inventive concept as the above data processing method for improving web crawler stability and availability, this embodiment further provides a data processing device for improving web crawler stability and availability. As shown in Fig. 8, the device includes: a structural change detection module, configured to judge, according to features specified in advance, whether a local structural change has occurred in the current page; a parsing module, configured to, if no structural change has occurred, obtain the structural layout of the current page and parse the content of the current page according to that structural layout; and a field adaptive adjustment module, configured to adaptively map the business field names obtained by parsing according to pre-configured mapping rules, and to store the result in a storage area.
The data processing method for improving web crawler stability and availability provided by this embodiment can automatically identify non-structural changes in web pages and uses adaptive data extraction logic, so frequent maintenance is not needed. This saves cost, improves the stability of web data crawling, and gives the method better general applicability.
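The adaptive field mapping can be illustrated with a small sketch. The rule table, the stored field names, and the list used as a storage area are all assumptions made for illustration; the text only states that parsed business field names are mapped according to pre-configured mapping rules and then stored.

```python
from typing import Dict, List

# Hypothetical pre-configured mapping rules: parsed field name -> stored field name.
MAPPING_RULES: Dict[str, str] = {
    "Name": "name",
    "Full Name": "name",
    "Age": "age",
    "Sex": "sex",
    "Gender": "sex",
}


def map_fields(record: Dict[str, str], rules: Dict[str, str] = MAPPING_RULES) -> Dict[str, str]:
    """Rename parsed business fields to their stored names."""
    mapped: Dict[str, str] = {}
    for field, value in record.items():
        key = rules.get(field.strip())
        if key is not None:  # assumption: fields without a rule are dropped
            mapped[key] = value
    return mapped


def store(record: Dict[str, str], storage: List[Dict[str, str]]) -> None:
    """Map a parsed record and append it to an in-memory storage area."""
    storage.append(map_fields(record))
```

With rules like these, a page that renames "Sex" to "Gender" still produces records with the same stored field names, so downstream consumers are unaffected by the wording change.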
Further, the structural change detection module is specifically configured to: compare, one by one, the features specified in advance with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
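One way to picture this one-by-one comparison is sketched below. The form of a "feature" is an assumption here: each feature is taken to be an element id together with the tag name expected to carry it. The names `TagIndex` and `structure_changed` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict


class TagIndex(HTMLParser):
    """Record the tag name of every element that carries an id attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.tags_by_id: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:
            self.tags_by_id[element_id] = tag  # tag names arrive lowercased


def structure_changed(html: str, features: Dict[str, str]) -> bool:
    """Return True if any pre-specified feature no longer matches the page."""
    index = TagIndex()
    index.feed(html)
    for element_id, expected_tag in features.items():
        # A missing id or a different tag both count as an inconsistency.
        if index.tags_by_id.get(element_id) != expected_tag:
            return True
    return False
```

Expected tag names should be given in lowercase, since `HTMLParser` lowercases tag names before calling `handle_starttag`.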
Further, the parsing module is specifically configured to: obtain the HTML file of the current page; extract the content in the Table tags and the content in the div tags from the HTML file; obtain the structural layout of the current page according to the content in the Table tags, and parse the content according to the structural layout of the current page; and obtain the structural layout of the current page according to the content in the div tags, and parse the content according to the structural layout of the current page.
Further, in the parsing module, obtaining the structural layout of the current page according to the content in the Table tags and parsing the content according to the structural layout of the current page includes: detecting the title part in the Table tag; extracting the multi-dimensional information in the Table tag other than the title part; judging the structural layout according to the extracted multi-dimensional information; and obtaining the business data according to the structural layout.
Further, in the parsing module, obtaining the structural layout of the current page according to the content in the div tags and parsing the content according to the structural layout of the current page includes: obtaining from the div tags the labels that match known business field names, judging the structural layout according to the positions of the matched labels in the div tags, and obtaining the business data according to the structural layout.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the scope of the claims and the specification of the present invention.

Claims (10)

1. A data processing method for improving web crawler stability and availability, characterized by comprising:
Step S1: judging, according to features specified in advance, whether a local structural change has occurred in the current page;
Step S2: if no structural change has occurred, obtaining the structural layout of the current page, and parsing the content of the current page according to the structural layout of the current page;
Step S3: adaptively mapping, according to pre-configured mapping rules, the business field names obtained by parsing, and storing the result in a storage area.
2. The method according to claim 1, characterized in that step S1 comprises: comparing, one by one, the features specified in advance with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
3. The method according to claim 1, characterized in that step S2 comprises:
obtaining the HTML file of the current page;
extracting the content in the Table tags and the content in the div tags from the HTML file;
obtaining the structural layout of the current page according to the content in the Table tags, and parsing the content according to the structural layout of the current page;
obtaining the structural layout of the current page according to the content in the div tags, and parsing the content according to the structural layout of the current page.
4. The method according to claim 3, characterized in that obtaining the structural layout of the current page according to the content in the Table tags and parsing the content according to the structural layout of the current page comprises:
detecting the title part in the Table tag;
extracting the multi-dimensional information in the Table tag other than the title part;
judging the structural layout according to the extracted multi-dimensional information;
obtaining the business data according to the structural layout.
5. The method according to claim 3, characterized in that obtaining the structural layout of the current page according to the content in the div tags and parsing the content according to the structural layout of the current page comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout according to the positions of the matched labels in the div tags, and obtaining the business data according to the structural layout.
6. A data processing device for improving web crawler stability and availability, characterized by comprising:
a structural change detection module, configured to judge, according to features specified in advance, whether a local structural change has occurred in the current page;
a parsing module, configured to, if no structural change has occurred, obtain the structural layout of the current page, and parse the content of the current page according to the structural layout of the current page;
a field adaptive adjustment module, configured to adaptively map, according to pre-configured mapping rules, the business field names obtained by parsing, and to store the result in a storage area.
7. The device according to claim 5, characterized in that the structural change detection module is specifically configured to: compare, one by one, the features specified in advance with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
8. The device according to claim 5, characterized in that the parsing module is specifically configured to:
obtain the HTML file of the current page;
extract the content in the Table tags and the content in the div tags from the HTML file;
obtain the structural layout of the current page according to the content in the Table tags, and parse the content according to the structural layout of the current page;
obtain the structural layout of the current page according to the content in the div tags, and parse the content according to the structural layout of the current page.
9. The device according to claim 8, characterized in that, in the parsing module, obtaining the structural layout of the current page according to the content in the Table tags and parsing the content according to the structural layout of the current page comprises:
detecting the title part in the Table tag;
extracting the multi-dimensional information in the Table tag other than the title part;
judging the structural layout according to the extracted multi-dimensional information;
obtaining the business data according to the structural layout.
10. The device according to claim 8, characterized in that, in the parsing module, obtaining the structural layout of the current page according to the content in the div tags and parsing the content according to the structural layout of the current page comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout according to the positions of the matched labels in the div tags, and obtaining the business data according to the structural layout.
CN201611243842.5A 2016-12-29 2016-12-29 Data processing method and device for improving stability and usability of web crawler Active CN106777281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611243842.5A CN106777281B (en) 2016-12-29 2016-12-29 Data processing method and device for improving stability and usability of web crawler

Publications (2)

Publication Number Publication Date
CN106777281A true CN106777281A (en) 2017-05-31
CN106777281B CN106777281B (en) 2020-07-17

Family

ID=58928579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611243842.5A Active CN106777281B (en) 2016-12-29 2016-12-29 Data processing method and device for improving stability and usability of web crawler

Country Status (1)

Country Link
CN (1) CN106777281B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6938170B1 (en) * 2000-07-17 2005-08-30 International Business Machines Corporation System and method for preventing automated crawler access to web-based data sources using a dynamic data transcoding scheme
CN101576891A (en) * 2008-05-05 2009-11-11 北京瑞佳晨科技有限公司 Method for analyzing web page form object nodes
CN102254009A (en) * 2011-07-15 2011-11-23 福建星网锐捷通讯股份有限公司 Method for extracting data of webpage table
CN103198069A (en) * 2012-01-06 2013-07-10 株式会社理光 Method and device for extracting relational table
CN104598462A (en) * 2013-10-30 2015-05-06 深圳市国信互联科技有限公司 Method and device for extracting structural data
CN103942335A (en) * 2014-05-07 2014-07-23 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN104767757A (en) * 2015-04-17 2015-07-08 国家电网公司 Multiple-dimension security monitoring method and system based on WEB services
CN105975395A (en) * 2016-05-30 2016-09-28 深圳市华傲数据技术有限公司 Website state reconnaissance method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴信才: "《不动产登记信息系统实用指南》", 31 October 2016 *
胡配祥: "《ASP.NET程序设计项目教程》", 31 July 2016 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463669A (en) * 2017-08-03 2017-12-12 深圳市华傲数据技术有限公司 The method and device for the web data that parsing reptile crawls
CN107463669B (en) * 2017-08-03 2020-05-05 深圳市华傲数据技术有限公司 Method and device for analyzing webpage data crawled by crawler
CN108647279A (en) * 2018-05-03 2018-10-12 山东浪潮通软信息科技有限公司 Sheet disposal method, apparatus, medium and storage control based on field multiplexing
CN109657125A (en) * 2018-12-14 2019-04-19 平安城市建设科技(深圳)有限公司 Data processing method, device, equipment and storage medium based on web crawlers
CN109948018A (en) * 2019-01-10 2019-06-28 北京大学 A kind of Web structural data rapid extracting method and system
CN109948018B (en) * 2019-01-10 2021-05-25 北京大学 Method and system for rapidly extracting Web structured data

Also Published As

Publication number Publication date
CN106777281B (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN106777259A (en) The method and device of structured message in adaptive decimation HTML Table labels
CN110795919B (en) Form extraction method, device, equipment and medium in PDF document
CN106709032B (en) Method and device for extracting structured information in electronic form document
CN106156239B (en) Table extraction method and device
CN110968667B (en) Periodical and literature table extraction method based on text state characteristics
CN106777281A (en) For improving web crawlers stability, the data processing method of availability and device
CN108664574B (en) Information input method, terminal equipment and medium
US20030140311A1 (en) Method for content mining of semi-structured documents
CN111582169B (en) Image recognition data error correction method, device, computer equipment and storage medium
CN102314497B (en) Method and equipment for identifying body contents of markup language files
CN107016001A (en) A kind of data query method and device
CN101727461A (en) Method for extracting content of web page
CN110427488B (en) Document processing method and device
CN102737012A (en) Text information comparison method and system
CN109492177B (en) web page blocking method based on web page semantic structure
CN103440232A (en) Automatic sScientific paper standardization automatic detecting and editing method
Klampfl et al. An unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles
CN107844468A (en) The cross-page recognition methods of form data, electronic equipment and computer-readable recording medium
CN109165373B (en) Data processing method and device
CN107315989A (en) For the text recognition method and device of medical information picture
US9280528B2 (en) Method and system for processing and learning rules for extracting information from incoming web pages
CN114201620A (en) Method, apparatus and medium for mining PDF tables in PDF file
CN110738050A (en) Text recombination method, device and medium based on word segmentation and named entity recognition
CN110390037B (en) Information classification method, device and equipment based on DOM tree and storage medium
CN107145947B (en) Information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong

Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: 518000 units J and K, 12 / F, block B, building 7, Baoneng Science Park, Qinghu Industrial Zone, Qingxiang Road, Longhua New District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.