CN106777281A - Data processing method and device for improving the stability and availability of web crawlers - Google Patents
- Publication number
- CN106777281A (application CN201611243842.5A)
- Authority
- CN
- China
- Prior art keywords
- current page
- content
- structural layout
- layout
- parsing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/951—Indexing; Web crawling techniques
Abstract
The present invention relates to a data processing method and device for improving the stability and availability of web crawlers. The method provided by the invention comprises: step S1, judging, according to pre-specified features, whether a local structural change has occurred in the current page; step S2, if no structural change has occurred, obtaining the structural layout of the current page and parsing the content of the current page according to that layout; step S3, adaptively mapping the business field names obtained by parsing according to pre-configured mapping rules, and storing the result in a storage area. The data processing method and device provided by the invention can automatically recognize non-structural changes of web pages and use adaptive data extraction logic, so frequent maintenance is not needed.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a data processing method and device for improving the stability and availability of web crawlers.
Background
With the popularization and development of the Internet, information of all kinds, from e-commerce websites, portal sites, blogs, microblogs and so on, is published online, and people can collect massive amounts of information through the Internet and analyze and count it to obtain the information they need.
The existing approach is to acquire information with web crawler technology. Leaving aside binary content such as pictures and videos, what a web crawler generally obtains is the text content of web pages, and a traditional crawler parses the information using regular expressions, XPath, or position.
The problem, however, is that web pages change dynamically: for example, the position of a business field name or field value, the id of an HTML tag, or an XPath path may change at any time. The dynamic nature of web pages means that web crawlers require frequent maintenance; existing web crawlers therefore generalize poorly and are costly to maintain.
Summary of the invention
In view of the defects in the prior art, the present invention provides a data processing method and device for improving the stability and availability of web crawlers, which can automatically recognize non-structural changes of web pages and use adaptive data extraction logic, so that frequent maintenance is not needed.
In a first aspect, the present invention provides a data processing method for improving the stability and availability of web crawlers, comprising: step S1, judging, according to pre-specified features, whether a local structural change has occurred in the current page; step S2, if no structural change has occurred, obtaining the structural layout of the current page and parsing the content of the current page according to that layout; step S3, adaptively mapping the business field names obtained by parsing according to pre-configured mapping rules, and storing the result in a storage area.
The data processing method provided by the invention can automatically recognize non-structural changes of web pages and uses adaptive data extraction logic, so frequent maintenance is not needed. This saves cost, improves the stability of web data crawling, and generalizes better.
Preferably, step S1 comprises: comparing the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
Preferably, step S2 comprises: obtaining the HTML file of the current page; extracting the content of the Table tags and the content of the div tags from the HTML file; obtaining the structural layout of the current page from the content of the Table tags and parsing the content according to that layout; and obtaining the structural layout of the current page from the content of the div tags and parsing the content according to that layout.
Preferably, obtaining the structural layout of the current page from the content of the Table tags and parsing the content according to that layout comprises: detecting the title part in the Table tag; extracting multi-dimensional information from the Table tag, excluding the title part; judging the structural layout from the extracted multi-dimensional information; and obtaining the business data according to the structural layout.
Preferably, obtaining the structural layout of the current page from the content of the div tags and parsing the content according to that layout comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout from the positions of the matched labels within the div tags, and obtaining the business data according to the structural layout.
In a second aspect, the present invention provides a data processing device for improving the stability and availability of web crawlers, comprising: a structural change detection module, configured to judge, according to pre-specified features, whether a local structural change has occurred in the current page; a parsing module, configured to obtain, if no structural change has occurred, the structural layout of the current page and parse the content of the current page according to that layout; and a field adaptive adjustment module, configured to adaptively map the business field names obtained by parsing according to pre-configured mapping rules, and store the result in a storage area.
The data processing device provided by the invention can automatically recognize non-structural changes of web pages and uses adaptive data extraction logic, so frequent maintenance is not needed. This saves cost, improves the stability of web data crawling, and generalizes better.
Preferably, the structural change detection module is specifically configured to: compare the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
Preferably, the parsing module is specifically configured to: obtain the HTML file of the current page; extract the content of the Table tags and the content of the div tags from the HTML file; obtain the structural layout of the current page from the content of the Table tags and parse the content according to that layout; and obtain the structural layout of the current page from the content of the div tags and parse the content according to that layout.
Preferably, in the parsing module, obtaining the structural layout of the current page from the content of the Table tags and parsing the content according to that layout comprises: detecting the title part in the Table tag; extracting multi-dimensional information from the Table tag, excluding the title part; judging the structural layout from the extracted multi-dimensional information; and obtaining the business data according to the structural layout.
Preferably, in the parsing module, obtaining the structural layout of the current page from the content of the div tags and parsing the content according to that layout comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout from the positions of the matched labels within the div tags, and obtaining the business data according to the structural layout.
Brief description of the drawings
Fig. 1 is a flow chart of the data processing method for improving the stability and availability of web crawlers provided by an embodiment of the present invention;
Fig. 2 shows the layout of the title part, notes part, and business data part in an example table;
Fig. 3 is an example of a vertical multi-TL layout;
Fig. 4 is an example of a horizontal multi-TL layout;
Fig. 5 is an example of cutting and merging a table with a multi-TL layout;
Fig. 6 is an example of cutting and merging a table with a multi-TL layout;
Fig. 7 is an example of processing a table with a single-TL (multi-level) layout;
Fig. 8 is a structural block diagram of the data processing device for improving the stability and availability of web crawlers provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the technical solution of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution more clearly; they are examples only and do not limit the scope of protection of the present invention.
It should be noted that, unless otherwise stated, technical or scientific terms used in this application have the ordinary meaning understood by those of ordinary skill in the art to which the invention belongs.
Tables in a web page are defined by the HTML <table> tag. Table rows are defined by the <tr> tag; a <tr> must sit inside a <table></table> and cannot be used alone. Each row is divided into cells, each defined by a <td> tag, which must be nested inside <tr></tr>. Like <td>, <th> must also be nested inside <tr>; <th>...</th> defines a header cell, which contains the table's header information. Usage is as follows:
The above code is displayed on a web page as the following table:
Name | Age |
Zhang San | 40 |
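The listing that the text refers to is not reproduced here. As a minimal sketch, assuming markup reconstructed from the rendered rows above (the markup and the TableReader helper are illustrative and are not taken from the patent), such a table can be written and read with Python's standard html.parser module:

```python
from html.parser import HTMLParser

# Assumed markup, reconstructed from the rendered table; not the patent's listing.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Zhang San</td><td>40</td></tr>
</table>
"""


class TableReader(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of (tag, text) pairs per <tr>
        self._cell_tag = None   # "th" or "td" while inside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell_tag = tag
            self._text = []

    def handle_data(self, data):
        if self._cell_tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_tag == tag:
            self.rows[-1].append((tag, "".join(self._text).strip()))
            self._cell_tag = None


def read_table(html):
    """Return the table as a list of rows of (tag, text) pairs."""
    reader = TableReader()
    reader.feed(html)
    return reader.rows
```

Keeping the tag name next to each cell text preserves the th/td distinction that the layout analysis described later relies on.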
The div tag in HTML defines a division or section of a document. The <div> tag divides a document into independent, distinct parts. It can serve as a strict organizational tool and carries no formatting of its own.
This embodiment provides a data processing method for improving the stability and availability of web crawlers which, as shown in Fig. 1, comprises:
Step S1: according to pre-specified features, judge whether a local structural change has occurred in the current page.
Here, the pre-specified features refer to the structural layout of the web page, which shows up in HTML as the type, position, attributes and so on of tags. A structural change means that the structural layout of the page has changed, for example: a tag has disappeared, an attribute of a tag has changed, or the number of rows or columns of a Table has changed.
Step S2: if no structural change has occurred, obtain the structural layout of the current page and parse the content of the current page according to that layout.
Step S3: according to pre-configured mapping rules, adaptively map the business field names obtained by parsing, and store the result in a storage area.
Here, a business field name is the title of an item of business data, such as "executing court" and "execution case number" in Fig. 2. Adaptive mapping means replacing the parsed business field names with predetermined standard fields, so that the business field names of the extracted data are unified and later management and statistics are easier. For example, "enterprise name" and "organization name" are automatically mapped to "company name" in the storage area.
The data processing method provided by this embodiment can automatically recognize non-structural changes of web pages and uses adaptive data extraction logic, so frequent maintenance is not needed. This saves cost, improves the stability of web data crawling, and generalizes better.
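The adaptive mapping of step S3 can be sketched as a lookup from known aliases to a standard field name. This is a minimal sketch, assuming a rule format of "standard field to list of aliases"; MAPPING_RULES, build_alias_index and map_record are hypothetical names, and the rule contents are only the example given in the text.

```python
# Hypothetical pre-configured mapping rules: standard field -> known aliases.
MAPPING_RULES = {
    "company name": ["enterprise name", "organization name", "company name"],
}


def build_alias_index(rules):
    """Invert the rules into an alias -> standard field lookup table."""
    index = {}
    for standard, aliases in rules.items():
        for alias in aliases:
            index[alias.strip().lower()] = standard
    return index


def map_record(record, rules=MAPPING_RULES):
    """Rename the parsed field names of one record to standard field names.

    Field names with no matching rule are kept unchanged.
    """
    index = build_alias_index(rules)
    return {
        index.get(name.strip().lower(), name): value
        for name, value in record.items()
    }
```

Under these assumptions, map_record({"Enterprise Name": "Acme", "Age": "40"}) renames only the first field; unknown field names pass through, so adding a site with new aliases means extending the rules rather than the code.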
Step S1 specifically comprises: comparing the pre-specified features one by one with the corresponding tags of the current page; if they are inconsistent, a local structural change is considered to have occurred in the current page.
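The one-by-one comparison of step S1 can be sketched as follows. This is an illustration only: the feature format (position of the tag in document order, tag name, required attributes) is an assumption, since the text does not fix how the pre-specified features are stored, and TagCollector and has_structural_change are hypothetical names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag in document order as (tag, attribute dict)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def has_structural_change(html, expected_features):
    """Compare pre-specified features against the tags of the current page.

    expected_features: list of (position, tag, required_attrs) tuples,
    a hypothetical format chosen for this sketch.
    """
    collector = TagCollector()
    collector.feed(html)
    for position, tag, required_attrs in expected_features:
        if position >= len(collector.tags):
            return True                      # the tag has disappeared
        actual_tag, actual_attrs = collector.tags[position]
        if actual_tag != tag:
            return True                      # tag type changed at this position
        for name, value in required_attrs.items():
            if actual_attrs.get(name) != value:
                return True                  # an attribute changed
    return False
```

The function stops at the first mismatch, which matches the rule that any inconsistency is enough to report a change.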
An HTML file may contain several types of tags, such as Table tags and div tags, and different tags call for different extraction methods. To handle HTML of mixed types, step S2 specifically comprises:
Step S21: obtain the HTML file of the current page.
Step S22: extract the content of the Table tags and the content of the div tags from the HTML file.
Step S23: obtain the structural layout of the current page from the content of the Table tags, and parse the content according to that layout.
Step S24: obtain the structural layout of the current page from the content of the div tags, and parse the content according to that layout.
Tables on web pages are authored with HTML Table tags, and most of this information is semi-structured data: although it looks fairly regular on the page, the underlying tags and data are irregular or even chaotic, so the title part is mixed with the business data and the business data cannot be extracted quickly and accurately. To extract the data in web page tables automatically, quickly and accurately, step S23 specifically comprises:
Step S231: detect the title part in the Table tag.
As shown in Fig. 2, the title part of a table is usually one large merged cell, spanning one or more rows. A table may also contain a notes part, whose structure is similar to that of the title part. Once the title part and notes part are removed, what remains is the business data to be extracted. When the table contains a notes part, this step must also detect it, in the same way as the title part.
Step S232: extract multi-dimensional information from the Table tag, excluding the title part.
Here, the multi-dimensional information includes: direct content, th/td distribution, class attribute distribution, background-color attribute distribution, and so on. Direct content is the content displayed directly in the table on the web page, i.e. the text content inside the <table> tag, such as "Name", "Age", "Zhang San", "40". The th/td distribution is the positions of th and td tags within the table. The class attribute specifies the class name of the element in a cell, and the class attribute distribution is the positions of class attribute values within the table. The background-color attribute defines the background color of a cell, and the background-color attribute distribution is the positions of background-color values within the table.
Step S233: judge the structural layout from the extracted multi-dimensional information.
Common table layouts are: horizontal single-TL, horizontal multi-TL, vertical single-TL, vertical multi-TL, and multi-table combination. A TL (TitleLine) is the header line (the data header part); physically it may span several rows, but logically it is one region. It gives the title of each item of business data; for example, the first row of the business data part in Fig. 2 is a TL. A TL can be horizontal or vertical: Fig. 3 shows a vertical multi-TL layout, and Fig. 4 shows a horizontal multi-TL layout.
Step S234: obtain the business data according to the structural layout.
Step S23 provides a method for adaptively extracting the structured information in HTML Table tags. It first detects the title part in the Table tag, which excludes content that does not belong to the business data and keeps useless information out. It then extracts the multi-dimensional information of the Table tag apart from the title part, and judges the structural layout of the table from that information taken together. Because the information in the Table tag reflects the table layout, however the table in the web page changes, the new table layout can be obtained by analyzing the information in the Table tag. The method of step S23 therefore does not need to know the table layout in advance, and no program has to be rewritten for HTML Tables of different structure. This addresses the lack of generality of existing table extraction algorithms and improves the reliability of the extracted data; it is especially effective for recognizing and extracting semi-structured data at scale.
The title part and the notes part are generally in the first or second row of the table, and each is a single merged cell. The specific implementation of step S231 therefore comprises: checking, row by row, whether each row in the Table tag is a single merged cell; if it is, the row belongs to the title part and the next row is checked; if it is not, business data begins at that row and title detection stops. For example, the code of a title part or notes part usually has the following form:
<tr><td colspan='5'>2016 Personnel Information Statistics</td></tr>
The above code contains only one <td> tag, and colspan='5' shows that it is a merged cell; by detecting <td> and colspan, the title part and notes part can be identified.
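The row-by-row check of step S231 can be sketched as below. It is a minimal sketch, assuming each <tr> has already been parsed into a list of (text, colspan) pairs; that representation and the function names are hypothetical, and only leading title or notes rows are handled.

```python
def is_single_merged_cell(row):
    """row: list of (text, colspan) pairs parsed from one <tr> (assumed format)."""
    return len(row) == 1 and row[0][1] > 1


def split_title_rows(rows):
    """Return (title_rows, remaining_rows).

    Rows are scanned from the top; scanning stops at the first row that is
    not a single merged cell, because business data begins there.
    """
    for i, row in enumerate(rows):
        if not is_single_merged_cell(row):
            return rows[:i], rows[i:]
    return rows, []
```

Because the scan is driven by the shape of each row rather than by a fixed row count, a table with one, two, or more title rows is handled by the same code.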
In the prior art, filtering out useless information (such as the title part and notes part) requires knowing its position in advance and hard-coding in the program how many leading rows to skip. The method of this embodiment is more general: however many rows the title part and notes part of a table occupy, they can be detected accurately and efficiently, ensuring that the business data is extracted accurately.
During extraction, besides reading the information of standard cells directly, merged cells need special treatment so that the extracted data fits the storage format and is easy to process later. A preferred implementation of step S232 therefore comprises: extracting the multi-dimensional information of the Table tag apart from the title part (and the notes part, if any); splitting the merged cells in the extracted information; storing the information of each dimension in its own two-dimensional array; and applying a special mark to the cells produced by splitting.
Merged cells are further divided into horizontal merges (colspan), vertical merges (rowspan), and mixed merges (colspan+rowspan). For example, extracting the direct content of <td colspan='5' bgcolor="#F7FBFE">ABC</td> gives:
ABC | {←} | {←} | {←} | {←} |
Here the special mark "{←}" is specific to the extraction of direct content: it indicates that the content of the cell is the same as that of the cell to its left, which gives flexibility in handling TLs and in the final content output. The extraction of the other dimensions needs no special mark.
The extracted background-color attribute distribution is:
#F7FBFE | #F7FBFE | #F7FBFE | #F7FBFE | #F7FBFE |
When a single row contains several horizontal merges (colspan), the shifting of column coordinates must also be taken into account. For example, the direct content of
<td colspan='2'>ABC</td><td colspan='3'>DEF</td>
is extracted as:
ABC | {←} | DEF | {←} | {←} |
Vertical merges (rowspan) and mixed merges (colspan+rowspan) are extracted in a similar way.
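Splitting horizontally merged cells can be sketched as follows. The input format, one (value, colspan) pair per cell, and the function names are assumptions made for this sketch; rowspan and mixed merges are not handled here.

```python
LEFT_MARK = "{←}"   # the special mark described in the text


def expand_content(cells):
    """Split colspan cells of direct content; split-off cells get the mark.

    cells: list of (text, colspan) pairs for one row (assumed format).
    """
    out = []
    for text, colspan in cells:
        out.append(text)
        out.extend([LEFT_MARK] * (colspan - 1))
    return out


def expand_attribute(cells):
    """Split colspan cells of an attribute dimension by repeating the value.

    cells: list of (value, colspan) pairs, e.g. background-color values.
    """
    out = []
    for value, colspan in cells:
        out.extend([value] * colspan)
    return out
```

Appending to the output list in order takes care of the coordinate shift: the second merged cell starts at whatever column the first one ended on.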
Only when the table layout is known can the business data be extracted accurately and the table be converted into structured data according to its layout. Judging the table layout in step S233 involves the following operations:
(1) According to the extracted direct content, exclude the rows and columns that are not TLs.
Exclusion is based on the data type, length, and keywords of the direct content in a TL. The criteria are: the field name in each cell of a TL cannot be longer than a threshold (e.g. 50); the number of field names in a TL cannot exceed a threshold (e.g. 1000); a field name is unlikely to be a purely numeric string; and common field names contain keywords such as "title", "Name", "location", "Address", "address", "type", "remarks". A keyword library is built from statistics over common tables, and each row or column is checked for keywords from that library.
The concrete steps of judging the table layout from direct content are therefore: examine the extracted direct content row by row and column by column; if the data type of a direct content item is a numeric string, the row or column containing it is not a TL; if the length of a direct content item exceeds the first threshold, the row or column containing it is not a TL; if several direct content items in a row or column contain the given keywords, that row or column is a TL.
When the keyword-based criterion is used, at least two keywords must appear before a row or column is judged to be a TL, to keep the judgment reliable.
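The content-based criteria of operation (1) can be sketched as a three-way classification of one row or column. The keyword set below is a small assumed sample rather than the statistically derived library the text describes, and classify_line is a hypothetical name.

```python
import re

MAX_FIELD_NAME_LENGTH = 50   # example threshold from the text
KEYWORDS = {"name", "address", "location", "type", "remarks"}   # assumed sample


def classify_line(cells):
    """Classify one row or column of direct content.

    Returns "not_tl" if a criterion rules the line out, "tl" if at least
    two cells contain a keyword, and "unknown" otherwise.
    """
    values = [c.strip() for c in cells if c and c.strip()]
    for value in values:
        if re.fullmatch(r"\d+(\.\d+)?", value):
            return "not_tl"      # a field name is unlikely to be numeric
        if len(value) > MAX_FIELD_NAME_LENGTH:
            return "not_tl"      # too long to be a field name
    hits = sum(1 for v in values if any(k in v.lower() for k in KEYWORDS))
    return "tl" if hits >= 2 else "unknown"
```

Returning "unknown" rather than forcing a decision leaves room for the other dimensions (background color, class, data type, th/td) to settle the cases that content alone cannot.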
(2) According to the extracted background-color attribute distribution, judge the table layout.
To make tables easier to read, the background color of a table's TL usually differs from that of the data, or odd and even data rows use alternating background colors. The background-color attribute distribution can therefore be used to judge which rows or columns may be TLs, and hence whether the table layout is horizontal or vertical.
(3) According to the extracted class attribute distribution, judge the table layout.
Cells with the same class attribute are usually cells of the same kind. If the cells in every row have the same class attribute, the table layout is horizontal; if the cells in every column have the same class attribute, the table layout is vertical. Horizontal or vertical layout can therefore be judged from the class attribute distribution.
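Operation (3) can be sketched as a uniformity test over a two-dimensional array of class values. This is a sketch under one reading of the rule, namely that "the same class attribute" is checked within each row or within each column; layout_by_class and the sample class names are hypothetical.

```python
def layout_by_class(class_grid):
    """Judge layout from a 2D array of class attribute values.

    class_grid: list of rows, each a list of class names (merged cells are
    assumed to have been split already, so all rows have equal length).
    """
    rows_uniform = all(len(set(row)) == 1 for row in class_grid)
    cols_uniform = all(len(set(col)) == 1 for col in zip(*class_grid))
    if rows_uniform and not cols_uniform:
        return "horizontal"
    if cols_uniform and not rows_uniform:
        return "vertical"
    return "unknown"    # both or neither: this dimension cannot decide
```

When every cell carries the same class, both tests pass and the function reports "unknown", leaving the decision to the other dimensions.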
(4) Judge the table layout by whether the direct content in the same row or the same column has the same data type.
In the business data of a table, leaving aside the TL, the cells under each TL field name should all have the same data type as long as they are not empty (this method distinguishes only 'purely numeric string', 'date-time string', and 'string without obvious features'). For example, the table in Fig. 2 has a horizontal layout: in its business data part, apart from the TL in the first row, the cells of each column all have the same data type. The column under the field name "serial number" contains only purely numeric strings, the column under "executing court" contains only strings without obvious features, and the column under "execution case number" contains only strings without obvious features. In short, apart from the TL row, every column has a single data type.
Based on this property, check whether the data types within each row are all the same: if this holds for every row of the table (that is, all cells in a row are 'purely numeric strings', or all 'date-time strings', or all 'strings without obvious features'), the table has a vertical layout. Likewise, check whether the data types within each column are all the same: if this holds for every column of the table (that is, all cells in a column are 'purely numeric strings', or all 'date-time strings', or all 'strings without obvious features'), the table has a horizontal layout.
Some cells may be empty. To keep them from affecting the result, empty cells are left out of the row and column checks.
The business data part of a table is usually large, and checking every row and column would slow the judgment down. Short-circuit evaluation can therefore be used: as soon as a new row rules out a layout, the check for that layout stops.
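Operation (4) can be sketched as below. The three data types follow the text; the date formats tried, the choice to drop the first row or first column as the candidate TL, and all function names are assumptions made for this sketch.

```python
import re
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y-%m-%d %H:%M:%S")   # assumed formats


def cell_type(value):
    """Classify a non-empty cell as "number", "datetime" or "plain"."""
    v = value.strip()
    if re.fullmatch(r"\d+", v):
        return "number"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(v, fmt)
            return "datetime"
        except ValueError:
            pass
    return "plain"


def lines_are_uniform(lines):
    """True if every line holds a single data type, ignoring empty cells."""
    for line in lines:
        types = {cell_type(c) for c in line if c.strip()}
        if len(types) > 1:
            return False        # short circuit: this layout is ruled out
    return True


def layout_by_type(grid):
    """Judge layout from a 2D array of direct content (equal-length rows)."""
    body_columns = list(zip(*grid[1:]))      # drop a candidate TL row
    if lines_are_uniform(body_columns):
        return "horizontal"
    body_rows = [row[1:] for row in grid]    # drop a candidate TL column
    if lines_are_uniform(body_rows):
        return "vertical"
    return "unknown"
```

The horizontal test runs first, so a table whose cells all share one type is reported as horizontal; a fuller implementation would combine this result with the other dimensions.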
(5) According to the th/td distribution, judge the table layout.
The number of cells in a TL is less than or equal to the number of cells in the other rows, and the number of cells in non-TL rows should be fairly uniform. The cell counts of all rows and columns are computed from the th/td distribution; a row or column with markedly fewer cells may be a TL, and from whether the TLs are horizontal or vertical, and how many there are, the table layout can be obtained.
th is usually used to define headings, which correspond to field names such as 'name' and 'age'. The layout of th may itself be horizontal or vertical: for example, a horizontal layout is <th colspan='3'>Result list</th>, and a vertical layout is <th rowspan='3'>Result list</th>.
td can be used to define ordinary cells, and can also be used to define headings.
When both th and td tags are present, the table layout is judged from the th distribution. Many tables do not use th, however; in that case the table layout is judged from the td distribution.
The methods above for judging the table layout can be combined in various ways as needed, which improves the accuracy of the judgment. In addition, the method of this embodiment can recognize tables with multiple TLs, which improves the reliability of the extracted data.
When the table layout is vertical, the table formed by the direct content must also be transposed into a horizontal layout.
TLs are further divided into single-level TLs and multi-level TLs, but unless a distinction is needed both are simply called TLs. Fig. 2 has only one TL, and it is a single-level TL. Fig. 7 has only one TL, and it is a multi-level TL (made up of several rows with a parent-child relationship); the field names in these rows must be merged to produce a single row of field names. As shown in Fig. 7, the TL of the original table has two parts: the left part has several rows (multi-level) and the right part has a single row. The first level of the multi-level part is a merged cell with the field name 'basic information', and the second level consists of the fields 'name', 'age', 'sex'. The final output is a single-level TL with the structure "basic information_name", "basic information_age", "basic information_sex", "other field A", "other field B".
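Merging a multi-level TL into one row of field names can be sketched as follows. The sketch assumes the header rows have already been split as in step S232, with "{←}" marking cells covered by a colspan and None marking cells covered by a rowspan; the use of None and the name flatten_header are assumptions, since the text defines only the left mark.

```python
LEFT_MARK = "{←}"


def flatten_header(levels, sep="_"):
    """Merge several header rows into a single row of field names.

    levels: list of header rows of equal length, top level first.
    """
    resolved = []
    for row in levels:
        out, last = [], None
        for cell in row:
            if cell == LEFT_MARK:
                cell = last          # copy the value of the cell to the left
            out.append(cell)
            last = cell
        resolved.append(out)

    names = []
    for column in zip(*resolved):
        parts = []
        for part in column:
            # Skip empty levels and avoid repeating the same name twice.
            if part and (not parts or parts[-1] != part):
                parts.append(part)
        names.append(sep.join(parts))
    return names
```

Reading each column from top to bottom yields the parent name followed by the child name, which is the "parent_child" structure described for Fig. 7.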
When the table layout is multi-TL, the table formed by the direct content must also be cut and merged to convert it to a single-TL layout, so that it meets the format requirements of structured data. The cut-and-merge operation comprises: comparing the direct content of the TLs; for TLs with identical content, keeping only one TL row, as shown in Fig. 5; and splicing TLs with different content into a single TL row, as shown in Fig. 6.
Finally, for merged cells, the special marks can be adjusted as the business requires. For example,
ABC | {←} | {←} | {←} | {←} |
can be adjusted to the following form:
ABC | ABC | ABC | ABC | ABC |
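The adjustment shown above, replacing each mark with the value of the cell to its left, can be sketched in a few lines; fill_left_marks is a hypothetical name.

```python
def fill_left_marks(row, mark="{←}"):
    """Replace each special mark with the value of the cell to its left."""
    out = []
    for cell in row:
        if cell == mark and out:
            out.append(out[-1])   # copy the already-resolved left neighbour
        else:
            out.append(cell)
    return out
```

Because each mark copies the already-resolved neighbour, a run of marks after one value fills with that value all the way across.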
The method above for adaptively extracting structured information from HTML Table tags handles a single table. When a web page contains several Table tags (several tables), the same method is simply applied repeatedly to extract the table corresponding to each Table tag, and the extraction results are then merged according to predetermined rules.
For data extraction from div layouts, step S24 specifically comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout from the positions of those labels within the div tags, and obtaining the business data according to the structural layout.
The known business field names can be given in advance or obtained from statistics over previously parsed data. A label is a field name inside a div tag, such as "Name", "Age" and "Sex" in example one. Example one and example two are tables laid out with div tags. For instance, the three field names "Name", "Age" and "Sex" are extracted from the div tags according to the known business field names. In example one, each extracted label sits to the left of its value, so the structural layout of the table is determined to be a left-right key-value layout (vertical layout); in example two, the extracted labels all sit in one row, so the structural layout of the table is determined to be a top-bottom layout (horizontal layout).
Example one
<div><div>Name</div><div>Zhang San</div></div>
<div><div>Age</div><div>18</div></div>
<div><div>Sex</div><div>Male</div></div>
Example two
<div><div>Name</div><div>Age</div><div>Sex</div></div>
<div><div>Zhang San</div><div>18</div><div>Male</div></div>
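The div-layout judgment of step S24 can be sketched for markup shaped like the two examples, where each outer div is a row and each inner div is a cell. KNOWN_FIELDS, DivGrid and parse_div_block are hypothetical names, and real pages with deeper nesting would need a more careful reader.

```python
from html.parser import HTMLParser

KNOWN_FIELDS = {"Name", "Age", "Sex"}   # assumed known business field names


class DivGrid(HTMLParser):
    """Reads <div><div>cell</div>...</div> blocks into rows of cell texts."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            if self.depth == 1:
                self.rows.append([])      # an outer div starts a new row

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 2 and data.strip():
            self.rows[-1].append(data.strip())


def parse_div_block(html, known_fields=KNOWN_FIELDS):
    """Return (layout, records) for a div-based table."""
    grid = DivGrid()
    grid.feed(html)
    rows = grid.rows
    if rows and all(cell in known_fields for cell in rows[0]):
        # All labels share one row: top-bottom (horizontal) layout.
        header = rows[0]
        return "horizontal", [dict(zip(header, row)) for row in rows[1:]]
    if rows and all(row and row[0] in known_fields for row in rows):
        # Each label sits left of its value: left-right (vertical) layout.
        return "vertical", [{row[0]: row[1] for row in rows if len(row) > 1}]
    return "unknown", []
```

Both example layouts reduce to the same record, which is the point of judging the layout before extracting: downstream code sees one structure regardless of how the page arranges its div tags.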
Based on the same inventive concept as the above data processing method for improving the stability and availability of a web crawler, this embodiment also provides a data processing device for improving the stability and availability of a web crawler. As shown in Figure 8, the device includes: a structural change detection module, configured to judge, according to pre-specified features, whether a local structural change has occurred in the current page; a parsing module, configured to, if no structural change has occurred, obtain the structural layout of the current page and parse the content of the current page according to that structural layout; and a field adaptive adjustment module, configured to adaptively map, according to pre-configured mapping rules, the business field names obtained by parsing, and to store the result in a storage area.
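As an illustration only, the adaptive mapping performed on the parsed business field names might look like the following Python sketch. The format of the mapping rules is not specified here, so the dictionary `MAPPING_RULES`, the function `map_fields`, the canonical names and the in-memory list standing in for the storage area are all assumptions.

```python
# Hypothetical mapping rules: observed field name -> canonical business field name.
MAPPING_RULES = {
    "name": "name",
    "full name": "name",
    "age": "age",
    "sex": "gender",
    "gender": "gender",
}


def map_fields(record, rules=MAPPING_RULES):
    """Rename parsed field names to canonical names; unknown names are kept as-is."""
    mapped = {}
    for field, value in record.items():
        key = field.strip().lower()  # normalise before looking up the rule
        mapped[rules.get(key, field)] = value
    return mapped


storage = []  # stand-in for the storage area
storage.append(map_fields({"Name": "Zhang San", "Age": "18", "Sex": "Man"}))
```

With rules of this kind, a page that renames "Sex" to "Gender" still yields the same stored field name, which is the sort of change the adaptive mapping is meant to absorb.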
The data processing method for improving the stability and availability of a web crawler provided by this embodiment can automatically identify non-structural changes in web pages and applies adaptive data extraction logic, so frequent maintenance is unnecessary. This saves cost, improves the stability of web data crawling, and gives the method better general applicability.
Further, the structural change detection module is specifically configured to: compare the pre-specified features one by one with the corresponding tags of the current page and, if they are inconsistent, consider that a local structural change has occurred in the current page.
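A minimal Python sketch of this one-by-one comparison is given below. The form of a "feature" is not defined in this passage, so the sketch assumes, purely for illustration, that each feature pairs an element id with the tag name expected for that id; `TagById` and `has_structural_change` are hypothetical names.

```python
from html.parser import HTMLParser


class TagById(HTMLParser):
    """Record the tag name of every element that carries an id attribute."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id is not None:
            self.tags[element_id] = tag


def has_structural_change(html, features):
    """Compare each pre-specified feature with the corresponding tag, one by one."""
    parser = TagById()
    parser.feed(html)
    parser.close()
    for element_id, expected_tag in features.items():
        if parser.tags.get(element_id) != expected_tag:
            return True   # inconsistent: treat as a local structural change
    return False


# Assumed feature set: the element with id "profile" is a div, "orders" is a table.
FEATURES = {"profile": "div", "orders": "table"}
```

A page whose "orders" element is still a table returns `False`; if that element becomes a div or disappears, the function returns `True`.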
Further, the parsing module is specifically configured to: obtain the HTML file of the current page; extract from the HTML file the content within Table tags and the content within div tags; obtain the structural layout of the current page from the content within the Table tags, and parse that content according to the structural layout of the current page; and obtain the structural layout of the current page from the content within the div tags, and parse that content according to the structural layout of the current page.
Further, in the parsing module, obtaining the structural layout of the current page from the content within the Table tags and parsing the content according to the structural layout of the current page includes: detecting the title portion within the Table tags; extracting the multi-dimensional information within the Table tags other than the title portion; judging the structural layout according to the extracted multi-dimensional information; and obtaining the business data according to the structural layout.
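The four Table steps can be sketched in Python as follows. This is a simplified sketch under stated assumptions: the title portion is taken to be the `th` cells, the "multi-dimensional information" is reduced to the position of those cells (first row versus first column), and the names `TableRows` and `parse_table` are hypothetical. The embodiment may use richer signals than this.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect [tag, text] pairs for the th/td cells of each tr."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self.rows[-1].append([tag, ""])
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1][1] += data.strip()


def parse_table(html):
    """Return (layout, records) judged from where the title cells sit."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    rows = [r for r in parser.rows if r]
    # Title cells fill the first row: top-bottom layout, one record per later row.
    if rows and all(tag == "th" for tag, _ in rows[0]):
        header = [text for _, text in rows[0]]
        body = [[text for _, text in r] for r in rows[1:]]
        return "top-bottom", [dict(zip(header, r)) for r in body]
    # Every row starts with a title cell: left-right key-value layout.
    if rows and all(len(r) >= 2 and r[0][0] == "th" for r in rows):
        return "left-right", [{r[0][1]: r[1][1] for r in rows}]
    return "unknown", []


HORIZONTAL = (
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Zhang San</td><td>18</td></tr></table>"
)
VERTICAL = (
    "<table><tr><th>Name</th><td>Zhang San</td></tr>"
    "<tr><th>Age</th><td>18</td></tr></table>"
)
```

Both sample tables yield the same record, which shows how business data can be obtained once the layout has been judged.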
Further, in the parsing module, obtaining the structural layout of the current page from the content within the div tags and parsing the content according to the structural layout of the current page includes: obtaining from the div tags the labels that match known business field names, judging the structural layout according to the positions of the matched labels within the div tags, and obtaining the business data according to the structural layout.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the scope of the claims and the specification of the present invention.
Claims (10)
1. A data processing method for improving the stability and availability of a web crawler, characterized by comprising:
Step S1: judging, according to pre-specified features, whether a local structural change has occurred in the current page;
Step S2: if no structural change has occurred, obtaining the structural layout of the current page, and parsing the content of the current page according to the structural layout of the current page;
Step S3: adaptively mapping, according to pre-configured mapping rules, the business field names obtained by parsing, and storing the result in a storage area.
2. The method according to claim 1, characterized in that step S1 comprises: comparing the pre-specified features one by one with the corresponding tags of the current page and, if they are inconsistent, considering that a local structural change has occurred in the current page.
3. The method according to claim 1, characterized in that step S2 comprises:
obtaining the HTML file of the current page;
extracting from the HTML file the content within Table tags and the content within div tags;
obtaining the structural layout of the current page from the content within the Table tags, and parsing the content according to the structural layout of the current page;
obtaining the structural layout of the current page from the content within the div tags, and parsing the content according to the structural layout of the current page.
4. The method according to claim 3, characterized in that obtaining the structural layout of the current page from the content within the Table tags and parsing the content according to the structural layout of the current page comprises:
detecting the title portion within the Table tags;
extracting the multi-dimensional information within the Table tags other than the title portion;
judging the structural layout according to the extracted multi-dimensional information;
obtaining the business data according to the structural layout.
5. The method according to claim 3, characterized in that obtaining the structural layout of the current page from the content within the div tags and parsing the content according to the structural layout of the current page comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout according to the positions of the matched labels within the div tags, and obtaining the business data according to the structural layout.
6. A data processing device for improving the stability and availability of a web crawler, characterized by comprising:
a structural change detection module, configured to judge, according to pre-specified features, whether a local structural change has occurred in the current page;
a parsing module, configured to, if no structural change has occurred, obtain the structural layout of the current page and parse the content of the current page according to the structural layout of the current page;
a field adaptive adjustment module, configured to adaptively map, according to pre-configured mapping rules, the business field names obtained by parsing, and to store the result in a storage area.
7. The device according to claim 5, characterized in that the structural change detection module is specifically configured to: compare the pre-specified features one by one with the corresponding tags of the current page and, if they are inconsistent, consider that a local structural change has occurred in the current page.
8. The device according to claim 5, characterized in that the parsing module is specifically configured to:
obtain the HTML file of the current page;
extract from the HTML file the content within Table tags and the content within div tags;
obtain the structural layout of the current page from the content within the Table tags, and parse the content according to the structural layout of the current page;
obtain the structural layout of the current page from the content within the div tags, and parse the content according to the structural layout of the current page.
9. The device according to claim 8, characterized in that, in the parsing module, obtaining the structural layout of the current page from the content within the Table tags and parsing the content according to the structural layout of the current page comprises:
detecting the title portion within the Table tags;
extracting the multi-dimensional information within the Table tags other than the title portion;
judging the structural layout according to the extracted multi-dimensional information;
obtaining the business data according to the structural layout.
10. The device according to claim 8, characterized in that, in the parsing module, obtaining the structural layout of the current page from the content within the div tags and parsing the content according to the structural layout of the current page comprises: obtaining from the div tags the labels that match known business field names, judging the structural layout according to the positions of the matched labels within the div tags, and obtaining the business data according to the structural layout.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611243842.5A CN106777281B (en) | 2016-12-29 | 2016-12-29 | Data processing method and device for improving stability and usability of web crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611243842.5A CN106777281B (en) | 2016-12-29 | 2016-12-29 | Data processing method and device for improving stability and usability of web crawler |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106777281A (en) | 2017-05-31 |
CN106777281B (en) | 2020-07-17 |
Family
ID=58928579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611243842.5A Active CN106777281B (en) | 2016-12-29 | 2016-12-29 | Data processing method and device for improving stability and usability of web crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106777281B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6938170B1 (en) * | 2000-07-17 | 2005-08-30 | International Business Machines Corporation | System and method for preventing automated crawler access to web-based data sources using a dynamic data transcoding scheme |
CN101576891A (en) * | 2008-05-05 | 2009-11-11 | 北京瑞佳晨科技有限公司 | Method for analyzing web page form object nodes |
CN102254009A (en) * | 2011-07-15 | 2011-11-23 | 福建星网锐捷通讯股份有限公司 | Method for extracting data of webpage table |
CN103198069A (en) * | 2012-01-06 | 2013-07-10 | 株式会社理光 | Method and device for extracting relational table |
CN103942335A (en) * | 2014-05-07 | 2014-07-23 | 武汉大学 | Construction method of uninterrupted crawler system oriented to web page structure change |
CN104598462A (en) * | 2013-10-30 | 2015-05-06 | 深圳市国信互联科技有限公司 | Method and device for extracting structural data |
CN104767757A (en) * | 2015-04-17 | 2015-07-08 | 国家电网公司 | Multiple-dimension security monitoring method and system based on WEB services |
CN105975395A (en) * | 2016-05-30 | 2016-09-28 | 深圳市华傲数据技术有限公司 | Website state reconnaissance method and device |
Non-Patent Citations (2)
Title |
---|
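吴信才 (Wu Xincai): 《不动产登记信息系统实用指南》 (Practical Guide to Real Estate Registration Information Systems), 31 October 2016 * |
胡配祥 (Hu Peixiang): 《ASP.NET程序设计项目教程》 (ASP.NET Programming Project Tutorial), 31 July 2016 * |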
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463669A (en) * | 2017-08-03 | 2017-12-12 | 深圳市华傲数据技术有限公司 | The method and device for the web data that parsing reptile crawls |
CN107463669B (en) * | 2017-08-03 | 2020-05-05 | 深圳市华傲数据技术有限公司 | Method and device for analyzing webpage data crawled by crawler |
CN108647279A (en) * | 2018-05-03 | 2018-10-12 | 山东浪潮通软信息科技有限公司 | Sheet disposal method, apparatus, medium and storage control based on field multiplexing |
CN109657125A (en) * | 2018-12-14 | 2019-04-19 | 平安城市建设科技(深圳)有限公司 | Data processing method, device, equipment and storage medium based on web crawlers |
CN109948018A (en) * | 2019-01-10 | 2019-06-28 | 北京大学 | A kind of Web structural data rapid extracting method and system |
CN109948018B (en) * | 2019-01-10 | 2021-05-25 | 北京大学 | Method and system for rapidly extracting Web structured data |
Also Published As
Publication number | Publication date |
---|---|
CN106777281B (en) | 2020-07-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106777259A (en) | The method and device of structured message in adaptive decimation HTML Table labels | |
CN110795919B (en) | Form extraction method, device, equipment and medium in PDF document | |
CN106709032B (en) | Method and device for extracting structured information in electronic form document | |
CN106156239B (en) | Table extraction method and device | |
CN110968667B (en) | Periodical and literature table extraction method based on text state characteristics | |
CN106777281A (en) | For improving web crawlers stability, the data processing method of availability and device | |
CN108664574B (en) | Information input method, terminal equipment and medium | |
US20030140311A1 (en) | Method for content mining of semi-structured documents | |
CN111582169B (en) | Image recognition data error correction method, device, computer equipment and storage medium | |
CN102314497B (en) | Method and equipment for identifying body contents of markup language files | |
CN107016001A (en) | A kind of data query method and device | |
CN101727461A (en) | Method for extracting content of web page | |
CN110427488B (en) | Document processing method and device | |
CN102737012A (en) | Text information comparison method and system | |
CN109492177B (en) | web page blocking method based on web page semantic structure | |
CN103440232A (en) | Scientific paper standardization automatic detecting and editing method | |
Klampfl et al. | An unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles | |
CN107844468A (en) | The cross-page recognition methods of form data, electronic equipment and computer-readable recording medium | |
CN109165373B (en) | Data processing method and device | |
CN107315989A (en) | For the text recognition method and device of medical information picture | |
US9280528B2 (en) | Method and system for processing and learning rules for extracting information from incoming web pages | |
CN114201620A (en) | Method, apparatus and medium for mining PDF tables in PDF file | |
CN110738050A (en) | Text recombination method, device and medium based on word segmentation and named entity recognition | |
CN110390037B (en) | Information classification method, device and equipment based on DOM tree and storage medium | |
CN107145947B (en) | Information processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder | ||
Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd. Address before: 518000 units J and K, 12 / F, block B, building 7, Baoneng Science Park, Qinghu Industrial Zone, Qingxiang Road, Longhua New District, Shenzhen City, Guangdong Province Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd. |