CN106777259A

CN106777259A - Method and device for adaptively extracting structured information from HTML Table tags

Info

Publication number: CN106777259A
Application number: CN201611234018.3A
Authority: CN
Inventors: 张军; 贾西贝
Original assignee: Shenzhen Huaao Data Technology Co Ltd
Current assignee: Shenzhen Huaao Data Technology Co Ltd
Priority date: 2016-12-28
Filing date: 2016-12-28
Publication date: 2017-05-31

Abstract

The invention belongs to the technical field of data processing, and in particular relates to a method and device for adaptively extracting structured information from HTML Table tags. The method provided by the invention comprises: detecting the title part in a Table tag; extracting the multi-dimensional information in the Table tag other than the title part; judging the table layout according to the extracted multi-dimensional information; and, according to the table layout, post-processing the direct content in the multi-dimensional information to obtain structured data. When extracting structured information from web pages, the method and device provided by the invention offer better universality and reliability.

Description

Method and device for adaptively extracting structured information from HTML Table tags

Technical field

The present invention relates to the technical field of data processing, and in particular to a method and device for adaptively extracting structured information from HTML Table tags.

Background art

With the popularization and development of the Internet, many kinds of information from e-commerce websites, portal websites, blogs, microblogs and the like are published online. Through the Internet, people can collect massive amounts of information and analyze and count it to obtain the information they need.

However, most of this information on web pages is semi-structured data. Tables on web pages are edited with HTML Table tags. Although the display on the page is fairly regular, the underlying tags and data are irregular and sometimes very chaotic, so the title part is mixed together with the business data and the business data cannot be extracted quickly and accurately.

The conventional approach is to obtain the page resources in advance and then write a custom program for each group of HTML pages that share the same page structure. The flexibility of HTML and the arbitrariness of development make table styles highly variable: a table may or may not have a title or remarks, and it may be horizontal or vertical. Whenever the table structure changes, a new program has to be written. Existing methods for extracting structured data from web pages are therefore inefficient to develop and maintain, and lack universality and reliability.

Summary of the invention

In view of the defects of the prior art, the present invention provides a method and device for adaptively extracting structured information from HTML Table tags, which offer better universality and reliability when extracting structured information from web pages.

In a first aspect, the present invention provides a method for adaptively extracting structured information from HTML Table tags, comprising: detecting the title part in a Table tag; extracting the multi-dimensional information in the Table tag other than the title part; judging the table layout according to the extracted multi-dimensional information; and, according to the table layout, post-processing the direct content in the multi-dimensional information to obtain structured data.

The method provided by the invention first detects the title part in the Table tag, which removes content that does not belong to the business data and prevents useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title part and judges the table layout from that information as a whole. Because the information in the Table tag reflects the table layout, however the table in the web page changes, the new table layout can be obtained by analyzing the information in the Table tag. The method therefore does not need to know the table layout in advance and does not require a new program for HTML Tables of different structures. This solves the lack of universality in existing Table extraction algorithms and improves the reliability of the extracted data, and it is particularly practical when recognizing and extracting large-scale semi-structured data.

Preferably, detecting the title part in the Table tag comprises: detecting whether each row in the Table tag is a single merged cell; if so, the detected row belongs to the title part and the next row is detected; if not, detection of the title part stops.

Preferably, extracting the multi-dimensional information in the Table tag other than the title part comprises: extracting the multi-dimensional information in the Table tag other than the title part, splitting the merged cells in the extracted information, storing the information of each dimension in a separate two-dimensional array, and applying a special mark to the cells produced by splitting.

Preferably, judging the table layout comprises at least one of the following operations: excluding rows and columns that are not a TL according to the extracted direct content; judging the table layout according to the extracted background-color attribute distribution; judging the table layout according to whether the data types of the direct content in the same row or the same column are identical; and judging the table layout according to the th/td distribution.

Preferably, excluding rows and columns that are not a TL according to the extracted direct content comprises: checking the extracted direct content row by row and column by column; if the data type of a direct content item is a numeric string, the row or column containing it is not a TL; if the field length of a direct content item exceeds a threshold, the row or column containing it is not a TL; if several direct content items of a row or column contain given keywords, that row or column is a TL.

Preferably, judging the table layout according to the th/td distribution comprises: if a th distribution exists in the Table tag, judging the table layout according to the th distribution; if no th distribution exists in the Table tag, judging the table layout according to the td distribution.

Preferably, the method further comprises: if the table layout is judged to be vertical, transposing the table formed by the direct content into a horizontal layout.

Preferably, the method further comprises: if the table layout is judged to be multi-TL, performing a split-and-merge operation on the table formed by the direct content to convert it into a single-TL layout.

Preferably, performing the split-and-merge operation on the table formed by the direct content when the table layout is judged to be multi-TL, to convert it into a single-TL layout, comprises: comparing the direct content of the multiple TLs; for TLs with identical content, retaining only one TL row; and splicing TLs with different content into one TL row.

In a second aspect, the present invention provides a device for adaptively extracting structured information from HTML Table tags, comprising: a title part detection module for detecting the title part in a Table tag; an information extraction module for extracting the multi-dimensional information in the Table tag other than the title part; a table layout judgment module for judging the table layout according to the extracted multi-dimensional information; and a post-processing module for post-processing, according to the table layout, the direct content in the multi-dimensional information to obtain structured data.

The device provided by the invention first detects the title part in the Table tag, which removes content that does not belong to the business data and prevents useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title part and judges the table layout from that information as a whole. Because the information in the Table tag reflects the table layout, however the table in the web page changes, the new table layout can be obtained by analyzing the information in the Table tag. The device therefore does not need to know the table layout in advance and does not require a new program for HTML Tables of different structures. This solves the lack of universality in existing Table extraction algorithms and improves the reliability of the extracted data, and it is particularly effective when recognizing and extracting large-scale semi-structured data.

Brief description of the drawings

Fig. 1 is a flow chart of the method for adaptively extracting structured information from HTML Table tags provided by an embodiment of the present invention;

Fig. 2 shows the layout of the title part, remarks part and business data part in an example table;

Fig. 3 is an example of a vertical multi-TL layout;

Fig. 4 is an example of a horizontal multi-TL layout;

Fig. 5 is an example of splitting and merging a table with a multi-TL layout;

Fig. 6 is another example of splitting and merging a table with a multi-TL layout;

Fig. 7 is an example of processing a table with a single-TL (multi-level) layout;

Fig. 8 is a structural block diagram of the device for adaptively extracting structured information from HTML Table tags provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the technical solution of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution more clearly; they are examples and do not limit the scope of protection of the invention.

It should be noted that, unless otherwise stated, technical or scientific terms used in this application have the ordinary meaning understood by those of ordinary skill in the art to which the invention belongs.

A table in a web page is defined by the <table> tag. Table rows are defined by the <tr> tag; a <tr> must sit inside a <table></table> and cannot be used alone. Each row is divided into cells, and each cell is defined by a <td> tag, which must be nested inside <tr></tr>. Like <td>, <th> must also be nested inside <tr>; <th>...</th> defines a header cell, which contains table header information. Usage is as follows:

<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Zhang San</td>
    <td>40</td>
  </tr>
</table>

The table displayed in the web page by the above code is as follows:

Name	Age
Zhang San	40

To extract the data in web page tables automatically, this embodiment provides a method for adaptively extracting structured information from HTML Table tags. As shown in Fig. 1, the method comprises:

Step S1: detect the title part in the Table tag.

As shown in Fig. 2, the title part of a table is generally one large merged cell, which may occupy one or more rows. A table may also contain a remarks part whose structure is similar to that of the title part. Once the title part and the remarks part are removed, what remains is the business data to be extracted. When a table contains a remarks part, step S1 must also detect it, in the same way as the title part.

Step S2: extract the multi-dimensional information in the Table tag other than the title part.

The multi-dimensional information includes the direct content, the th/td distribution, the class attribute distribution, the background-color attribute distribution, and so on. Direct content is the content displayed directly in the table on the web page, that is, the text inside the <table> tag, such as "name", "age", "Zhang San" and "40". The th/td distribution is the positions of the th and td tags in the table. The class attribute specifies the class name of an element in a cell, and the class attribute distribution is the positions of the class attributes in the table. The background-color attribute defines the background color of a cell, and the background-color attribute distribution is the positions of the background-color attributes in the table.
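As an illustration of step S2, the following is a minimal sketch that collects several of these dimensions per cell using Python's standard html.parser module. The names CellCollector and _style_bg and the record layout are assumptions made for this example and do not come from the patent; nested tables are not handled.

```python
from html.parser import HTMLParser


def _style_bg(style):
    """Read background-color from an inline style string, if present."""
    for decl in (style or "").split(";"):
        name, _, value = decl.partition(":")
        if name.strip().lower() == "background-color":
            return value.strip()
    return None


class CellCollector(HTMLParser):
    """Collects one record per <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows; each row is a list of cell dicts
        self._cell = None   # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            a = dict(attrs)
            self._cell = {
                "tag": tag,                      # th/td distribution
                "text": "",                      # direct content
                "class": a.get("class"),         # class attribute distribution
                "bgcolor": a.get("bgcolor") or _style_bg(a.get("style")),
                "colspan": int(a.get("colspan") or 1),
                "rowspan": int(a.get("rowspan") or 1),
            }
            self.rows[-1].append(self._cell)

    def handle_data(self, data):
        # Only text that appears inside a cell counts as direct content.
        if self._cell is not None:
            self._cell["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None


parser = CellCollector()
parser.feed(
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td class='c' style='background-color: #F7FBFE'>Zhang San</td>"
    "<td>40</td></tr></table>"
)
print([[cell["text"] for cell in row] for row in parser.rows])
```

Each dimension can then be read out of parser.rows as its own two-dimensional array, for example by taking the "tag" or "bgcolor" key of every record.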

Step S3: judge the table layout according to the extracted multi-dimensional information.

Common table layouts are divided into horizontal single-TL, horizontal multi-TL, vertical single-TL, vertical multi-TL and multi-table combinations. A TL (TitleLine) is the header line (or data header part) of a table; physically it may span several rows, but logically it is one region. It gives the title of each item of business data; for example, the first row of the business data part in Fig. 2 is a TL. A TL may be horizontal or vertical: Fig. 3 shows a vertical multi-TL layout and Fig. 4 shows a horizontal multi-TL layout.

Step S4: according to the table layout, post-process the direct content in the multi-dimensional information to obtain structured data.

Post-processing includes splitting and merging data blocks, deleting blank rows, replacing special characters, and so on.

The method provided by this embodiment first detects the title part in the Table tag, which removes content that does not belong to the business data and prevents useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title part and judges the table layout from that information as a whole. Because the information in the Table tag reflects the table layout, however the table in the web page changes, the new table layout can be obtained by analyzing the information in the Table tag. The method therefore does not need to know the table layout in advance and does not require a new program for HTML Tables of different structures. This solves the lack of universality in existing Table extraction algorithms and improves the reliability of the extracted data, and it is particularly effective when recognizing and extracting large-scale semi-structured data.

The title part and the remarks part are generally located in the first or second row of the table, and each is a single merged cell. A specific implementation of step S1 therefore comprises: detecting whether each row in the Table tag is a single merged cell; if so, the detected row belongs to the title part and the next row is detected; if not, the business data starts at this row and detection of the title part stops. For example, the code of the title part and the remarks part generally takes the following form:

<tr><td colspan='5'>2016 personnel information statistics table</td></tr>

The above code contains only one <td> tag, and colspan='5' indicates that it is a merged cell. The title part and the remarks part can therefore be recognized by detecting <td> together with colspan.
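The row-by-row check described above can be sketched as follows. This is only an illustration: it assumes each row has already been parsed into a list of cell dictionaries with an optional "colspan" key, and the function name count_title_rows is hypothetical.

```python
def count_title_rows(rows):
    """Return how many leading rows consist of a single merged cell."""
    count = 0
    for row in rows:
        is_merged = len(row) == 1 and row[0].get("colspan", 1) > 1
        if not is_merged:
            break  # the business data starts here, so detection stops
        count += 1
    return count


rows = [
    [{"text": "2016 personnel information statistics table", "colspan": 5}],
    [{"text": "Name"}, {"text": "Age"}],
    [{"text": "Zhang San"}, {"text": "40"}],
]
print(count_title_rows(rows))
```

Because the loop stops at the first row that is not a single merged cell, no fixed number of rows to skip has to be configured in advance.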

In the prior art, filtering out useless data (such as the title part and the remarks part) requires knowing the position of the useless data in advance and then hard-coding that position in the program so that the first few rows are skipped. The method of this embodiment is more general: however many rows the title part and remarks part of a table occupy, it detects them accurately and efficiently, which ensures that the business data is extracted accurately.

During data extraction, besides directly obtaining the information of ordinary cells, merged cells need special handling so that the extracted data fits the storage format and is convenient for subsequent processing. A preferred implementation of step S2 therefore comprises: extracting the multi-dimensional information in the Table tag other than the title part (and other than the remarks part, if there is one), splitting the merged cells in the extracted information, storing the information of each dimension in a separate two-dimensional array, and applying a special mark to the cells produced by splitting.

Merged cells are further divided into horizontal merges (colspan), vertical merges (rowspan) and mixed merges (colspan+rowspan). For example, extracting the direct content of <td colspan='5' bgcolor="#F7FBFE">ABC</td> gives:

ABC

{←}

The special mark "{←}" is used only in the extracted direct content. It indicates that the content of the cell is the same as the content of the cell to its left, which gives flexibility when processing the TL and when producing the final output. The extraction of the other dimensions does not need a special mark.

The extracted background-color attribute distribution is:

#F7FBFE

When a single row contains several horizontal merges (colspan), the column coordinates must also be shifted accordingly. For example, <td colspan='2'>ABC</td><td colspan='3'>DEF</td> is extracted as:

ABC

{←}

DEF

{←}

A similar method is used to extract data from vertical merges (rowspan) and mixed merges (colspan+rowspan).
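The splitting of merged cells into a two-dimensional array, including the coordinate shift, might be sketched as below. The "{←}" mark comes from the description; the "{↑}" mark for cells covered by a rowspan, the function name expand_merged_cells and the choice of mark for mixed merges are assumptions made for this example, since the description does not spell them out.

```python
LEFT = "{←}"  # mark from the description: same content as the cell to the left
UP = "{↑}"    # hypothetical mark for rowspan copies; not named in the source


def expand_merged_cells(rows):
    """Expand colspan/rowspan cells into a rectangular two-dimensional array."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Coordinate shift: skip slots already claimed by earlier spans.
            while (r, c) in grid:
                c += 1
            for dr in range(cell.get("rowspan", 1)):
                for dc in range(cell.get("colspan", 1)):
                    if dr == 0 and dc == 0:
                        value = cell["text"]   # the original cell keeps its text
                    elif dc > 0:
                        value = LEFT           # copy produced by a horizontal split
                    else:
                        value = UP             # copy produced by a vertical split
                    grid[(r + dr, c + dc)] = value
            c += cell.get("colspan", 1)
    n_rows = max(r for r, _ in grid) + 1 if grid else 0
    n_cols = max(c for _, c in grid) + 1 if grid else 0
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


row = [[{"text": "ABC", "colspan": 2}, {"text": "DEF", "colspan": 3}]]
print(expand_merged_cells(row))
```

In this sketch every slot covered by a span receives a mark, so a colspan of 3 yields the original text followed by two marks.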

Judging the table layout in step S3 is the key to the invention. Only when the table layout is known can the business data be extracted accurately and the table be converted into structured data according to that layout. Judging the table layout in step S3 includes the following operations:

(1) According to the extracted direct content, exclude rows and columns that are not a TL.

An exclusionary judgment is made according to the data type, length and keywords of the direct content in a TL. The judgment is based on the following: the length of the field name in each cell of a TL cannot exceed a threshold (for example, 50); the number of field names in a TL cannot exceed a threshold (for example, 1000); a field name is unlikely to be a purely numeric string; and common field names contain keywords such as "title", "Name", "location", "Address", "address", "type" and "remarks". A keyword library is obtained from statistics on common tables, and each row or column is checked for keywords from the library.

The specific steps for judging the table layout based on the direct content are therefore: check the extracted direct content row by row and column by column; if the data type of a direct content item is a numeric string, the row or column containing it is not a TL; if the field length of a direct content item exceeds a first threshold, the row or column containing it is not a TL; if several direct content items of a row or column contain given keywords, that row or column is a TL.

When the keyword-based judgment is used, at least two keywords must occur before the row or column is considered a TL, to keep the judgment reliable.
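The three rules of operation (1) can be sketched as a single classifier over one row or column of direct content. The keyword set below is a small illustrative stand-in for the statistically derived keyword library, and the names classify_line, KEYWORDS, MAX_FIELD_LEN and MIN_KEYWORD_HITS are assumptions made for this example.

```python
KEYWORDS = {"title", "name", "location", "address", "type", "remarks"}  # illustrative only
MAX_FIELD_LEN = 50     # per-cell length threshold; 50 is the example value in the text
MIN_KEYWORD_HITS = 2   # at least two keywords are required, per the description


def classify_line(cells):
    """Return 'not_tl', 'tl' or 'unknown' for one row or column of direct content."""
    texts = [c.strip() for c in cells if c.strip()]
    if any(t.isdigit() for t in texts):
        return "not_tl"   # a field name is unlikely to be a purely numeric string
    if any(len(t) > MAX_FIELD_LEN for t in texts):
        return "not_tl"   # a field name is unlikely to be this long
    hits = sum(1 for t in texts if any(k in t.lower() for k in KEYWORDS))
    return "tl" if hits >= MIN_KEYWORD_HITS else "unknown"


print(classify_line(["Name", "Address", "Type"]))
print(classify_line(["Zhang San", "40", "Shenzhen"]))
```

A result of 'unknown' means only that this operation could not decide; the other operations described below would then be consulted.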

(2) According to the extracted background-color attribute distribution, judge the table layout.

When a table is displayed, the background color of the TL usually differs from that of the data to make reading easier, or the odd and even data rows use alternating background colors. The background-color attribute distribution can therefore be used to judge which rows or columns are likely to be a TL, and in turn whether the table layout is horizontal or vertical.
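One possible reading of operation (2) is sketched below: if the first row has a single background color that appears nowhere else, the TL is taken to be horizontal, and likewise for the first column. This decision rule, the function name layout_by_bgcolor and the assumption of a rectangular color grid (with None where no color is set) are choices made for this example; the description does not fix them.

```python
def layout_by_bgcolor(color_grid):
    """Guess the layout from a rectangular grid of background colors."""
    if not color_grid:
        return "undetermined"
    first_row = set(color_grid[0])
    rest_rows = {c for row in color_grid[1:] for c in row}
    first_col = {row[0] for row in color_grid}
    rest_cols = {c for row in color_grid for c in row[1:]}
    # A line stands out when it has one color that the rest of the table never uses.
    row_stands_out = len(first_row) == 1 and first_row.isdisjoint(rest_rows)
    col_stands_out = len(first_col) == 1 and first_col.isdisjoint(rest_cols)
    if row_stands_out and not col_stands_out:
        return "horizontal"
    if col_stands_out and not row_stands_out:
        return "vertical"
    return "undetermined"


colors = [
    ["#F7FBFE", "#F7FBFE"],
    ["#FFFFFF", "#FFFFFF"],
    ["#EEEEEE", "#EEEEEE"],
]
print(layout_by_bgcolor(colors))
```

Alternating colors in the data rows, as in the example, do not disturb the check because only the first row or column is compared against the rest.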

(3) According to the extracted class attribute distribution, judge the table layout.

Cells with the same class attribute are usually cells of the same kind. If the class attributes of the cells of every row are identical, the table layout is horizontal; if the class attributes of the cells of every column are identical, the table layout is vertical. The class attribute distribution can therefore be used to judge whether the layout is horizontal or vertical.

(4) According to whether the data types of the direct content in the same row or the same column are identical, judge the table layout.

In the business data of a table, excluding the TL, the cells under each TL field name should all have the same type as long as they are not empty. (This method distinguishes only 'purely numeric string', 'date-time string' and 'string with no obvious features'.) For example, the table in Fig. 2 has a horizontal layout. In its business data part, apart from the TL in the first row, the cells of each column share one data type: the column under the field name "serial number" contains only purely numeric strings, the column under "executing court" contains only strings with no obvious features, and the column under "execution case number" contains only strings with no obvious features. In short, apart from the TL row, the data type within each column is uniform.

Based on this property, the data types within each row are checked. If the data types within every row of the table are identical (that is, all cells of a row are purely numeric strings, or all are date-time strings, or all are strings with no obvious features), the table has a vertical layout. The data types within each column are checked in the same way. If the data types within every column of the table are identical, the table has a horizontal layout.

Some cells may be empty. To keep these cells from affecting the result, empty cells are left out of the check when rows and columns are examined.

The business data part of a table is usually large, and checking every row and column would reduce efficiency. A short-circuit judgment can therefore be used: as soon as the result for a new row or column rules out a layout, the check for that layout stops.
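The type-consistency check of operation (4), including the handling of empty cells and the short-circuit, could be sketched as follows. The date pattern, the three type labels, the assumption that a single TL occupies the first row or first column, and the names cell_type, lines_uniform and layout_by_type are all illustrative choices, not part of the patent.

```python
import re

# Assumed date-time shapes such as 2016-12-28 or 2016/12/28 10:30.
DATE_RE = re.compile(r"^\d{4}[-/.]\d{1,2}[-/.]\d{1,2}([ T]\d{1,2}:\d{2}(:\d{2})?)?$")


def cell_type(text):
    """Classify a cell as 'numeric', 'datetime' or 'other'; None for empty cells."""
    text = text.strip()
    if not text:
        return None
    if text.isdigit():
        return "numeric"
    if DATE_RE.match(text):
        return "datetime"
    return "other"


def lines_uniform(lines):
    """True if every line holds a single data type, ignoring empty cells."""
    # all() stops at the first mixed line, which gives the short-circuit behaviour.
    return all(len({cell_type(c) for c in line} - {None}) <= 1 for line in lines)


def layout_by_type(grid):
    """Guess the layout of a rectangular grid whose TL is its first row or column."""
    columns = [list(col) for col in zip(*grid[1:])]  # drop a TL in the first row
    rows = [row[1:] for row in grid]                 # drop a TL in the first column
    cols_ok, rows_ok = lines_uniform(columns), lines_uniform(rows)
    if cols_ok and not rows_ok:
        return "horizontal"
    if rows_ok and not cols_ok:
        return "vertical"
    return "undetermined"


table = [
    ["No.", "Court", "Date"],
    ["1", "Court A", "2016-01-02"],
    ["2", "", "2016-03-04"],
]
print(layout_by_type(table))
```

When both checks pass or both fail, the sketch reports 'undetermined' rather than guessing, leaving the decision to the other operations.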

(5) According to the th/td distribution, judge the table layout.

The number of cells in a TL is less than or equal to the number of cells in the other rows, and the number of cells in non-TL rows should be fairly uniform. The number of cells in every row and column is counted from the th/td distribution; a row or column with markedly fewer cells may be a TL. The table layout is then obtained from whether the TL is horizontal or vertical and from the number of TLs.

th is generally used to define headers, which correspond to field names such as 'name' and 'age'. The arrangement of th can also be horizontal or vertical; for example, a horizontal arrangement is <th colspan='3'>List of results</th> and a vertical arrangement is <th rowspan='3'>List of results</th>.

td can be used to define ordinary cells, and it can also be used to define headers.

When both th tags and td tags are present, the table layout is judged according to the th distribution. Many tables do not specify th, however, and in that case the table layout is judged according to the td distribution.
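A minimal sketch of the th-first rule in operation (5) follows. It assumes the th/td distribution is available as a rectangular grid of tag names, and it uses the simple rule that a first row made entirely of th suggests a horizontal TL while a first column made entirely of th suggests a vertical one. That rule and the name layout_by_th are assumptions made for this example.

```python
def layout_by_th(tag_grid):
    """Guess the layout from a rectangular grid of 'th'/'td' tag names."""
    if not tag_grid or not any("th" in row for row in tag_grid):
        return "undetermined"  # no th at all: fall back to td-based evidence
    first_row_th = all(tag == "th" for tag in tag_grid[0])
    first_col_th = all(row[0] == "th" for row in tag_grid)
    if first_row_th and not first_col_th:
        return "horizontal"
    if first_col_th and not first_row_th:
        return "vertical"
    return "undetermined"


print(layout_by_th([["th", "th"], ["td", "td"]]))
print(layout_by_th([["th", "td"], ["th", "td"]]))
```

When the function returns 'undetermined' because no th is present, a td-based check such as comparing the cell counts of rows and columns would take over.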

The five methods for judging the table layout described above can be combined in various ways according to actual needs, which improves the accuracy of the judgment. In addition, the method of this embodiment can recognize tables that contain multiple TLs, which improves the reliability of the extracted data.

When the table layout is vertical, the table formed by the direct content must also be transposed into a horizontal layout.
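Transposing a vertical-layout table amounts to swapping rows and columns of the direct-content array. A minimal sketch, assuming a rectangular array and a hypothetical function name transpose:

```python
def transpose(grid):
    """Turn a vertical-layout grid into a horizontal one by swapping rows and columns."""
    return [list(col) for col in zip(*grid)]


vertical = [
    ["Name", "Zhang San"],
    ["Age", "40"],
]
print(transpose(vertical))
```

Any special marks left by merged cells are moved along with their cells; re-orienting their meaning after the swap is not handled in this sketch.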

TLs are further divided into single-level TLs and multi-level TLs, but unless otherwise stated both are simply called TL. Fig. 2 shows a table with one TL that is single-level. Fig. 7 shows a table with one TL that is multi-level (it consists of several rows with a parent-child relationship between levels). The field names in these rows must be merged to produce a single row of field names. As shown in Fig. 7, the TL of the original table has two parts: the left part spans several rows (multi-level) and the right part is a single row. The first level of the multi-level part is a merged cell whose field name is 'basic information', and the second level consists of the fields 'name', 'age' and 'sex'. The final output is a single-level TL with the structure "basic information_name", "basic information_age", "basic information_sex", "other field A", "other field B".
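Merging a multi-level TL into one row of field names can be sketched as below, using the Fig. 7 example from the text. The sketch assumes the TL rows have already been expanded into equal-length lists in which "{←}" marks horizontal copies; the "{↑}" mark for vertical copies and the function name flatten_header are assumptions made for this example.

```python
LEFT = "{←}"  # mark from the description: same content as the cell to the left
UP = "{↑}"    # hypothetical mark for cells covered by a rowspan


def flatten_header(levels):
    """Merge multi-level TL rows into a single row of field names."""
    resolved = []
    for level in levels:
        row = []
        for i, text in enumerate(level):
            if text == LEFT and i > 0:
                text = row[i - 1]  # inherit the parent name from the left
            row.append(text)
        resolved.append(row)
    names = []
    for column in zip(*resolved):
        # Join the levels top-down, skipping empty slots and vertical copies.
        parts = [p for p in column if p and p != UP]
        names.append("_".join(parts))
    return names


levels = [
    ["basic information", LEFT, LEFT, "other field A", "other field B"],
    ["name", "age", "sex", UP, UP],
]
print(flatten_header(levels))
```

Joining with an underscore mirrors the "basic information_name" form given in the text.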

When the table layout is multi-TL, a split-and-merge operation must also be performed on the table formed by the direct content to convert it into a single-TL layout, so that it meets the format requirements of structured data. The split-and-merge operation comprises: comparing the direct content of the multiple TLs; for TLs with identical content, retaining only one TL row, as shown in Fig. 5; and splicing TLs with different content into one TL row, as shown in Fig. 6.
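The split-and-merge operation might be sketched as follows, assuming the table has already been cut into blocks, each a pair of a TL and its data rows. Stacking the data rows under identical TLs follows the text directly. How data rows are aligned when different TLs are spliced is not stated in the description, so the position-wise join with empty padding used here is an assumption, as is the function name merge_blocks.

```python
def merge_blocks(blocks):
    """Merge (tl, data_rows) blocks into one single-TL table."""
    if not blocks:
        return []
    tls = [tl for tl, _ in blocks]
    if all(tl == tls[0] for tl in tls):
        # Identical TLs: keep one TL row and stack the data rows beneath it.
        merged = [list(tls[0])]
        for _, data in blocks:
            merged.extend(list(row) for row in data)
        return merged
    # Different TLs: splice them into one TL row and join data rows by position.
    merged = [[name for tl in tls for name in tl]]
    depth = max(len(data) for _, data in blocks)
    for i in range(depth):
        row = []
        for tl, data in blocks:
            # Pad with empty strings when a block has fewer data rows (assumption).
            row.extend(data[i] if i < len(data) else [""] * len(tl))
        merged.append(row)
    return merged


same = [(["Name", "Age"], [["A", "1"]]), (["Name", "Age"], [["B", "2"]])]
different = [(["Name", "Age"], [["A", "1"], ["B", "2"]]), (["City"], [["X"]])]
print(merge_blocks(same))
print(merge_blocks(different))
```

Either branch returns a table whose first row is the single remaining TL, which is the form the later structured-data output expects.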

Finally, for merged cells, the special marks can be adjusted according to business needs. For example,

ABC

{←}

can be adjusted to the following form:

ABC
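The final clean-up can be sketched as one pass over the direct-content array that adjusts the special marks, replaces special characters and deletes blank rows. Whether a mark is blanked or filled with the value on its left depends on business needs, so both are offered through a flag. The function name clean_rows, the fill_merged flag and the choice of the non-breaking space as the special character to replace are assumptions made for this example.

```python
LEFT = "{←}"  # mark from the description: same content as the cell to the left


def clean_rows(grid, fill_merged=False):
    """Adjust special marks, replace special characters and drop blank rows."""
    cleaned = []
    for row in grid:
        out = []
        for text in row:
            if text == LEFT:
                # Either repeat the value to the left or leave the cell empty.
                text = out[-1] if (fill_merged and out) else ""
            out.append(text.replace("\xa0", " ").strip())
        if any(out):  # keep only rows that still contain some content
            cleaned.append(out)
    return cleaned


grid = [["ABC", LEFT], ["", ""], ["a\xa0b", " c "]]
print(clean_rows(grid))
print(clean_rows(grid, fill_merged=True))
```

With the default setting the mark is simply removed, which matches the adjusted form shown above; filling is an alternative when each output column needs its own value.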

The method described above extracts a single table. When a web page contains several Table tags (several tables), the method is simply applied repeatedly to extract the table corresponding to each Table tag, and the extraction results are then merged according to a predetermined rule.

Based on the same inventive concept as the above method, this embodiment also provides a device for adaptively extracting structured information from HTML Table tags. As shown in Fig. 8, the device comprises: a title part detection module for detecting the title part in a Table tag; an information extraction module for extracting the multi-dimensional information in the Table tag other than the title part; a table layout judgment module for judging the table layout according to the extracted multi-dimensional information; and a post-processing module for post-processing, according to the table layout, the direct content in the multi-dimensional information to obtain structured data.

The device provided by this embodiment first detects the title part in the Table tag, which removes content that does not belong to the business data and prevents useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title part and judges the table layout from that information as a whole. Because the information in the Table tag reflects the table layout, however the table in the web page changes, the new table layout can be obtained by analyzing the information in the Table tag. The device therefore does not need to know the table layout in advance and does not require a new program for HTML Tables of different structures. This solves the lack of universality in existing Table extraction algorithms and improves the reliability of the extracted data, and it is particularly effective when recognizing and extracting large-scale semi-structured data.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the invention, and they shall all be covered by the claims and the description of the invention.

Claims

1. A method for adaptively extracting structured information from HTML Table tags, characterized by comprising:

detecting the title part in a Table tag;

extracting the multi-dimensional information in the Table tag other than the title part;

judging the table layout according to the extracted multi-dimensional information;

according to the table layout, post-processing the direct content in the multi-dimensional information to obtain structured data.

2. The method according to claim 1, characterized in that detecting the title part in the Table tag comprises: detecting whether each row in the Table tag is a single merged cell; if so, the detected row belongs to the title part and the next row is detected; if not, detection of the title part stops.

3. The method according to claim 1, characterized in that extracting the multi-dimensional information in the Table tag other than the title part comprises: extracting the multi-dimensional information in the Table tag other than the title part, splitting the merged cells in the extracted information, storing the information of each dimension in a separate two-dimensional array, and applying a special mark to the cells produced by splitting.

4. The method according to claim 1, characterized in that judging the table layout comprises at least one of the following operations:

according to the extracted direct content, excluding rows and columns that are not a TL;

according to the extracted background-color attribute distribution, judging the table layout;

according to whether the data types of the direct content in the same row or the same column are identical, judging the table layout;

according to the th/td distribution, judging the table layout.

5. The method according to claim 4, characterized in that excluding rows and columns that are not a TL according to the extracted direct content comprises:

checking the extracted direct content row by row and column by column;

if the data type of a direct content item is a numeric string, the row or column containing the direct content item is not a TL;

if the field length of a direct content item exceeds a threshold, the row or column containing the direct content item is not a TL;

if several direct content items of a row or column contain given keywords, that row or column is a TL.

6. The method according to claim 4, characterized in that judging the table layout according to the th/td distribution comprises: if a th distribution exists in the Table tag, judging the table layout according to the th distribution; if no th distribution exists in the Table tag, judging the table layout according to the td distribution.

7. The method according to claim 4, characterized by further comprising: if the table layout is judged to be vertical, transposing the table formed by the direct content into a horizontal layout.

8. The method according to claim 4, characterized by further comprising: if the table layout is judged to be multi-TL, performing a split-and-merge operation on the table formed by the direct content to convert it into a single-TL layout.

9. The method according to claim 8, characterized in that, if the table layout is judged to be multi-TL, performing the split-and-merge operation on the table formed by the direct content to convert it into a single-TL layout comprises:

comparing the direct content of the multiple TLs;

for TLs with identical content, retaining only one TL row;

splicing TLs with different content into one TL row.

10. A device for adaptively extracting structured information from HTML Table tags, characterized by comprising:

a title part detection module for detecting the title part in a Table tag;

an information extraction module for extracting the multi-dimensional information in the Table tag other than the title part;

a table layout judgment module for judging the table layout according to the extracted multi-dimensional information;

a post-processing module for post-processing, according to the table layout, the direct content in the multi-dimensional information to obtain structured data.