CN106777259A - Method and device for adaptively extracting structured information from HTML Table tags - Google Patents
- Publication number
- CN106777259A (application CN201611234018.3A)
- Authority
- CN
- China
- Prior art keywords
- layout
- labels
- row
- content
- title division
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Abstract
The invention belongs to the technical field of data processing and relates to a method and device for adaptively extracting structured information from HTML Table tags. The method comprises: detecting the title part in a Table tag; extracting the multi-dimensional information in the Table tag other than the title part; judging the table layout from the extracted multi-dimensional information; and, according to the table layout, post-processing the direct content in the multi-dimensional information to obtain structured data. When extracting structured information from web pages, the method and device offer better generality and reliability.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for adaptively extracting structured information from HTML Table tags.
Background
With the spread and development of the Internet, all kinds of information from e-commerce sites, portals, blogs and microblogs is published online. People can collect massive amounts of information over the Internet and analyze and aggregate it to obtain the information they need.
However, most of this information on web pages is semi-structured data. Tables on web pages are authored with HTML Table tags; although they look regular when rendered, the underlying tags and data are irregular and often chaotic, so the title part is mixed in with the business data and the business data cannot be extracted quickly and accurately.
The conventional approach is to fetch the page resources in advance and then write a custom program for each group of HTML pages that share the same page structure. The flexibility of HTML and the arbitrariness of how pages are developed make table styles vary endlessly: a table may or may not have a title or remarks, and it may be laid out horizontally or vertically. Whenever the table structure changes, a new program has to be written. Existing methods for extracting structured data from web pages are therefore inefficient to develop and maintain, and lack generality and reliability.
Summary of the invention
In view of the above defects in the prior art, the present invention provides a method and device for adaptively extracting structured information from HTML Table tags that offer better generality and reliability when extracting structured information from web pages.
In a first aspect, the present invention provides a method for adaptively extracting structured information from HTML Table tags, comprising: detecting the title part in a Table tag; extracting the multi-dimensional information in the Table tag other than the title part; judging the table layout from the extracted multi-dimensional information; and, according to the table layout, post-processing the direct content in the multi-dimensional information to obtain structured data.
The method provided by the present invention first detects the title part in the Table tag, which excludes content that does not belong to the business data and keeps useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title part and judges the table layout from that information taken together. Because the information in the Table tag reflects the table layout, the new layout can be obtained by analyzing that information however the table in the web page has changed. The method therefore does not need to know the table layout in advance and does not require a new program for each differently structured HTML Table. This addresses the lack of generality in existing Table extraction algorithms and improves the reliability of the extracted data, which is especially practical when recognizing and extracting semi-structured data at scale.
Preferably, detecting the title part in the Table tag comprises: checking, row by row, whether each row of the Table tag is a single merged cell; if so, the row belongs to the title part and the next row is checked; if not, title detection stops.
Preferably, extracting the multi-dimensional information in the Table tag other than the title part comprises: extracting that information, splitting the merged cells in the extracted information, storing the information of each dimension in a separate two-dimensional array, and applying a special marker to the cells produced by splitting.
Preferably, judging the table layout comprises at least one of the following operations: excluding, according to the extracted direct content, rows and columns that are not a TL; judging the table layout according to the extracted background-color attribute distribution; judging the table layout according to whether the data types of the direct content in the same row or the same column are identical; and judging the table layout according to the th/td distribution.
Preferably, excluding rows and columns that are not a TL according to the extracted direct content comprises: checking the extracted direct content row by row and column by column; if the data type of a direct content item is a numeric string, the row or column containing it is not a TL; if the field length of a direct content item exceeds a threshold, the row or column containing it is not a TL; and if several direct content items in a row or column contain given keywords, that row or column is a TL.
Preferably, judging the table layout according to the th/td distribution comprises: if th tags are present in the Table tag, judging the table layout according to the th distribution; if no th tags are present, judging the table layout according to the td distribution.
Preferably, the method further comprises: if the table layout is judged to be vertical, transposing the table formed by the direct content into a horizontal layout.
Preferably, the method further comprises: if the table layout is judged to be multi-TL, performing a split-and-merge operation on the table formed by the direct content to convert it to a single-TL layout.
Preferably, performing the split-and-merge operation to convert a multi-TL table to a single-TL layout comprises: comparing the direct content of the TLs; keeping only one TL row for TLs with identical content; and splicing TLs with different content into a single TL row.
In a second aspect, the present invention provides a device for adaptively extracting structured information from HTML Table tags, comprising: a title detection module for detecting the title part in a Table tag; an information extraction module for extracting the multi-dimensional information in the Table tag other than the title part; a table layout judgment module for judging the table layout from the extracted multi-dimensional information; and a post-processing module for post-processing, according to the table layout, the direct content in the multi-dimensional information to obtain structured data.
The device provided by the present invention first detects the title part in the Table tag, which excludes content that does not belong to the business data and keeps useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title part and judges the table layout from that information taken together. Because the information in the Table tag reflects the table layout, the new layout can be obtained by analyzing that information however the table in the web page has changed. The device therefore does not need to know the table layout in advance and does not require a new program for each differently structured HTML Table. This addresses the lack of generality in existing Table extraction algorithms and improves the reliability of the extracted data, which is especially effective when recognizing and extracting semi-structured data at scale.
Brief description of the drawings
Fig. 1 is a flow chart of the method for adaptively extracting structured information from HTML Table tags provided by an embodiment of the present invention;
Fig. 2 shows the layout of the title part, remarks part and business data part in an example table;
Fig. 3 is an example of a vertical multi-TL layout;
Fig. 4 is an example of a horizontal multi-TL layout;
Fig. 5 is an example of splitting and merging a table with a multi-TL layout;
Fig. 6 is another example of splitting and merging a table with a multi-TL layout;
Fig. 7 is an example of processing a table with a single-TL (multi-level) layout;
Fig. 8 is a structural block diagram of the device for adaptively extracting structured information from HTML Table tags provided by an embodiment of the present invention.
Detailed description
Embodiments of the technical solution of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution more clearly; they are examples and do not limit the scope of protection of the present invention.
It should be noted that, unless otherwise stated, the technical and scientific terms used in this application have the ordinary meaning understood by a person of ordinary skill in the art to which the invention belongs.
A table in a web page is defined by the <table> tag. Table rows are defined by the <tr> tag; <tr> must sit inside <table></table> and cannot be used on its own. Each row is divided into cells, each defined by a <td> tag, which must be nested inside <tr></tr>. Like <td>, <th> must also be nested inside <tr>; <th>...</th> defines a header cell and holds table header information. For example:
<table>
<tr>
<th>Name</th>
<th>Age</th>
</tr>
<tr>
<td>Zhang San</td>
<td>40</td>
</tr>
</table>
The above code renders in a web page as the following table:
Name | Age |
Zhang San | 40 |
To extract the data in web page tables automatically, this embodiment provides a method for adaptively extracting structured information from HTML Table tags which, as shown in Fig. 1, comprises:
Step S1: detect the title part in the Table tag.
As shown in Fig. 2 form caption part is generally all a big Merge Cells, it may be possible to a line or multirow, table
Lattice can also include remarks section, and the structure of remarks section is similar with title division, remove title division and the remarks portion of form
Point, remainder is exactly the business datum for needing to extract.When there is remarks section in form, detection is also needed in step S1 standby
Note part, detection mode is identical with the detection mode of title division.
Step S2: extract the multi-dimensional information in the Table tag other than the title part.
The multi-dimensional information includes the direct content, the th/td distribution, the class attribute distribution, the background-color attribute distribution, and so on. The direct content is the content displayed directly in the table on the web page, that is, the text inside the <table> tag, such as "Name", "Age", "Zhang San" and "40". The th/td distribution is the positions of the th and td tags within the table. The class attribute specifies the class name of the element in a cell, and the class attribute distribution is the positions of the class attributes within the table. The background-color attribute defines the background color of a cell, and the background-color attribute distribution is the positions of the background-color attributes within the table.
Step S3: judge the table layout from the extracted multi-dimensional information.
Common table layouts are horizontal single-TL, horizontal multi-TL, vertical single-TL, vertical multi-TL, and multi-table combinations. A TL (TitleLine) is the header line (the header part of the data); physically it may occupy several rows, but logically it is one region. It gives the name of each item of business data; for example, the first row of the business data part in Fig. 2 is a TL. A TL may be horizontal or vertical: Fig. 3 shows a vertical multi-TL layout and Fig. 4 a horizontal multi-TL layout.
Step S4: according to the table layout, post-process the direct content in the multi-dimensional information to obtain structured data.
Post-processing includes splitting and merging data blocks, deleting blank rows, replacing special characters, and so on.
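Two of the post-processing operations named above, deleting blank rows and replacing special characters, can be sketched as follows. This is a minimal sketch only: the function name `post_process` and the choice of which characters count as "special" (the non-breaking space and the ideographic space) are assumptions made for illustration, not details given in the description.

```python
# Characters treated as "special" here are an assumption for illustration.
SPECIAL_CHARS = {"\xa0": " ", "\u3000": " "}


def post_process(grid):
    """Normalize special characters, trim cells, and drop rows that are entirely blank."""
    cleaned = []
    for row in grid:
        cells = []
        for cell in row:
            for old, new in SPECIAL_CHARS.items():
                cell = cell.replace(old, new)
            cells.append(cell.strip())
        if any(cells):  # keep the row only if at least one cell has content
            cleaned.append(cells)
    return cleaned


table = [["Name", "Age"], ["\xa0", ""], [" Zhang\xa0San ", "40"]]
result = post_process(table)
```

Here the middle row contains only a non-breaking space and an empty string, so it is dropped, and the name in the last row is normalized to use an ordinary space.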
The method provided by this embodiment first detects the title part in the Table tag, which excludes content that does not belong to the business data and keeps useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title part and judges the table layout from that information taken together. Because the information in the Table tag reflects the table layout, the new layout can be obtained by analyzing that information however the table in the web page has changed. The method therefore does not need to know the table layout in advance and does not require a new program for each differently structured HTML Table. This addresses the lack of generality in existing Table extraction algorithms and improves the reliability of the extracted data, which is especially effective when recognizing and extracting semi-structured data at scale.
The title part and remarks part usually sit in the first or second row of the table and consist of a single merged cell. A specific implementation of step S1 therefore comprises: checking, row by row, whether each row of the Table tag is a single merged cell; if so, the row belongs to the title part and the next row is checked; if not, the business data begins at that row and title detection stops. For example, the code of a title part or remarks part usually has the following form:
<tr><td colspan='5'>2016 personnel information statistics table</td></tr>
The above code contains only one <td> tag, and colspan='5' shows that it is a merged cell, so the title part and remarks part can be recognized by checking <td> together with colspan.
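The row-by-row title check of step S1 can be sketched with Python's standard `html.parser` module. This is a minimal sketch under stated assumptions: the names `RowCollector`, `is_merged_single_cell` and `count_title_rows` are hypothetical, nested tables are not handled, and a remarks part at the end of the table would need the same check applied from the bottom.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect each <tr> as a list of cells with tag name, attributes and text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = {"tag": tag, "attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def is_merged_single_cell(row):
    """A row is treated as a title row if it is one cell spanning several columns."""
    if len(row) != 1:
        return False
    return int(row[0]["attrs"].get("colspan", "1")) > 1


def count_title_rows(rows):
    """Scan from the top and stop at the first row that is not a single merged cell."""
    count = 0
    for row in rows:
        if not is_merged_single_cell(row):
            break
        count += 1
    return count


html = """
<table>
  <tr><td colspan='5'>2016 personnel statistics</td></tr>
  <tr><td>No.</td><td>Name</td><td>Age</td><td>Sex</td><td>Notes</td></tr>
  <tr><td>1</td><td>Zhang San</td><td>40</td><td>M</td><td></td></tr>
</table>
"""
collector = RowCollector()
collector.feed(html)
title_rows = count_title_rows(collector.rows)
```

In this example only the first row is a single cell with a colspan greater than one, so `title_rows` is 1 and the business data starts at the second row.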
In the prior art, filtering out useless data (such as the title part and remarks part) requires knowing its position in advance and then hard-coding that position in the program to skip the first few rows. The method of this embodiment is more general: however many rows the title part and remarks part occupy, it detects them accurately and efficiently, which ensures that the business data is extracted accurately.
When extracting data, besides reading the information of ordinary cells directly, merged cells need special handling so that the extracted data fits the storage format and is convenient for later processing. A preferred implementation of step S2 therefore comprises: extracting the multi-dimensional information in the Table tag other than the title part (and other than the remarks part, if there is one), splitting the merged cells in the extracted information, storing the information of each dimension in a separate two-dimensional array, and applying a special marker to the cells produced by splitting.
Merged cells are further divided into horizontal merges (colspan), vertical merges (rowspan) and mixed merges (colspan + rowspan). For example, extracting the direct content of <td colspan='5' bgcolor="#F7FBFE">ABC</td> gives:
ABC | {←} | {←} | {←} | {←} |
The special marker "{←}" is specific to the extracted direct content. It indicates that the content of the cell is the same as that of the cell to its left, which gives flexibility in handling the TL and in producing the final output. The other dimensions are extracted without any special marker.
The extracted background-color attribute distribution is:
#F7FBFE | #F7FBFE | #F7FBFE | #F7FBFE | #F7FBFE |
When a single row contains several horizontal merges (colspan), the resulting shift in column coordinates must also be handled. For example:
<td colspan='2'>ABC</td><td colspan='3'>DEF</td>
ABC | {←} | DEF | {←} | {←} |
Vertical merges (rowspan) and mixed merges (colspan + rowspan) are extracted in a similar way.
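The splitting of merged cells into a rectangular two-dimensional array, including the column shift caused by several colspans in one row, can be sketched as follows. This is a minimal sketch: cells are given as hypothetical `(text, colspan, rowspan)` tuples, the `"{←}"` marker is the one shown in the description, and the `"{↑}"` marker for vertically merged cells is an assumption introduced here because the description does not show one.

```python
LEFT = "{←}"  # marker from the description: same content as the cell to the left
UP = "{↑}"    # hypothetical marker for cells covered by a rowspan (an assumption)


def expand_spans(rows):
    """Expand rows of (text, colspan, rowspan) cells into a rectangular 2-D list."""
    grid = []
    pending = {}  # (row, col) -> marker for positions reserved by an earlier rowspan
    for r, row in enumerate(rows):
        line = []
        c = 0
        cells = iter(row)
        cell = next(cells, None)
        while cell is not None or (r, c) in pending:
            if (r, c) in pending:
                # This position is covered by a cell from a previous row.
                line.append(pending.pop((r, c)))
                c += 1
                continue
            text, colspan, rowspan = cell
            for dc in range(colspan):
                line.append(text if dc == 0 else LEFT)
                for dr in range(1, rowspan):
                    pending[(r + dr, c + dc)] = UP
            c += colspan  # shift the column coordinate past the merged cell
            cell = next(cells, None)
        grid.append(line)
    return grid


single_row = expand_spans([[("ABC", 2, 1), ("DEF", 3, 1)]])
two_level = expand_spans([
    [("Basic info", 3, 1), ("Other A", 1, 2)],
    [("Name", 1, 1), ("Age", 1, 1), ("Sex", 1, 1)],
])
```

The first call reproduces the `ABC | {←} | DEF | {←} | {←}` row shown above; the second shows a rowspan cell reserving a position in the following row.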
Judging the table layout in step S3 is the key to the invention: only when the table layout is known can the business data be extracted accurately and the table converted into structured data according to that layout. Judging the table layout in step S3 includes the following operations:
(1) According to the extracted direct content, exclude rows and columns that are not a TL.
Exclusion is based on the data type, length and keywords of the direct content in a TL. The criteria are: the field name in a TL cell cannot be longer than a threshold (for example, 50); the number of field names in a TL cannot exceed a threshold (for example, 1000); a field name is unlikely to be a pure-numeric string; and common field names contain keywords such as "title", "Name", "location", "Address", "address", "type" and "remarks". A keyword library is built from statistics over common tables, and each row or column is checked for keywords from that library.
The specific steps for judging the table layout from the direct content are therefore: check the extracted direct content row by row and column by column; if the data type of a direct content item is a numeric string, the row or column containing it is not a TL; if the field length of a direct content item exceeds a first threshold, the row or column containing it is not a TL; and if several direct content items in a row or column contain given keywords, that row or column is a TL.
When the keyword-based criterion is used, at least two keywords must appear before a row or column is judged to be a TL, to keep the judgment reliable.
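The exclusion rules of operation (1) can be sketched as a small classifier applied to every row and every column. This is a minimal sketch: the function names, the tiny keyword set and the exact-match keyword test are assumptions for illustration; the length threshold of 50 and the two-keyword minimum come from the description, and the limit on the number of field names is omitted for brevity.

```python
import re

MAX_FIELD_LENGTH = 50   # example threshold from the description
MIN_KEYWORD_HITS = 2    # at least two keywords before a line is judged to be a TL
KEYWORDS = {"name", "address", "type", "remarks"}  # illustrative keyword library

NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")


def classify_line(cells):
    """Return 'tl', 'not_tl' or 'unknown' for one row or one column of direct content."""
    values = [c.strip() for c in cells if c.strip()]
    if any(NUMERIC.match(v) for v in values):
        return "not_tl"  # a field name is unlikely to be a pure-numeric string
    if any(len(v) > MAX_FIELD_LENGTH for v in values):
        return "not_tl"  # a field name cannot exceed the length threshold
    hits = sum(1 for v in values if v.lower() in KEYWORDS)
    return "tl" if hits >= MIN_KEYWORD_HITS else "unknown"


def tl_candidates(grid):
    """Classify every row and every column of a rectangular grid."""
    rows = [classify_line(row) for row in grid]
    cols = [classify_line(col) for col in zip(*grid)]
    return rows, cols


grid = [
    ["Name", "Age", "Address"],
    ["Zhang San", "40", "Beijing"],
    ["Li Si", "35", "Shanghai"],
]
row_labels, col_labels = tl_candidates(grid)
```

In this example the first row is judged to be a TL because it contains two keywords, the data rows and the "Age" column are excluded because they contain numeric strings, and the remaining columns stay undecided for the other operations to resolve.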
(2) According to the extracted background-color attribute distribution, judge the table layout.
When a table is displayed, the background color of the TL usually differs from that of the data to make reading easier, or odd and even data rows use alternating background colors. The background-color attribute distribution can therefore be used to judge which row or column is likely to be a TL, and hence whether the table layout is horizontal or vertical.
(3) According to the extracted class attribute distribution, judge the table layout.
Cells with the same class attribute are usually cells of the same kind. If the class attributes of the cells in every row are identical, the table layout is horizontal; if the class attributes of the cells in every column are identical, the table layout is vertical. The class attribute distribution can therefore be used to judge whether the layout is horizontal or vertical.
(4) According to whether the data types of the direct content in the same row or the same column are identical, judge the table layout.
In the business data of a table with the TL removed, the cells under each TL field name should all have the same type unless they are empty (the method distinguishes only three types: pure-numeric strings, date-time strings, and strings with no obvious features). For example, the table in Fig. 2 has a horizontal layout; in its business data part, apart from the TL in the first row, the cells in each column share one data type: the "serial number" column is all pure-numeric strings, and the "court of execution" and "execution case number" columns are all strings with no obvious features. In short, apart from the TL row, every column has a single data type.
Based on this property, check whether the data types within each row are identical: if every row of the table has a single data type (that is, all cells in the same row are pure-numeric strings, or all are date-time strings, or all are strings with no obvious features), the table has a vertical layout. Likewise, check whether the data types within each column are identical: if every column of the table has a single data type, the table has a horizontal layout.
Some cells may be empty. To keep them from affecting the result, empty cells are left out of the row and column checks.
The business data part of a table is usually large, and checking every row and column would slow the judgment down. Short-circuit evaluation can therefore be used: as soon as a newly checked row rules out a layout, the check for that layout stops.
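The type-consistency check of operation (4), with empty cells skipped and each check stopping at the first mixed line, can be sketched as follows. This is a minimal sketch: the function names and the accepted date formats are assumptions, and the input is assumed to be the business data with TL rows or columns already removed. Returning "undetermined" when both or neither direction is uniform is a design choice made here, leaving the decision to the other operations.

```python
import re
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y-%m-%d %H:%M:%S")  # assumed formats


def cell_type(text):
    """Classify a cell as 'number', 'datetime' or 'text'; empty cells give None."""
    text = text.strip()
    if not text:
        return None
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return "number"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "datetime"
        except ValueError:
            continue
    return "text"


def all_lines_uniform(lines):
    """True if every line holds a single data type; stops at the first mixed line."""
    for line in lines:
        kinds = {cell_type(cell) for cell in line} - {None}  # ignore empty cells
        if len(kinds) > 1:
            return False  # short circuit: this layout is already ruled out
    return True


def layout_by_types(body):
    """Judge the layout of the business data from row and column type consistency."""
    columns_uniform = all_lines_uniform(zip(*body))
    rows_uniform = all_lines_uniform(body)
    if columns_uniform and not rows_uniform:
        return "horizontal"
    if rows_uniform and not columns_uniform:
        return "vertical"
    return "undetermined"


body = [
    ["1", "Court A", "2016-01-05"],
    ["2", "Court B", "2016-02-11"],
    ["3", "", "2016-03-20"],
]
transposed = [list(col) for col in zip(*body)]
```

With this data `layout_by_types(body)` reports a horizontal layout and `layout_by_types(transposed)` a vertical one; the empty cell in the second column does not affect either result.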
(5) According to the th/td distribution, judge the table layout.
The number of cells in a TL is less than or equal to the number of cells in the other rows, and the number of cells in non-TL rows should be fairly uniform. The number of cells in every row and column is counted from the th/td distribution; a row or column with markedly fewer cells is likely to be a TL, and the table layout follows from whether the TLs are horizontal or vertical and how many there are.
th is normally used to define headers, which correspond to field names such as 'Name' and 'Age'. The layout of th can also be horizontal or vertical; for example, <th colspan='3'>List of results</th> in a horizontal layout and <th rowspan='3'>List of results</th> in a vertical layout.
td can be used to define ordinary cells, and it can also be used to define headers.
When both th and td tags are present, the table layout is judged according to the th distribution. Many tables do not specify th, however, in which case the table layout is judged according to the td distribution.
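One possible reading of the th-based judgment is sketched below: rows made up entirely of th cells suggest a horizontal TL, and columns made up entirely of th cells suggest a vertical TL. This interpretation, the function name `layout_by_th` and the rectangular tag grid it takes as input are assumptions for illustration; the description does not spell out the rule, and the td-based fallback by cell counts is not covered here.

```python
def layout_by_th(tag_grid):
    """Judge layout from a rectangular grid of 'th'/'td' strings.

    Returns (layout, indices), where indices are the all-th rows for a
    horizontal layout or the all-th columns for a vertical layout.
    """
    th_rows = [r for r, row in enumerate(tag_grid) if all(t == "th" for t in row)]
    th_cols = [c for c, col in enumerate(zip(*tag_grid)) if all(t == "th" for t in col)]
    if th_rows and not th_cols:
        return "horizontal", th_rows
    if th_cols and not th_rows:
        return "vertical", th_cols
    return "undetermined", []  # no th at all, or an ambiguous distribution


horizontal = layout_by_th([["th", "th"], ["td", "td"], ["td", "td"]])
vertical = layout_by_th([["th", "td", "td"], ["th", "td", "td"]])
no_th = layout_by_th([["td", "td"], ["td", "td"]])
```

When the result is "undetermined", a caller would fall back on the td distribution or on the other operations described above.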
The five methods above can be combined in various ways as needed when judging the table layout, which improves the accuracy of the judgment. In addition, the method of this embodiment can recognize tables with multiple TLs, which improves the reliability of the extracted data.
When the table layout is vertical, the table formed by the direct content must also be transposed into a horizontal layout.
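Transposing a vertical layout into a horizontal one is a plain matrix transpose of the two-dimensional array of direct content. A minimal sketch, assuming the grid is rectangular (merged cells already split) and using the hypothetical name `to_horizontal`:

```python
def to_horizontal(grid):
    """Transpose a rectangular 2-D list so that columns become rows."""
    return [list(column) for column in zip(*grid)]


vertical = [
    ["Name", "Zhang San", "Li Si"],
    ["Age", "40", "35"],
]
horizontal = to_horizontal(vertical)
```

After the call, the field names that ran down the first column form the first row, and each following row is one record.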
TLs are further divided into single-level and multi-level TLs, although both are simply called TL unless stated otherwise. Fig. 2 has a single TL that is single-level. Fig. 7 has a single TL that is multi-level (it consists of several rows with a parent-child relationship between levels); the field names in those rows must be merged to output one row of field names. In the original table of Fig. 7, the TL has two parts: the left part is multi-row (multi-level) and the right part is single-row. The first level of the multi-level part is a merged cell with the field name 'basic information', and the second level consists of the fields 'name', 'age' and 'sex'. The final output is a single-level TL with the structure "basic information_name", "basic information_age", "basic information_sex", "other field A", "other field B".
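Merging a multi-level TL into a single row of field names can be sketched as follows. This is a minimal sketch: it assumes the header rows have already been expanded into a rectangular grid with the `"{←}"` marker for horizontally merged cells, and it uses a hypothetical `"{↑}"` marker for cells covered by a rowspan, which the description does not define. The function name `flatten_header` is also an assumption.

```python
LEFT = "{←}"  # marker from the description: same content as the cell to the left
UP = "{↑}"    # hypothetical marker for cells covered by a rowspan (an assumption)


def flatten_header(header_rows):
    """Merge a multi-level header (top level first) into one row of field names."""
    width = len(header_rows[0])
    names = []
    for c in range(width):
        parts = []
        for row in header_rows:
            cell = row[c]
            k = c
            while cell == LEFT:  # walk left to the cell that owns the merge
                k -= 1
                cell = row[k]
            if cell != UP and cell not in parts:
                parts.append(cell)
        names.append("_".join(parts))  # join parent and child with an underscore
    return names


header = [
    ["Basic info", LEFT, LEFT, "Other field A", "Other field B"],
    ["Name", "Age", "Sex", UP, UP],
]
field_names = flatten_header(header)
```

For the two-level header above this yields the parent name joined to each child name for the first three columns, and the unchanged names for the two single-level columns, matching the output structure described for Fig. 7.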
When the table layout is multi-TL, a split-and-merge operation must also be performed on the table formed by the direct content to convert it to a single-TL layout, as structured data requires. The split-and-merge operation comprises: comparing the direct content of the TLs; keeping only one TL row for TLs with identical content, as shown in Fig. 5; and splicing TLs with different content into a single TL row, as shown in Fig. 6.
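The identical-content case of the split-and-merge operation can be sketched as follows: when every TL row has the same content, one copy is kept and the data blocks are stacked under it. This is a minimal sketch; the function name `merge_identical_tls` and the idea of passing in the already-detected TL row indices are assumptions. The case of TLs with different content is deliberately left out, since how the data blocks are arranged after splicing depends on Fig. 6, which is not reproduced here.

```python
def merge_identical_tls(grid, tl_rows):
    """Keep one TL row and drop its repeats; tl_rows lists the detected TL row indices."""
    header = grid[tl_rows[0]]
    if any(grid[i] != header for i in tl_rows):
        raise ValueError("TL rows differ; they would need to be spliced instead")
    skip = set(tl_rows)
    data = [row for i, row in enumerate(grid) if i not in skip]
    return [header] + data


grid = [
    ["Name", "Age"],
    ["Zhang San", "40"],
    ["Name", "Age"],
    ["Li Si", "35"],
]
merged = merge_identical_tls(grid, [0, 2])
```

The repeated header at index 2 is removed and both data rows end up under a single TL.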
Finally, the special markers for merged cells can be resolved according to business needs. For example:
ABC | {←} | {←} | {←} | {←} |
can be adjusted to the following form:
ABC | ABC | ABC | ABC | ABC |
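Resolving the "{←}" marker by copying the value from the cell to its left can be sketched in a few lines. The function name `fill_left_markers` is hypothetical, and whether to fill, blank out or keep the markers is a business decision, as noted above.

```python
LEFT = "{←}"  # marker from the description: same content as the cell to the left


def fill_left_markers(row):
    """Replace each "{←}" marker with the resolved value of the cell to its left."""
    filled = []
    for cell in row:
        if cell == LEFT and filled:
            filled.append(filled[-1])
        else:
            filled.append(cell)
    return filled


one_merge = fill_left_markers(["ABC", LEFT, LEFT, LEFT, LEFT])
two_merges = fill_left_markers(["ABC", LEFT, "DEF", LEFT, LEFT])
```

Because each marker takes the already-resolved value on its left, runs of markers are filled correctly even when a row contains several merged cells.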
The method described above extracts a single table. When a web page contains several Table tags (several tables), the method is simply applied again to each Table tag to extract the corresponding table, and the extraction results are then merged according to predetermined rules.
Based on the same inventive concept as the above method, this embodiment also provides a device for adaptively extracting structured information from HTML Table tags which, as shown in Fig. 8, comprises: a title detection module for detecting the title part in a Table tag; an information extraction module for extracting the multi-dimensional information in the Table tag other than the title part; a table layout judgment module for judging the table layout from the extracted multi-dimensional information; and a post-processing module for post-processing, according to the table layout, the direct content in the multi-dimensional information to obtain structured data.
The device provided by this embodiment first detects the title part in the Table tag, which excludes content that does not belong to the business data and keeps useless data from being mixed in. It then extracts the multi-dimensional information in the Table tag other than the title part and judges the table layout from that information taken together. Because the information in the Table tag reflects the table layout, the new layout can be obtained by analyzing that information however the table in the web page has changed. The device therefore does not need to know the table layout in advance and does not require a new program for each differently structured HTML Table. This addresses the lack of generality in existing Table extraction algorithms and improves the reliability of the extracted data, which is especially effective when recognizing and extracting semi-structured data at scale.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some or all of their technical features replaced by equivalents; such modifications or replacements do not take the essence of the corresponding technical solutions outside the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the claims and description of the present invention.
Claims (10)
1. A method for adaptively extracting structured information from HTML Table tags, characterized by comprising:
detecting the title part in a Table tag;
extracting the multi-dimensional information in the Table tag other than the title part;
judging the table layout from the extracted multi-dimensional information;
according to the table layout, post-processing the direct content in the multi-dimensional information to obtain structured data.
2. The method according to claim 1, characterized in that detecting the title part in the Table tag comprises: checking, row by row, whether each row of the Table tag is a single merged cell; if so, the row belongs to the title part and the next row is checked; if not, title detection stops.
3. The method according to claim 1, characterized in that extracting the multi-dimensional information from the Table tag, excluding the title part, comprises: extracting the multi-dimensional information from the Table tag other than the title part, splitting the merged cells in the extracted information, then storing the information of each dimension in its own two-dimensional array, and specially marking the cells produced by splitting.
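A minimal sketch of the splitting step in claim 3, assuming each cell is a dictionary with optional `rowspan` and `colspan` keys. The cell model, the dimension names `text` and `tag`, and the `split` grid used as the special mark are all illustrative choices, not taken from the patent text.

```python
from typing import Dict, List, Tuple


def expand_merged_cells(rows: List[List[dict]], dimensions=("text", "tag")):
    """Expand rowspan/colspan cells into rectangular grids, one grid per dimension.

    The extra "split" grid is True wherever a slot came from a merged cell.
    """
    slots: Dict[Tuple[int, int], Tuple[dict, bool]] = {}
    n_cols = 0
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip slots already filled by a rowspan from an earlier row.
            while (r, c) in slots:
                c += 1
            rowspan = cell.get("rowspan", 1)
            colspan = cell.get("colspan", 1)
            was_merged = rowspan > 1 or colspan > 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    slots[(r + dr, c + dc)] = (cell, was_merged)
            c += colspan
            n_cols = max(n_cols, c)
    n_rows = max((r for r, _ in slots), default=-1) + 1
    grids = {
        dim: [[slots[(r, c)][0].get(dim) if (r, c) in slots else None
               for c in range(n_cols)] for r in range(n_rows)]
        for dim in dimensions
    }
    grids["split"] = [[(r, c) in slots and slots[(r, c)][1]
                       for c in range(n_cols)] for r in range(n_rows)]
    return grids


table = [
    [{"text": "Name", "tag": "th", "rowspan": 2},
     {"text": "Score", "tag": "th", "colspan": 2}],
    [{"text": "Math", "tag": "th"}, {"text": "Art", "tag": "th"}],
    [{"text": "Ann", "tag": "td"}, {"text": "90", "tag": "td"},
     {"text": "85", "tag": "td"}],
]
grids = expand_merged_cells(table)
for line in grids["text"]:
    print(line)
```

Copying the merged cell's value into every slot it covered keeps all grids rectangular, while the `split` grid lets later steps tell original cells from copies.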
4. The method according to claim 1, characterized in that determining the table layout comprises at least one of the following operations:
Excluding, according to the extracted direct content, rows and columns that are not TL;
Determining the table layout according to the distribution of the extracted background-color attribute;
Determining the table layout according to whether the direct content within the same row or the same column has the same data type;
Determining the table layout according to the th/td distribution.
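The data-type criterion among the operations of claim 4 could be approximated as below. This is a hedged heuristic, not the patent's stated procedure: the two-way `number`/`text` typing, the decision to ignore the first row and first column, and the layout names `horizontal` (landscape) and `vertical` (longitudinal) are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, Sequence

NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")


def kind(text: str) -> str:
    """Coarse data type of one cell: 'number' or 'text' (an assumed typing)."""
    return "number" if NUMERIC.match(text.strip()) else "text"


def uniform_fraction(lines: Iterable[Sequence[str]]) -> float:
    """Fraction of lines whose cells all share one data type."""
    lines = list(lines)
    if not lines:
        return 0.0
    return sum(len({kind(t) for t in line}) == 1 for line in lines) / len(lines)


def layout_from_types(grid: List[List[str]]) -> str:
    # Drop the first row and column, where a header would sit in either layout.
    inner = [row[1:] for row in grid[1:]]
    by_col = uniform_fraction(zip(*inner))
    by_row = uniform_fraction(inner)
    if by_col > by_row:
        return "horizontal"  # type-consistent columns: header is assumed on top
    if by_row > by_col:
        return "vertical"    # type-consistent rows: header is assumed on the left
    return "unknown"


wide = [["Name", "Age", "City"], ["Ann", "31", "Paris"], ["Bob", "27", "Rome"]]
tall = [["Name", "Ann", "Bob"], ["Age", "31", "27"], ["City", "Paris", "Rome"]]
print(layout_from_types(wide), layout_from_types(tall))
```

A table whose cells are all text gives no signal under this heuristic, which is one reason the claim lists several criteria that can be combined.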
5. The method according to claim 4, characterized in that excluding, according to the extracted direct content, rows and columns that are not TL comprises:
Examining the extracted direct content row by row and column by column;
If the data type of a direct content item is a numeric string, the row or column containing that item is not TL;
If the field length of a direct content item exceeds a threshold, the row or column containing that item is not TL;
If multiple direct content items of a given row or column contain given keywords, that row or column is TL.
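The three rules of claim 5 can be sketched as a small classifier. The sketch assumes that TL denotes a header row or column; the threshold `max_len=20`, the requirement of at least two keyword hits, and the keyword list are hypothetical values chosen only for the example.

```python
import re
from typing import List, Optional, Sequence, Tuple

NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?%?$")


def classify_line(items: Sequence[str], keywords: Sequence[str],
                  max_len: int = 20, min_hits: int = 2) -> Optional[bool]:
    """Return False (not TL), True (TL), or None (undecided) for one row or column."""
    for text in items:
        stripped = text.strip()
        if NUMERIC.match(stripped.replace(",", "")):
            return False  # rule 1: a numeric string rules the line out
        if len(stripped) > max_len:
            return False  # rule 2: an over-long field rules the line out
    hits = sum(1 for text in items if any(k in text for k in keywords))
    if hits >= min_hits:
        return True       # rule 3: several items contain given keywords
    return None


def header_candidates(grid: List[List[str]],
                      keywords: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Indices of rows and columns that were not excluded as TL candidates."""
    rows = [r for r, row in enumerate(grid)
            if classify_line(row, keywords) is not False]
    cols = [c for c, col in enumerate(zip(*grid))
            if classify_line(col, keywords) is not False]
    return rows, cols


grid = [["Name", "Age", "City"], ["Ann", "31", "Paris"], ["Bob", "27", "Rome"]]
rows, cols = header_candidates(grid, keywords=("Name", "Age", "City"))
print(rows, cols)
```

Lines that come back undecided remain candidates, so the other criteria of claim 4 can still be applied to them.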
6. The method according to claim 4, characterized in that determining the table layout according to the th/td distribution comprises:
If th is present in the Table tag, determining the table layout according to the th distribution; if th is absent from the Table tag, determining the table layout according to the td distribution.
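The th-first, td-fallback order of claim 6 might look like the sketch below. The claim does not spell out how either distribution maps to a layout, so the mapping used here (th filling the first row means a horizontal, or landscape, layout; th filling the first column means a vertical, or longitudinal, layout) is an assumption, and the td fallback is left as a caller-supplied hypothetical function.

```python
from typing import Callable, List

TagGrid = List[List[str]]  # rectangular grid of "th" / "td" tag names


def unknown_layout(tags: TagGrid) -> str:
    """Placeholder td-based rule; the claim does not define one."""
    return "unknown"


def layout_from_tags(tags: TagGrid,
                     td_fallback: Callable[[TagGrid], str] = unknown_layout) -> str:
    has_th = any(tag == "th" for row in tags for tag in row)
    if not has_th:
        # No th anywhere: defer to the td-based judgment.
        return td_fallback(tags)
    first_row_th = all(tag == "th" for tag in tags[0])
    first_col_th = all(row[0] == "th" for row in tags)
    if first_row_th and not first_col_th:
        return "horizontal"
    if first_col_th and not first_row_th:
        return "vertical"
    return "unknown"


top_header = [["th", "th", "th"], ["td", "td", "td"]]
left_header = [["th", "td", "td"], ["th", "td", "td"]]
print(layout_from_tags(top_header), layout_from_tags(left_header))
```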
7. The method according to claim 4, characterized by further comprising: if the table layout is determined to be a vertical layout, transposing the table formed by the direct content into a horizontal layout.
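The transposition in claim 7 is a plain matrix transpose of the content grid. The sketch below assumes a rectangular grid, the layout labels `vertical` and `horizontal`, and a hypothetical follow-up step `to_records` that turns the normalized grid into dictionaries.

```python
from typing import Dict, List


def to_horizontal(grid: List[List[str]], layout: str) -> List[List[str]]:
    """Transpose a vertical-layout grid so that the header becomes the first row."""
    if layout == "vertical":
        return [list(col) for col in zip(*grid)]
    return [list(row) for row in grid]


def to_records(grid: List[List[str]]) -> List[Dict[str, str]]:
    """Pair each body row with the header row (illustrative structured output)."""
    header, *body = grid
    return [dict(zip(header, row)) for row in body]


vertical = [["Name", "Ann", "Bob"], ["Age", "31", "27"]]
normalized = to_horizontal(vertical, "vertical")
records = to_records(normalized)
print(records)
```

Normalizing every table to one orientation means the later steps only need to handle the header-on-top case.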
8. The method according to claim 4, characterized by further comprising: if the table layout is determined to be multi-TL, performing a split-and-merge operation on the table formed by the direct content to convert it into a single-TL layout.
9. The method according to claim 8, characterized in that, if the table layout is determined to be multi-TL, performing the split-and-merge operation on the table formed by the direct content to convert it into a single-TL layout comprises:
Comparing the direct content of the multiple TLs;
For TLs with identical content, retaining only one TL row;
Splicing TLs with different content into a single TL row.
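One possible reading of the split-and-merge steps in claim 9 is sketched below. It assumes the table has already been cut into blocks, each a TL row plus the body rows beneath it; blocks with identical TLs are stacked under a single TL, and blocks with different TLs are spliced side by side with body rows joined by position. The positional join and the empty-string padding are assumptions, not details given in the claim.

```python
from itertools import zip_longest
from typing import Dict, List, Tuple

Block = Tuple[List[str], List[List[str]]]  # (TL row, body rows); hypothetical model


def merge_blocks(blocks: List[Block]) -> Tuple[List[str], List[List[str]]]:
    # Identical TLs keep a single TL row; their body rows are concatenated.
    grouped: Dict[Tuple[str, ...], List[List[str]]] = {}
    for header, body in blocks:
        grouped.setdefault(tuple(header), []).extend(list(row) for row in body)
    # Different TLs are spliced into one TL row, in first-seen order.
    merged_header = [name for header in grouped for name in header]
    merged_body = []
    for parts in zip_longest(*grouped.values()):
        row: List[str] = []
        for header, part in zip(grouped, parts):
            # Pad with empty strings when one block has fewer body rows.
            row.extend(part if part is not None else [""] * len(header))
        merged_body.append(row)
    return merged_header, merged_body


same = [(["Name", "Age"], [["Ann", "31"]]), (["Name", "Age"], [["Bob", "27"]])]
different = [(["Name", "Age"], [["Ann", "31"], ["Bob", "27"]]),
             (["City"], [["Paris"]])]
print(merge_blocks(same))
print(merge_blocks(different))
```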
10. A device for adaptively extracting structured information from an HTML Table tag, characterized by comprising:
A title part detection module, configured to detect the title part in the Table tag;
An information extraction module, configured to extract multi-dimensional information from the Table tag, excluding the title part;
A table layout determination module, configured to determine the table layout according to the extracted multi-dimensional information;
A post-processing module, configured to post-process the direct content in the multi-dimensional information according to the table layout to obtain structured data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611234018.3A CN106777259A (en) | 2016-12-28 | 2016-12-28 | The method and device of structured message in adaptive decimation HTML Table labels |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611234018.3A CN106777259A (en) | 2016-12-28 | 2016-12-28 | The method and device of structured message in adaptive decimation HTML Table labels |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106777259A (en) | 2017-05-31 |
Family
ID=58924561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611234018.3A Pending CN106777259A (en) | 2016-12-28 | 2016-12-28 | The method and device of structured message in adaptive decimation HTML Table labels |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106777259A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107992625A (en) * | 2017-12-25 | 2018-05-04 | 湖南星汉数智科技有限公司 | A kind of automatic abstracting method of web page form data and device |
CN109710771A (en) * | 2018-10-30 | 2019-05-03 | 北京百度网讯科技有限公司 | Form data extracting method, device and storage medium |
CN110321530A (en) * | 2019-06-28 | 2019-10-11 | 南京智录信息科技有限公司 | Table semantization resolution system technology |
CN110334331A (en) * | 2019-05-30 | 2019-10-15 | 重庆金融资产交易所有限责任公司 | Method, apparatus and computer equipment based on order models screening table |
CN110598194A (en) * | 2019-08-09 | 2019-12-20 | 平安科技(深圳)有限公司 | Method and device for extracting content of non-full-grid table and terminal equipment |
CN112380826A (en) * | 2020-11-12 | 2021-02-19 | 中国农业银行股份有限公司佛山分行 | Formatted electronic form generation method based on text file |
CN113656592A (en) * | 2021-07-22 | 2021-11-16 | 北京百度网讯科技有限公司 | Data processing method and device based on knowledge graph, electronic equipment and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663023A (en) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Implementation method for extracting web content |
US8600987B2 (en) * | 2007-10-11 | 2013-12-03 | Google Inc. | Classifying search results to determine page elements |
CN106156239A (en) * | 2015-04-27 | 2016-11-23 | 中国移动通信集团公司 | A kind of form abstracting method and device |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8600987B2 (en) * | 2007-10-11 | 2013-12-03 | Google Inc. | Classifying search results to determine page elements |
CN102663023A (en) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Implementation method for extracting web content |
CN106156239A (en) * | 2015-04-27 | 2016-11-23 | 中国移动通信集团公司 | A kind of form abstracting method and device |
Non-Patent Citations (1)
Title |
---|
林科锵: ""Web页中表格结构识别的研究与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107992625A (en) * | 2017-12-25 | 2018-05-04 | 湖南星汉数智科技有限公司 | A kind of automatic abstracting method of web page form data and device |
CN109710771A (en) * | 2018-10-30 | 2019-05-03 | 北京百度网讯科技有限公司 | Form data extracting method, device and storage medium |
CN110334331A (en) * | 2019-05-30 | 2019-10-15 | 重庆金融资产交易所有限责任公司 | Method, apparatus and computer equipment based on order models screening table |
CN110321530A (en) * | 2019-06-28 | 2019-10-11 | 南京智录信息科技有限公司 | Table semantization resolution system technology |
CN110598194A (en) * | 2019-08-09 | 2019-12-20 | 平安科技(深圳)有限公司 | Method and device for extracting content of non-full-grid table and terminal equipment |
CN112380826A (en) * | 2020-11-12 | 2021-02-19 | 中国农业银行股份有限公司佛山分行 | Formatted electronic form generation method based on text file |
CN112380826B (en) * | 2020-11-12 | 2024-03-22 | 中国农业银行股份有限公司佛山分行 | Formatting electronic form generating method based on text file |
CN113656592A (en) * | 2021-07-22 | 2021-11-16 | 北京百度网讯科技有限公司 | Data processing method and device based on knowledge graph, electronic equipment and medium |
CN113656592B (en) * | 2021-07-22 | 2022-09-27 | 北京百度网讯科技有限公司 | Data processing method and device based on knowledge graph, electronic equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106777259A (en) | The method and device of structured message in adaptive decimation HTML Table labels | |
CN106709032B (en) | Method and device for extracting structured information in electronic form document | |
CN109885692B (en) | Knowledge data storage method, apparatus, computer device and storage medium | |
CN101727461B (en) | Method for extracting content of web page | |
CN110968667B (en) | Periodical and literature table extraction method based on text state characteristics | |
CN110795919B (en) | Form extraction method, device, equipment and medium in PDF document | |
CN106156239B (en) | Table extraction method and device | |
CN109522452B (en) | Processing method of massive semi-structured data | |
CN102663023A (en) | Implementation method for extracting web content | |
CN109492177B (en) | web page blocking method based on web page semantic structure | |
CN106777281A (en) | For improving web crawlers stability, the data processing method of availability and device | |
CN105630817A (en) | Electronic invoice content analysis method and system | |
CN106407195B (en) | Method and system for web page duplication elimination | |
CN114153962A (en) | Data matching method and device and electronic equipment | |
CN110738050A (en) | Text recombination method, device and medium based on word segmentation and named entity recognition | |
CN110390037B (en) | Information classification method, device and equipment based on DOM tree and storage medium | |
JP2008077634A (en) | Method and apparatus for automatic form filling on mobile device | |
CN103389981A (en) | Network label automatic identification method and system thereof | |
CN101996190A (en) | Method and device for extracting information from webpage | |
CN102479072B (en) | Multi-header report generating method, device and terminal | |
CN114462383B (en) | Method, system, storage medium and equipment for obtaining design specification of building drawing | |
CN113642291B (en) | Method, system, storage medium and terminal for constructing logical structure tree reported by listed companies | |
CN103136187A (en) | Method and system for extraction of patent rejection information | |
CN111581928B (en) | System and method for automatically constructing scientific and technological text analysis report with zero participation of user | |
CN114220113A (en) | Paper quality detection method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170531 |