CN106156239B - Table extraction method and device - Google Patents


Info

Publication number
CN106156239B
Authority
CN
China
Prior art keywords
header
row
content
rows
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510205847.8A
Other languages
Chinese (zh)
Other versions
CN106156239A (en)
Inventor
周文辉
冯俊兰
黄毅
杨文漪
施瑶
杨瑞兵
邵超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd
Priority to CN201510205847.8A
Publication of CN106156239A
Application granted
Publication of CN106156239B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a table extraction method, which comprises the steps of reading the content of a source table, storing the content of the source table into at least one two-dimensional table according to the content of the source table, reading the header of the source table, extracting the header according to the number of rows of the header, determining the header item according to the extracted header, establishing a table processing model according to the at least one two-dimensional table, and aligning the content of the table in the table processing model with the header item by using the content similarity; the invention also discloses a table extraction device.

Description

Table extraction method and device
Technical Field
The invention relates to a webpage (Web) analysis technology, in particular to a table extraction method and a table extraction device.
Background
Tables are widely used in Web documents as an important form of information presentation; statistically, about 52% of Web pages contain tables. In a table, syntactic and semantic concepts are intermixed, and each logical cell obtains its semantics from its relative position. How to make a machine extract table information accurately is therefore a challenging problem. Moreover, tables are an important knowledge carrier and, unlike completely unstructured data, are semi-structured; if tables can be extracted correctly, they will contribute greatly to subsequent knowledge structuring.
At present, most data tables on the Web are still described in HTML, which lacks a description of the data, contains no explicit semantic information and has no clear schema, so extracting Web tables is more difficult than extracting traditional tables.
Existing methods are either supervised or unsupervised. The supervised method processes data by using the Web structure: it parses a Web page into a DOM tree and extracts Web table data with a path-pattern-based extraction method. The unsupervised method uses a top-down restricted tree edit distance: based on the structural characteristics of the Web source code and its parse tree, it compares the structural differences of Web information with a top-down tree comparison.
When annotated data is absent or insufficient, the supervised method cannot train a suitable model and is therefore not applicable. In the unsupervised method, extracting tables only according to the structural features of the Web source code and parse tree is unreliable, because many tables have identical parse trees but are not semantically consistent.
Disclosure of Invention
In order to solve the existing technical problems, the invention mainly provides a table extraction method and a table extraction device.
The technical scheme of the invention is realized as follows:
The invention provides a table extraction method, which comprises the following steps:
reading the content of a source table, and storing the source table into at least one two-dimensional table according to that content;
reading the header of the source table, and extracting the header according to the number of header rows;
determining header items according to the extracted header, and establishing a table processing model according to the at least one two-dimensional table;
aligning the table content in the table processing model with the header items by using content similarity.
In the above solution, the reading the content of the source table and storing the content of the source table into at least one two-dimensional table includes:
reading the content of the source table; determining the number of header rows according to the number of rows occupied by the source cells in the first row; removing the header according to the number of header rows; determining, according to the number of rows occupied by the cells in the first column, the number of two-dimensional tables into which the source table is split; partitioning the source table into table blocks according to the rows occupied by the first-column cells, each table block corresponding to one two-dimensional table; traversing the contents of all table blocks to determine the number of rows and columns of each corresponding two-dimensional table; creating and initializing each two-dimensional table; reading the value of the table empty row flag of each table block to determine whether the block is an abnormal table or a normal table; and extracting the source cell content of each table block according to the corresponding extraction rule and storing it into the corresponding two-dimensional table.
In the foregoing solution, the reading the header of the source table and extracting the header according to the number of rows of the header includes:
reading and recording the starting row position, the number of rows and the number of columns of each source cell in the hypertext markup language (HTML) tags of the source table; determining the number of header rows according to max, the number of rows occupied by the source cells in the first row; normalizing the header into a two-dimensional table; and extracting the title in each row of source cells from the two-dimensional table according to the number of header rows.
In the foregoing solution, the extracting the header in each row of the source cells from the two-dimensional table according to the number of the header rows includes:
when the number of header rows is 1, directly extracting the titles in the source cells of the first row as the header;
when the number of header rows is 2, extracting the first row and the second row as row titles and row subtitles, and appending all the row subtitles to the row title, separated by #;
when the number of header rows is 3, extracting the first, second and third rows as titles, subtitles and grandchild titles, appending all the subtitles to the title, separated by #, and appending all the grandchild titles to the subtitles, separated by #.
In the foregoing solution, the determining the header entry according to the extracted header includes:
when the header only consists of one header row, each table cell in the header is defined as a header item;
and when the header comprises a plurality of header rows, splicing according to the hierarchical sequence from top to bottom to obtain a header item.
In the foregoing solution, the establishing a table processing model according to the at least one two-dimensional table includes:
defining the header as a vector H and each header item as h, where a header contains a plurality of header items, denoted H = <h1, h2, ..., hn>, where n ∈ [1, number of header items];
defining the table content as D; obtaining the content part of the table from the at least one two-dimensional table formed according to the content of the source table; dividing the table content into rows and defining each row as d, so that the table content is represented as D = <d1, d2, ..., dn>; defining the cell in the ith row and jth column of the table content matrix as dij; and defining the ith row as di = <di1, di2, ..., din>, where n ∈ [1, number of header items].
In the foregoing solution, the aligning table contents with table head items in the table processing model by using content similarity includes:
according to the number of header items of the table processing model, searching the matrix rows of the table content matrix for regular table rows, and aligning the table content of the regular table rows with the header items;
and for the remaining non-regular table rows, taking the aligned regular table rows as a reference, searching column by column for table contents in the non-regular rows that occupy the same matrix column width and aligning them with the corresponding header items; performing similarity calculation between the remaining unaligned table contents and the aligned table contents, finding the aligned table content with the highest similarity, and taking the header item corresponding to that content as the alignment target; and iterating the column-wise similarity calculation over the remaining unaligned table contents until the alignment is complete.
The invention also provides a table extraction device, comprising: a first extraction module, a second extraction module, a model building module and an alignment module; wherein:
the first extraction module is used for reading the content of the source table and storing the source table into at least one two-dimensional table according to that content;
the second extraction module is used for reading the header of the source table and extracting the header according to the number of header rows;
the model building module is used for determining header items according to the extracted header and establishing a table processing model according to the at least one two-dimensional table;
and the alignment module is used for aligning the table content in the table processing model with the header items by using content similarity.
In the above solution, the first extraction module is specifically configured to: read the content of the source table; determine the number of header rows according to the number of rows occupied by the source cells in the first row; remove the header according to the number of header rows; determine, according to the number of rows occupied by the cells in the first column, the number of two-dimensional tables into which the source table is split; partition the source table into table blocks according to the rows occupied by the first-column cells, each table block corresponding to one two-dimensional table; traverse the contents of all table blocks to determine the number of rows and columns of each corresponding two-dimensional table; create and initialize each two-dimensional table; read the value of the table empty row flag of each table block to determine whether the block is an abnormal table or a normal table; and extract the source cell content of each table block according to the corresponding extraction rule and store it into the corresponding two-dimensional table.
In the above solution, the second extraction module is specifically configured to: read and record the starting row position, the number of rows and the number of columns of each source cell in the HTML tags of the source table; determine the number of header rows according to max, the number of rows occupied by the source cells in the first row; normalize the header into a two-dimensional table; and extract the title in each row of source cells from the two-dimensional table according to the number of header rows.
In the above solution, the second extraction module is specifically configured to: when the number of header rows is 1, directly extract the titles in the source cells of the first row as the header;
when the number of header rows is 2, extract the first row and the second row as row titles and row subtitles, and append all the row subtitles to the row title, separated by #;
when the number of header rows is 3, extract the first, second and third rows as titles, subtitles and grandchild titles, append all the subtitles to the title, separated by #, and append all the grandchild titles to the subtitles, separated by #.
In the above scheme, the model building module is specifically configured to define each table cell in the header as a header item when the header only consists of one header row; and when the header comprises a plurality of header rows, splicing according to the hierarchical sequence from top to bottom to obtain a header item.
In the above solution, the model building module is specifically configured to define the header as a vector H and each header item as h, where a header contains a plurality of header items, denoted H = <h1, h2, ..., hn>, where n ∈ [1, number of header items];
and to define the table content as D; obtain the content part of the table from the at least one two-dimensional table formed according to the content of the source table; divide the table content into rows and define each row as d, so that the table content is represented as D = <d1, d2, ..., dn>; define the cell in the ith row and jth column of the table content matrix as dij; and define the ith row as di = <di1, di2, ..., din>, where n ∈ [1, number of header items].
In the above solution, the alignment module is specifically configured to: according to the number of header items of the table processing model, search the matrix rows of the table content matrix for regular table rows, and align the table content of the regular table rows with the header items; and, for the remaining non-regular table rows, take the aligned regular table rows as a reference, search column by column for table contents in the non-regular rows that occupy the same matrix column width and align them with the corresponding header items; perform similarity calculation between the remaining unaligned table contents and the aligned table contents, find the aligned table content with the highest similarity, and take the header item corresponding to that content as the alignment target; and iterate the column-wise similarity calculation over the remaining unaligned table contents until the alignment is complete.
The invention provides a table extraction method and a device, which are characterized in that the content of a source table is read, the source table is stored into at least one two-dimensional table according to the content of the source table, a table header of the source table is read, the table header is extracted according to the number of the table header lines, a table header item is determined according to the extracted table header, a table processing model is established according to the at least one two-dimensional table, and the table content in the table processing model is aligned with the table header item by utilizing the content similarity; thus, compared with the supervised method, the technical scheme of the invention does not need training data; compared with an unsupervised method, the technical scheme of the invention not only utilizes the structure information of the table, but also utilizes the content information of the table, thus the tables with the same structure but different semantics can be extracted correctly.
Drawings
Fig. 1 is a schematic flowchart of a table extraction method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a table extraction apparatus according to an embodiment of the present invention.
Detailed Description
As knowledge bases become increasingly important, much knowledge is expected to be converted into triples and stored in a knowledge base. One common way to build a knowledge base is to obtain knowledge from tables, that is, to perform table extraction, which includes header extraction and attribute alignment of the table contents.
Because many knowledge-bearing tables were not originally designed for building a knowledge base, much of their content cannot be used by a knowledge base directly. For example, on platforms that accumulate knowledge through crowdsourcing, such as online encyclopedias and Wikipedia, tables are designed by many different users and vary greatly in form; yet tables are a powerful way of expressing much important knowledge and deserve close attention. Much of the knowledge in relational databases is also presented in tabular form. For example, in an encyclopedia, the roles played by many actors are shown in a table; if table extraction is performed correctly, these knowledge entries can be put to use.
In the embodiment of the invention, the content of a source table is read, the source table is stored into at least one two-dimensional table according to the content of the source table, the table head of the source table is read, the table head is extracted according to the number of the table head lines, the table head item is determined according to the extracted table head, a table processing model is established according to the at least one two-dimensional table, and the table content in the table processing model is aligned with the table head item by utilizing the content similarity.
The invention is further described in detail below with reference to the figures and the specific embodiments.
The following key terms are used in the embodiments of the invention:
source cell content: comprises the number of rows occupied by a source cell, the number of columns occupied by the source cell, the text content of the source cell, the row index of the source cell and the column index of the source cell;
two-dimensional table: a table with n rows and m columns, used for storing the source cell contents of the source table;
atomic cell: the smallest cell in a two-dimensional table;
table cursor: used for storing the content of one row of the two-dimensional table;
table empty row flag: used for marking how many empty rows a table contains, in particular whether it contains two or more.
The embodiment of the invention realizes a table extraction method, which is applied to machine equipment such as a server, a PC and the like, and as shown in figure 1, the method comprises the following steps:
step 101: reading the content of the source table, and storing the content into at least one two-dimensional table according to the content of the source table;
Specifically, the content of the source table is read; the number of header rows is determined according to the number of rows occupied by the source cells in the first row; after the header is removed according to the number of header rows, the number of two-dimensional tables into which the source table is split is determined according to the number of rows occupied by the cells in the first column; the source table is partitioned according to the rows occupied by the first-column cells, each table block corresponding to one two-dimensional table; the contents of all table blocks are traversed to determine the number of rows and columns of each corresponding two-dimensional table; each two-dimensional table is created and initialized; the value of the table empty row flag of each table block is read to determine whether the block is an abnormal table or a normal table; and the source cell content of each table block is extracted according to the corresponding extraction rule and stored into the corresponding two-dimensional table.
Step 101 may include:
1) reading the content of the source table; determining, according to the number of rows occupied by the first-column cells of the source table excluding the header, the number of two-dimensional tables into which the source table is split; and partitioning the source table according to the rows occupied by the first-column cells, each table block corresponding to one two-dimensional table; taking table 1 as an example, it is split into 3 normalized two-dimensional tables;
2) for the table blocks split in step 1), traversing the contents of all table blocks, determining the number of rows n and the maximum number of columns m of each corresponding two-dimensional table, and creating and initializing each two-dimensional table; the three two-dimensional tables created from table 1 have, in order, 13 rows and 7 columns, 4 rows and 4 columns, and 1 row and 2 columns;
3) determining whether each table block is a normal table or an abnormal table;
specifically, for each table block from step 2), the number of empty rows in the table block is checked to determine whether the block can be processed; the number of empty rows may be obtained by reading the value of the table empty row flag. If the value of the flag is greater than or equal to 2, the data of the table block is considered seriously erroneous and unprocessable, and the block is discarded; if the value is equal to 1, the table block is considered an abnormal table and is extracted according to the abnormal table rule; if the value is equal to 0, the table block is considered a normal table and is extracted according to the normal table rule;
Figure GDA0000748503730000071
Figure GDA0000748503730000081
TABLE 1
4) extracting and normalizing the table contents.
Specifically, for the table blocks from step 2), step 5) is executed on all normal tables and step 6) is executed on all abnormal tables, to extract and normalize the table contents;
5) the normal table normalization steps are as follows:
5.1) reading the first row of the table block to obtain the source cell content, splitting the obtained source cell content into atomic cell content according to the number of rows and columns occupied by the source cell, and filling it into the atomic cells of the table cursor;
5.2) traversing the table block and recording the row number of each row, where row numbers start from zero; each time a row is traversed, the row index attribute in the table cursor is reduced by one, so as to obtain the content of the next row of the two-dimensional table;
5.3) assigning the content of each atomic cell in the table cursor from step 5.2) to the corresponding position of the corresponding row in the two-dimensional table;
5.4) repeating until the table block has been fully traversed, which yields a complete two-dimensional table.
6) The abnormal table normalization steps are as follows:
6.1) traversing the table block and reading the content of each source cell of each row;
6.2) splitting the obtained source cell content into atomic cell content according to the number of rows and columns occupied by the source cell, and filling it into the atomic cells of the table cursor; if an empty row is encountered, assigning the content of the previous row to the empty row and subtracting one from the column index attribute in the table cursor; if a null value is encountered, filling a null value at the corresponding row index and column index in the table cursor;
6.3) filling the values of the table cursor obtained in 6.2) into the corresponding positions in the two-dimensional table;
6.4) repeating until the table block has been fully traversed, which yields a complete two-dimensional table.
From the above steps and table 1, the following two-dimensional tables, table 2, table 3, table 4, can be finally obtained:
Figure GDA0000748503730000091
Figure GDA0000748503730000101
TABLE 2
Figure GDA0000748503730000102
TABLE 3
Taiwan, Hong Kong and Macao / international roaming | Charged according to the Taiwan, Hong Kong and Macao and international roaming tariff standard
TABLE 4
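The splitting of spanning source cells into atomic cells in step 101 can be sketched as follows. This is a minimal illustration rather than the patented procedure: it assumes each source cell is supplied as a hypothetical `(text, rowspan, colspan)` tuple, it does not model the table cursor or the empty-row rules, and the name `normalize` is chosen only for this example.

```python
from typing import Dict, List, Optional, Tuple

# Assumed input format (not from the source): (text, rowspan, colspan).
SourceCell = Tuple[str, int, int]


def normalize(rows: List[List[SourceCell]]) -> List[List[Optional[str]]]:
    """Expand spanning source cells into a rectangular grid of atomic cells."""
    filled: Dict[Tuple[int, int], str] = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already taken by a cell spanning down from above.
            while (r, c) in filled:
                c += 1
            # Copy the text into every atomic cell the source cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan
    if not filled:
        return []
    n_rows = max(r for r, _ in filled) + 1
    n_cols = max(c for _, c in filled) + 1
    # Positions never covered by any source cell stay None.
    return [[filled.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# Illustrative two-row header with one cell spanning two columns.
grid = normalize([
    [("Package name", 2, 1), ("Transaction channel", 1, 2)],
    [("Entity channel", 1, 1), ("Electronic channel", 1, 1)],
])
for line in grid:
    print(line)
```

Repeating the text of a spanning cell in every atomic cell it covers is what later allows adjacent cells with identical content to be recognised as one title.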
Step 102: reading a header of a source table, and extracting the header according to the number of rows of the header;
Specifically, the starting row position, the number of rows and the number of columns of each source cell are read from the HTML tags of the source table in the Web page and recorded; the number of header rows is determined according to max, the number of rows occupied by the source cells in the first row; the header is normalized into a two-dimensional table; and the title in each row of source cells is extracted from the two-dimensional table according to the number of header rows. The header may be normalized into a two-dimensional table by the method of step 101.
Determining the number of header rows according to max, the number of rows occupied by the source cells in the first row, comprises:
when max is greater than 1, the number of header rows is max;
when max is 1 and the first row has only one source cell, the number of rows occupied by the source cells in the second row is recorded as max', and the number of header rows is max' + 1;
when max is 1 and the first row has multiple cells, the number of header rows is 1.
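The three rules above can be sketched as a small function. The function name, the list-of-rowspans input and the treatment of max as the largest rowspan in a row are assumptions made for illustration; the fallback when no second row exists is likewise an assumption, since the text does not cover that case.

```python
from typing import List, Optional


def header_row_count(first_row_spans: List[int],
                     second_row_spans: Optional[List[int]] = None) -> int:
    """Return the number of header rows from the rowspans of the first rows.

    first_row_spans:  rowspan of each source cell in the first row.
    second_row_spans: rowspan of each source cell in the second row
                      (only consulted when the first row is a single cell).
    """
    max_span = max(first_row_spans)
    if max_span > 1:
        # Rule 1: a spanning cell in the first row fixes the header height.
        return max_span
    if len(first_row_spans) == 1:
        # Rule 2: a lone first-row cell is followed by further header rows;
        # the header height is max' + 1.
        # Assumption: with no second row, max' is treated as 0.
        max_prime = max(second_row_spans) if second_row_spans else 0
        return max_prime + 1
    # Rule 3: several plain cells in the first row form a one-row header.
    return 1


print(header_row_count([2, 2, 2, 2, 1]))   # rule 1
print(header_row_count([1], [2, 2, 1]))    # rule 2
print(header_row_count([1, 1, 1]))         # rule 3
```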
For example: header a becomes header b after normalization:
Figure GDA0000748503730000103
Header a
Package type | Package name | Open cities | Open object | Transaction channel | Transaction channel
Package type | Package name | Open cities | Open object | Entity channel | Electronic channel
Header b
When the number of header rows is 1, the header is extracted as follows:
if the number of header rows is 1, the header occupies one row, and the titles in the source cells of the first row are directly extracted as the header; for example, in table c only "communication status, tariff" is extracted as the header.
Figure GDA0000748503730000111
Table c
When the number of header rows is 2, the header is extracted as follows:
if the number of header rows is 2, the header occupies two rows; the first row and the second row are extracted as row titles and row subtitles, and all the row subtitles are appended to the row title, separated by #;
for example, the header of table d is extracted as: "package name, open object, transaction channel # entity channel # electronic channel".
Figure GDA0000748503730000112
Table d
Here, the row title and row subtitles are determined by checking, for each cell in the normalized two-dimensional table, whether the cell to its right has the same content; if so, the two cells belong to the same title, and the contents below them are extracted as the subtitles of that title. For example, the content of row 1, column 5 of header b is "transaction channel" and the cell to its right is also "transaction channel"; the two different cells below them contain "entity channel" and "electronic channel", so "entity channel" and "electronic channel" are extracted as the subtitles of the "transaction channel" title.
When the number of header rows is 3, the header is extracted as follows:
if the number of header rows is 3, the header occupies three rows; the first, second and third rows are extracted as titles, subtitles and grandchild titles, all the subtitles are appended to the title, separated by #, and all the grandchild titles are appended to the subtitles, separated by #.
For example, table e is extracted as "non-contracted product" shenzhou line ease card series tariff marketing plan (based on the actual online tariff of BOSS) # name # tariff # monthly rented local caller # local callee # 17951 long distance (including local phone access fee) # national roaming caller # face value and "expiration date # remark" in 2007.
Figure GDA0000748503730000121
Table e
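The #-joined header extraction for one, two or three header rows can be sketched as follows. This is a hedged illustration that operates on an already normalized header grid; the function name `extract_header` is hypothetical, and the ordering for three rows (all subtitles first, then all grandchild titles) is one reading of the text, not a confirmed detail.

```python
from typing import List


def extract_header(grid: List[List[str]]) -> List[str]:
    """Join each top-level title with its lower-level titles using '#'.

    `grid` is a normalized header: every row has the same length, and a
    spanning cell's text is repeated in each atomic cell it covers.
    """
    top = grid[0]
    result: List[str] = []
    col = 0
    while col < len(top):
        # Adjacent cells with identical content belong to the same title.
        end = col
        while end + 1 < len(top) and top[end + 1] == top[col]:
            end += 1
        parts = [top[col]]
        # Assumption: lower rows are appended row by row, so subtitles come
        # before grandchild titles; repeated labels are added only once.
        for row in grid[1:]:
            for label in row[col:end + 1]:
                if label not in parts:
                    parts.append(label)
        result.append("#".join(parts))
        col = end + 1
    return result


header_b = [
    ["Package name", "Open object", "Transaction channel", "Transaction channel"],
    ["Package name", "Open object", "Entity channel", "Electronic channel"],
]
print(extract_header(header_b))
```

One limitation of this sketch is that two genuinely different adjacent columns that happen to carry the same title would be merged into one group.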
Step 103: determining a table head item according to the extracted table head, and establishing a table processing model according to the at least one two-dimensional table;
First, the header items are parsed from the header extracted in the second stage (step 102), so that the alignment targets are clear. In practice there are two kinds of header: one contains only one header row; the other contains a plurality of header rows. The two cases are handled separately:
When the header consists of only one header row, each table cell in the header is defined as a header item, and each header item is a definition or description of the table content below it; in this case, aligning the header with the table content means finding the correspondence between each header cell and the table content.
When the header contains a plurality of header rows, the header rows form a hierarchy of meaning. That is, for two adjacent header rows, the upper row is a summary or abstraction of the corresponding lower row, and the lower row is a concretization or clarification of the corresponding upper row. For a header with multiple header rows, the lowest header row is therefore the core description of the corresponding table content, and a meaningful header item can be obtained by splicing the rows in hierarchical order from top to bottom. In this case, aligning the header with the table content means establishing the correspondence between the spliced header items and the table content.
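The top-to-bottom splicing of a multi-row header into one header item per column can be sketched as follows. The function name `header_items` and the "/" separator are assumptions for illustration; the text does not state which separator, if any, is used when splicing.

```python
from typing import List


def header_items(grid: List[List[str]], sep: str = "/") -> List[str]:
    """Build one header item per column from a normalized header grid."""
    items: List[str] = []
    for col in range(len(grid[0])):
        path: List[str] = []
        for row in grid:  # walk the hierarchy from top to bottom
            label = row[col]
            # A cell spanning several rows repeats its text; keep it once.
            if not path or path[-1] != label:
                path.append(label)
        items.append(sep.join(path))
    return items


header = [
    ["Package name", "Transaction channel", "Transaction channel"],
    ["Package name", "Entity channel", "Electronic channel"],
]
print(header_items(header))
```

With a single header row the function returns the cells unchanged, which matches the one-row case described above.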
After the header entries and the alignment target have been determined, the table processing model is established as follows:
(1) The header of a table is defined as a vector $H$ and each header entry as $h$. A header contains several header entries and is written $H = \langle h_1, h_2, \ldots, h_n \rangle$, where $n \in [1, \text{number of header entries}]$.
(2) The table contents of a table are defined as $D$; the content part is obtained from the at least one two-dimensional table formed from the contents of the source table. The table contents are divided into rows, each row is defined as $d$, and the table contents are written $D = \langle d_1, d_2, \ldots, d_n \rangle$. The matrix cell in the $i$-th row and $j$-th column of the table content matrix is defined as $d_{ij}$, and the $i$-th row is $d_i = \langle d_{i1}, d_{i2}, \ldots, d_{in} \rangle$, where $n \in [1, \text{number of header entries}]$.
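As a rough illustration of this model, the header vector H and the content matrix D can be held in a small container. The class name `TableModel` and the helper `build_model` are hypothetical names chosen for this sketch; the 1-based accessor only mirrors the d_ij notation of the description.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TableModel:
    H: List[str]        # header entries h_1 .. h_n
    D: List[List[str]]  # content rows d_i, each a list of cells d_ij

    def cell(self, i: int, j: int) -> str:
        """Return d_ij using the 1-based indices of the description."""
        return self.D[i - 1][j - 1]


def build_model(header_entries: Sequence[str],
                table_2d: Sequence[Sequence[str]]) -> TableModel:
    # Copy the inputs so later alignment steps cannot mutate the caller's data.
    return TableModel(H=list(header_entries), D=[list(row) for row in table_2d])


model = build_model(["Name", "Tariff"], [["Card A", "0.2"], ["Card B", "0.3"]])
print(model.cell(2, 1))  # Card B
```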
Step 104: align the table contents with the header entries in the table processing model using content similarity.
Specifically, the regular table rows among the matrix rows of the table content matrix are identified according to the number of header entries in the table processing model, and the table contents of each regular table row are aligned with the header entries. The alignment result for a matrix row is the correspondence between all header entries and the table contents in that row. Taking the $i$-th row as an example, the result has the form $\{h_1: d_{ix}, h_2: d_{iy}, \ldots, h_n: d_{iz}\}$, where $x$, $y$ and $z$ are the column numbers of the matrix cells aligned with the respective header entries, and $x, y, z \le n$.
For the remaining non-regular table rows, the aligned regular table rows serve as a reference: proceeding column by column, the table contents in a non-regular table row that occupy the same matrix column width as a header entry are aligned with that header entry. The similarity between each remaining unaligned table content and the already aligned table contents is then calculated, and the header entry of the aligned table content with the highest similarity becomes the alignment target. This column-wise similarity calculation is repeated iteratively over the remaining unaligned table contents until the alignment is complete. The similarity may be calculated by segmenting each sentence into words to obtain a sentence vector made up of words, and then computing the cosine similarity between the sentence vectors.
A regular table row is a matrix row whose number of table contents equals the number of header entries in the header; a non-regular table row is a matrix row whose number of table contents does not equal the number of header entries in the header.
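The cosine similarity between two cell contents can be sketched as follows. This is only an illustration under stated assumptions: the tokenizer here is a simple regular-expression split, whereas Chinese text would need a proper word segmenter, and the function names `tokenize` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: runs of word characters are treated as words.
    # A real system for Chinese text would use a word segmenter instead.
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
        math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


print(cosine_similarity("0.25 yuan per minute", "0.2 yuan per minute"))
print(cosine_similarity("valid for 90 days", "0.2 yuan per minute"))  # 0.0
```

Strings that share no words score 0.0, and an empty string is treated as having zero similarity to everything rather than raising a division error.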
The table alignment comprises the following specific steps:
Step one: find all regular table rows and align them by column.
For example, the number of header entries in Table 5 is 4. Where the number of table contents in the grid area matches the number of header entries, the correspondence between header entries and table contents is obtained directly, as shown in Table 6.
Table 5 (image not reproduced)
Table 6 (image not reproduced)
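Step one can be sketched as follows, assuming each content row is given as the list of its distinct source cell contents, so that a merged cell counts once, and that header entries are unique. Both function names are hypothetical and the sample values are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def split_regular_rows(header_entries: Sequence[str],
                       content_rows: Sequence[Sequence[str]]) -> Tuple[List[int], List[int]]:
    """Return the indices of regular and non-regular rows."""
    n = len(header_entries)
    regular = [i for i, row in enumerate(content_rows) if len(row) == n]
    irregular = [i for i, row in enumerate(content_rows) if len(row) != n]
    return regular, irregular


def align_regular_row(header_entries: Sequence[str], row: Sequence[str]) -> Dict[str, str]:
    """Align a regular row with the header entries column by column."""
    if len(row) != len(header_entries):
        raise ValueError("not a regular table row")
    return dict(zip(header_entries, row))


H = ["Name", "Tariff", "Monthly rent", "Remark"]
rows = [["Card A", "0.2", "10", "none"], ["Card B", "see remark"]]
regular, irregular = split_regular_rows(H, rows)
print(regular, irregular)  # [0] [1]
print(align_regular_row(H, rows[0]))
```

The non-regular rows returned here are the ones left for the column-width and similarity steps that follow.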
Step two: taking the regular table rows found in step one as a reference and proceeding column by column, align each header entry with the table contents that occupy the same matrix column width, such as the shaded area in Table 7:
Table 7 (image not reproduced)
Step three: align the remaining matrix cells with the header entries according to content similarity.
The similarity between each unaligned table content and the aligned table contents is calculated, and the header entry of the aligned table content with the highest similarity is taken as the alignment target. As shown in Table 9, each table content in the unshaded area is compared with each table content in Table 8 and aligned to the one with the highest similarity; the column-wise similarity calculation is repeated iteratively over the remaining unaligned table contents until the alignment is complete. In the end every cell is aligned with an attribute, as shown in Table 10.
Alignment usually proceeds alternately from the two ends of the table toward the middle, because the table contents at the two ends always correspond to the header entries at the two ends.
Table 8 (image not reproduced)
Table 9 (image not reproduced)
Table 10 (image not reproduced)
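A simplified sketch of step three is given here. It is an assumption-laden illustration rather than the patented procedure: cells are processed in the order given instead of alternating from both ends toward the middle, words are split on whitespace, and each header entry receives at most one cell per row. The names `cosine` and `align_remaining` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def cosine(a: str, b: str) -> float:
    """Cosine similarity of whitespace-tokenized bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values()) * sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def align_remaining(cells: Sequence[str],
                    aligned_columns: Dict[str, List[str]]) -> Dict[str, str]:
    """Assign each unaligned cell to the header entry whose already aligned
    contents are most similar to it."""
    columns = {h: list(v) for h, v in aligned_columns.items()}  # work on a copy
    free = list(columns)  # header entries not yet used in this row
    result: Dict[str, str] = {}
    for cell in cells:
        if not free:
            break
        best = max(free, key=lambda h: max((cosine(cell, t) for t in columns[h]),
                                           default=0.0))
        result[best] = cell
        columns[best].append(cell)  # later cells also compare against this one
        free.remove(best)
    return result


aligned = {
    "Tariff": ["0.2 yuan per minute", "0.3 yuan per minute"],
    "Remark": ["valid for 90 days"],
}
print(align_remaining(["valid for 180 days", "0.25 yuan per minute"], aligned))
# {'Remark': 'valid for 180 days', 'Tariff': '0.25 yuan per minute'}
```

Appending each newly aligned cell to its column is one way to make the calculation iterative, since subsequent cells are then compared against a growing set of aligned contents.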
To implement the above method, the present invention further provides a table extraction apparatus. As shown in Fig. 2, the apparatus includes a first extraction module 21, a second extraction module 22, a model establishing module 23, and an alignment module 24, wherein:
the first extraction module 21 is configured to read the contents of a source table and store them as at least one two-dimensional table according to the contents of the source table;
the second extraction module 22 is configured to read the header of the source table and extract the header according to the number of header rows;
the model establishing module 23 is configured to determine the header entries from the extracted header and to establish a table processing model from the at least one two-dimensional table;
and the alignment module 24 is configured to align the table contents with the header entries in the table processing model using content similarity.
Specifically, the first extraction module 21 reads the contents of the source table, determines the number of header rows from the number of rows occupied by the source cells in the first row, and removes the header accordingly. It determines how many two-dimensional tables the source table is split into from the number of rows occupied by the cells in the first column, and partitions the source table into table blocks along the rows occupied by those cells, each table block corresponding to one two-dimensional table. It then traverses the contents of all table blocks, determines the number of rows and columns of each corresponding two-dimensional table, and creates and initializes each two-dimensional table. Finally, it reads the value of the empty-row flag of each table block to determine whether the block is an abnormal table or a normal table, extracts the source cell contents of the block according to the corresponding extraction rule, and stores them in the corresponding two-dimensional table.
For a normal table, the first extraction module 21 is specifically configured to: read the values of the first row of the table block to obtain the source cell contents; split each source cell's content into atomic cell contents according to the number of rows and columns the source cell occupies, and fill them into the atomic cells of the table cursor; traverse the table block, recording the row number of each row starting from zero and subtracting one from the row index attribute in the table cursor for each row traversed; assign the content of each atomic cell in the table cursor to the corresponding position of the corresponding row in the two-dimensional table; and continue until the table block has been fully traversed, yielding a complete two-dimensional table.
For an abnormal table, the first extraction module 21 is specifically configured to: traverse the table block and read the content of each source cell in each row; split each source cell's content into atomic cell contents according to the number of rows and columns the source cell occupies, and fill them into the atomic cells of the table cursor; if an empty row is encountered, assign the content of the previous row to the empty row and subtract one from the column index attribute in the table cursor; if a null value is encountered, fill a null value at the corresponding row and column index in the table cursor; fill the values obtained from the table cursor into the corresponding positions of the two-dimensional table; and continue until the table block has been fully traversed, yielding a complete two-dimensional table.
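The splitting of source cells into atomic cells can be sketched as follows. This is a minimal illustration that assumes each source cell is given as a (text, rowspan, colspan) tuple in the order it appears in the HTML rows. It does not model the table cursor or the empty-row handling described above, and the name `expand_to_grid` is hypothetical.

```python
from typing import List, Optional, Tuple

Cell = Tuple[str, int, int]  # (text, rows occupied, columns occupied)


def expand_to_grid(rows: List[List[Cell]]) -> List[List[str]]:
    """Copy each source cell's content into every atomic cell it covers."""
    grid: List[List[Optional[str]]] = []
    for r, row in enumerate(rows):
        while len(grid) <= r:
            grid.append([])
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a cell spanning down from above.
            while c < len(grid[r]) and grid[r][c] is not None:
                c += 1
            for dr in range(rowspan):
                while len(grid) <= r + dr:
                    grid.append([])
                line = grid[r + dr]
                while len(line) < c + colspan:
                    line.append(None)
                for dc in range(colspan):
                    line[c + dc] = text
            c += colspan
    width = max((len(line) for line in grid), default=0)
    # Replace unfilled positions with empty strings and pad short rows.
    return [[v if v is not None else "" for v in line] + [""] * (width - len(line))
            for line in grid]


source = [
    [("Plan", 2, 1), ("Fee", 1, 2)],
    [("Caller", 1, 1), ("Callee", 1, 1)],
]
print(expand_to_grid(source))
# [['Plan', 'Fee', 'Fee'], ['Plan', 'Caller', 'Callee']]
```

Repeating the content in every covered position is what later lets a row-spanning or column-spanning cell be matched against header entries by column width.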
The second extraction module 22 is specifically configured to read and record the row start position and the number of rows and columns occupied by each source cell in the HTML tags of a source table on the Web, determine the number of header rows from the maximum number of rows, max, occupied by the source cells in the first row, normalize the header into a two-dimensional table, and extract the title in each row of source cells from the two-dimensional table according to the number of header rows. The number of header rows is determined from max as follows:
when max is greater than 1, the number of header rows is max;
when max is 1 and the first row has only one source cell, the number of rows occupied by the source cells in the second row is recorded as max', and the number of header rows is max' + 1;
when max is 1 and the first row has multiple source cells, the number of header rows is 1.
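The three rules above can be written as a small function. This is a sketch under assumptions: the inputs are the row spans (for example HTML `rowspan` values) of the source cells in the first and second rows, max' is taken to be the largest row span in the second row, and the function name `header_row_count` is hypothetical.

```python
from typing import Sequence


def header_row_count(first_row_spans: Sequence[int],
                     second_row_spans: Sequence[int] = ()) -> int:
    """Number of header rows, from the row spans of the first two rows."""
    if not first_row_spans:
        raise ValueError("the first row has no source cells")
    max_span = max(first_row_spans)
    if max_span > 1:
        # Rule 1: a first-row cell spans several rows.
        return max_span
    if len(first_row_spans) == 1:
        # Rule 2: a single title cell; the header continues in the second row.
        return max(second_row_spans, default=0) + 1
    # Rule 3: several cells in the first row, each occupying one row.
    return 1


print(header_row_count([2, 1, 1]))     # 2
print(header_row_count([1], [1, 1]))   # 2
print(header_row_count([1, 1, 1]))     # 1
```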
When the number of header rows is 1, the second extraction module 22 directly extracts the titles in the source cells of the first row;
when the number of header rows is 2, the second extraction module 22 extracts the first row and the second row as row titles and row subtitles, and appends all row subtitles to the row titles, joined with "#";
when the number of header rows is 3, the second extraction module 22 extracts the first row, the second row, and the third row as titles, subtitles, and grandchild titles, appends all subtitles after the titles with "#", and appends all grandchild titles after the subtitles with "#".
The model establishing module 23 is specifically configured to define each cell in the header as a header entry when the header consists of a single header row, and to obtain the header entries by concatenating the header rows in hierarchical order from top to bottom when the header contains several header rows. The table processing model is established as follows:
(1) The header of a table is defined as a vector $H$ and each header entry as $h$. A header contains several header entries and is written $H = \langle h_1, h_2, \ldots, h_n \rangle$, where $n \in [1, \text{number of header entries}]$.
(2) The table contents of a table are defined as $D$; the content part is obtained from the at least one two-dimensional table formed from the contents of the source table. The table contents are divided into rows, each row is defined as $d$, and the table contents are written $D = \langle d_1, d_2, \ldots, d_n \rangle$. The matrix cell in the $i$-th row and $j$-th column of the table content matrix is defined as $d_{ij}$, and the $i$-th row is $d_i = \langle d_{i1}, d_{i2}, \ldots, d_{in} \rangle$, where $n \in [1, \text{number of header entries}]$.
After the table processing model is obtained, the alignment module 24 is specifically configured to identify the regular table rows among the matrix rows of the table content matrix according to the number of header entries in the table processing model, and to align the table contents of each regular table row with the header entries. For the remaining non-regular table rows, the aligned regular table rows serve as a reference: proceeding column by column, the table contents in a non-regular table row that occupy the same matrix column width as a header entry are aligned with that header entry. The similarity between each remaining unaligned table content and the already aligned table contents is then calculated, and the header entry of the aligned table content with the highest similarity becomes the alignment target. This column-wise similarity calculation is repeated iteratively over the remaining unaligned table contents until the alignment is complete.
If the table extraction method of the embodiment of the invention is implemented as a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, or the part of them that contributes to the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the invention also provides a computer storage medium, wherein a computer program is stored, and the computer program is used for executing the table extraction method of the embodiment of the invention.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. that are within the spirit and principle of the present invention should be included in the present invention.

Claims (12)

1. A method of table extraction, the method comprising:
reading the content of the source table, and storing the content into at least one two-dimensional table according to the content of the source table;
reading and recording the initial position of the row, the number of rows and the number of columns of each source cell in a hypertext markup language (HTML) label of a source table, determining the number of rows of a header according to the number of rows max occupied by the first row of source cells, normalizing the header into a two-dimensional table, and extracting the title in each row of source cells from the two-dimensional table according to the number of rows of the header to obtain the extracted header;
determining a table head item according to the extracted table head, and establishing a table processing model according to the at least one two-dimensional table;
the table contents are aligned with the table header entries in the table processing model using the content similarity.
2. The method of claim 1, wherein reading the contents of the source table, and storing the contents of the source table as at least one two-dimensional table according to the contents of the source table comprises:
reading the content of a source table, determining the number of head lines according to the number of lines occupied by a first line of source cells, removing the head according to the number of head lines, determining the number of the source table which is split into two-dimensional tables according to the number of lines occupied by a first column of cells, partitioning the source table according to the number of lines occupied by the first column of cells, enabling each table block to correspond to one two-dimensional table, traversing the contents of all the table blocks, determining the number of lines and the number of columns corresponding to the two-dimensional tables, creating and initializing each two-dimensional table, reading the value of a table empty line mark of each table block, determining whether each table block is an abnormal table or a normal table, extracting the source cell content of each table block according to a corresponding extraction rule, and storing the source cell content into the corresponding two-dimensional table.
3. The method of claim 1, wherein extracting the title in each row of source cells from the two-dimensional table according to the number of header rows comprises:
when the number of header rows is 1, directly extracting the titles in the source cells of the first row as the header;
when the number of header rows is 2, extracting the first row and the second row as row titles and row subtitles, and appending all row subtitles to the row titles, joined with "#";
when the number of header rows is 3, extracting the first row, the second row and the third row as titles, subtitles and grandchild titles, appending all subtitles after the titles with "#", and appending all grandchild titles after the subtitles with "#".
4. The method of claim 1, wherein determining the header entry according to the extracted header comprises:
when the header only consists of one header row, each table cell in the header is defined as a header item;
and when the header comprises a plurality of header rows, splicing according to the hierarchical sequence from top to bottom to obtain a header item.
5. The method of claim 4, wherein said building a form processing model from said at least one two-dimensional form comprises:
defining a header as a vector $H$ and each header entry as $h$, wherein a header contains a plurality of header entries and is denoted $H = \langle h_1, h_2, \ldots, h_n \rangle$, wherein $n \in [1, \text{number of header entries}]$;
defining the table contents as $D$, obtaining the content part of the table contents from the at least one two-dimensional table formed according to the contents of the source table, dividing the table contents into rows, defining each row as $d$, denoting the table contents as $D = \langle d_1, d_2, \ldots, d_n \rangle$, defining the matrix cell in the $i$-th row and $j$-th column of the table content matrix as $d_{ij}$, and defining the $i$-th row as $d_i = \langle d_{i1}, d_{i2}, \ldots, d_{in} \rangle$, wherein $n \in [1, \text{number of header entries}]$.
6. The method of claim 5, wherein aligning table contents with table entries in a table processing model using content similarity comprises:
according to the number of the table head items of the table processing model, searching a regular table row of each matrix row in a table content matrix, and aligning the table content corresponding to the regular table row with the table head items; the regular table row refers to a matrix row with the number of table contents equal to the number of table head entries in the table head;
for the rest of the non-regular table rows, taking the aligned regular table rows as a reference, searching table contents occupying the same matrix column width in the non-regular table rows in the direction of columns to be aligned with the header items, performing similarity calculation on the rest of the non-aligned table contents and the aligned table contents, finding the table contents with the highest similarity, aligning the header items corresponding to the table contents as alignment target header items, and iteratively performing similarity calculation in the unit of columns on the rest of the non-aligned table contents to complete alignment; the non-regular table row refers to a matrix row with the number of table contents not equal to the number of table head entries in the table head.
7. A table extraction apparatus, comprising: a first extraction module, a second extraction module, a model establishing module and an alignment module; wherein:
the first extraction module is used for reading the content of the source table and storing the content into at least one two-dimensional table according to the content of the source table;
the second extraction module is used for reading and recording the row starting position, the number of lines and the number of columns of each source cell in an HTML (hypertext markup language) label of the source table, determining the number of head lines according to the number of lines max occupied by the first line of source cells, normalizing the head lines into a two-dimensional table, and extracting the title in each line of source cells from the two-dimensional table according to the number of head lines;
the model establishing module is used for determining a table head item according to the extracted table head and establishing a table processing model according to the at least one two-dimensional table;
and the alignment module is used for aligning the table content and the table head item in the table processing model by using the content similarity.
8. The apparatus according to claim 7, wherein the first extraction module is specifically configured to read contents of a source table, determine a number of rows of a header according to the number of rows occupied by a first row of source cells, remove the header according to the number of rows of the header, determine the number of the source table split into two-dimensional tables according to the number of rows occupied by a first column of cells, block the source table according to the rows occupied by the first column of cells, each table block corresponds to one two-dimensional table, traverse contents of all table blocks, determine the number of rows and the number of columns of the corresponding two-dimensional table, create and initialize each two-dimensional table, read a value of a table empty flag of each table block, determine whether each table block is an abnormal table or a normal table, extract contents of source cells of the table block according to a corresponding extraction rule, and store the contents of the source cells in the corresponding two-dimensional table.
9. The apparatus according to claim 7, wherein the second extraction module is specifically configured to, when the number of rows of the header is 1, directly extract the header from the source cells in the first row;
when the number of the head lines is 2, extracting the first line and the second line into a line header and a line subtitle, and connecting all the line subtitles to the line header in a # mode;
when the number of rows in the header is 3, the first row, the second row and the third row are extracted as a title, a subtitle and a grandchild title, and the subtitles are all connected to the back of the title in # and the grandchild titles are all connected to the back of the subtitles in # respectively.
10. The apparatus according to claim 7, wherein the model building module is specifically configured to define each table cell in the header as a header entry when the header consists of only one header row; and when the header comprises a plurality of header rows, splicing according to the hierarchical sequence from top to bottom to obtain a header item.
11. The apparatus according to claim 10, wherein the model establishing module is specifically configured to define a header as a vector $H$ and each header entry as $h$, wherein a header contains a plurality of header entries and is denoted $H = \langle h_1, h_2, \ldots, h_n \rangle$, wherein $n \in [1, \text{number of header entries}]$;
and to define the table contents as $D$, obtain the content part of the table contents from the at least one two-dimensional table formed according to the contents of the source table, divide the table contents into rows, define each row as $d$, denote the table contents as $D = \langle d_1, d_2, \ldots, d_n \rangle$, define the matrix cell in the $i$-th row and $j$-th column of the table content matrix as $d_{ij}$, and define the $i$-th row as $d_i = \langle d_{i1}, d_{i2}, \ldots, d_{in} \rangle$, wherein $n \in [1, \text{number of header entries}]$.
12. The apparatus according to claim 11, wherein the alignment module is specifically configured to look up a regular table row of each matrix row in a table content matrix according to the number of table head entries of the table processing model, and align table contents corresponding to the regular table row with the table head entry; for the rest of the non-regular table rows, taking the aligned regular table rows as a reference, searching table contents occupying the same matrix column width in the non-regular table rows in the direction of columns to be aligned with the header items, performing similarity calculation on the rest of the non-aligned table contents and the aligned table contents, finding the table contents with the highest similarity, aligning the header items corresponding to the table contents as alignment target header items, and iteratively performing similarity calculation in the unit of columns on the rest of the non-aligned table contents to complete alignment; the regular table row refers to a matrix row with the number of table contents equal to the number of table head entries in the table head; the non-regular table row refers to a matrix row with the number of table contents not equal to the number of table head entries in the table header.
CN201510205847.8A 2015-04-27 2015-04-27 Table extraction method and device Active CN106156239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510205847.8A CN106156239B (en) 2015-04-27 2015-04-27 Table extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510205847.8A CN106156239B (en) 2015-04-27 2015-04-27 Table extraction method and device

Publications (2)

Publication Number Publication Date
CN106156239A CN106156239A (en) 2016-11-23
CN106156239B true CN106156239B (en) 2020-06-30

Family

ID=57347527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510205847.8A Active CN106156239B (en) 2015-04-27 2015-04-27 Table extraction method and device

Country Status (1)

Country Link
CN (1) CN106156239B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845467B (en) * 2016-12-14 2019-07-19 北京航天测控技术有限公司 Aeronautical maintenance work card action recognition methods based on optical character recognition technology
CN106777259A (en) * 2016-12-28 2017-05-31 深圳市华傲数据技术有限公司 The method and device of structured message in adaptive decimation HTML Table labels
CN106709032B (en) * 2016-12-29 2019-12-20 深圳市华傲数据技术有限公司 Method and device for extracting structured information in electronic form document
CN107943968B (en) * 2017-11-28 2020-05-12 北京筑业志远软件开发有限公司 Structured processing method for construction data table data
CN109284495B (en) * 2018-11-03 2023-02-07 上海犀语科技有限公司 Method and device for performing table-free line table cutting on text
CN111310082B (en) * 2018-12-11 2023-09-29 杭州海康威视系统技术有限公司 Page display method and device
CN110083815B (en) * 2019-05-07 2023-05-23 中冶赛迪信息技术(重庆)有限公司 Synonymous variable identification method and system
CN110188107B (en) * 2019-06-05 2020-05-01 中科鼎富(北京)科技发展有限公司 Method and device for extracting information from table
CN110362620B (en) * 2019-07-11 2021-04-06 南京烽火星空通信发展有限公司 Table data structuring method based on machine learning
CN110502516B (en) * 2019-08-22 2021-10-19 深圳前海环融联易信息科技服务有限公司 Table data analysis method and device, computer equipment and storage medium
CN111401010B (en) * 2020-03-25 2023-07-28 苏州机数芯微科技有限公司 Form extraction method based on machine learning
CN111797356B (en) * 2020-07-06 2023-08-08 上海冰鉴信息科技有限公司 Webpage form information extraction method and device
CN112395418B (en) * 2020-11-26 2021-09-03 上海携宁计算机科技股份有限公司 Method and device for extracting target object in webpage and electronic equipment
CN113656592B (en) * 2021-07-22 2022-09-27 北京百度网讯科技有限公司 Data processing method and device based on knowledge graph, electronic equipment and medium
CN114428839A (en) * 2022-01-27 2022-05-03 北京百度网讯科技有限公司 Data processing method, paragraph text determination device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101556606A (en) * 2009-05-20 2009-10-14 同方知网(北京)技术有限公司 Data mining method based on extraction of Web numerical value tables
CN101615193A (en) * 2009-07-07 2009-12-30 北京大学 A kind of based on the integrated inquiry system of encyclopaedia data extract
CN101794280A (en) * 2010-03-11 2010-08-04 北京中科辅龙计算机技术股份有限公司 Form automatic generation method and system based on form template set

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8239750B2 (en) * 2008-09-15 2012-08-07 Erik Thomsen Extracting semantics from data

Also Published As

Publication number Publication date
CN106156239A (en) 2016-11-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant