CN117291152A - Table extraction method and apparatus - Google Patents

Table extraction method and apparatus

Info

Publication number
CN117291152A
Authority
CN
China
Prior art keywords
content, cell, text, target, original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210692042.0A
Other languages
Chinese (zh)
Inventor
张治强
熊龙飞
段纪伟
黄旭进
侯冰基
邓灿赏
张炜杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Wuhan Kingsoft Office Software Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Wuhan Kingsoft Office Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Office Software Inc, Zhuhai Kingsoft Office Software Co Ltd, Wuhan Kingsoft Office Software Co Ltd
Priority to CN202210692042.0A
Publication of CN117291152A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/151 Transformation
    • G06F 40/154 Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/174 Form filling; Merging
    • G06F 40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F 40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a table extraction method and apparatus. The method comprises the following steps: identifying an original table to be extracted in an image, and extracting a cell structure of the original table, wherein the content in the original table is not editable; extracting text attributes in the original table; converting the cell structure and the text attributes into a hypertext markup language description; and parsing the hypertext markup language description to obtain a target table, wherein the content in the target table is editable. The invention solves the technical problem that table data in an image cannot be converted into editable table data.

Description

Table extraction method and apparatus
Technical Field
The invention relates to the field of computers, and in particular to a table extraction method and apparatus.
Background
In the prior art, for images that contain tables, such as pictures, photographs, and Portable Document Format (PDF) files, the tables therein are not editable because of the nature of the image format. If a user needs to obtain the data of a table in an image, a table file must be created manually, and the data must then be typed into that file by hand while referring to the table in the image.
That is, in the prior art, the data of a table in an image cannot be converted into editable table data, so the user cannot quickly obtain the data of the table in the image.
Disclosure of Invention
The embodiments of the invention provide a table extraction method and apparatus, which at least solve the technical problem that table data in an image cannot be converted into editable table data.
According to an aspect of an embodiment of the present invention, there is provided a method for extracting a table, including: identifying an original table to be extracted in an image, and extracting a cell structure of the original table, wherein the content in the original table is not editable; extracting text attributes in the original table; converting the cell structure and the text attribute into a hypertext markup language description; and analyzing the hypertext markup language description to obtain a target table, wherein the content in the target table can be edited.
As an optional example, the identifying the original table to be extracted in the image, and extracting the cell structure of the original table includes: identifying each table grid line of the original table in the image; combining the identified table grid lines to obtain the position of each cell in the original table and the relationships between the cells in the original table; and taking the positions of the cells in the original table and the relationships between the cells as the cell structure of the original table.
As an optional example, the identifying each table line of the original table in the image includes: identifying each straight line in the image to obtain a plurality of straight lines; and selecting a rectangle with the largest area from the rectangles formed by the plurality of straight lines, and taking the side of the rectangle with the largest area and the straight line surrounded by the rectangle with the largest area as the table grid line.
As an optional example, the extracting text attributes in the original table includes: identifying each character in the image, and taking the identified character as text content of the original table when the character is identified and the character is positioned in the original table; determining a target cell where each text content is located; identifying a location of each of the text content in the target cell; and taking the text content, a target cell where the text content is located and the position of the text content in the target cell as the text attribute of the original table.
As an optional example, the identifying the location of each of the text contents in the target cell includes: taking each text content as the current text content, and executing the following operations on the current text content: identifying a distance between the current text content and a boundary of the target cell; and determining the position of the current text content in the target cell according to the distance between the current text content and the boundary of the target cell.
As an alternative example, the converting the cell structure and the text attribute into a hypertext markup language description includes: inserting the text content in the text attribute into the cell structure according to a target cell where the text content indicated by the text attribute is located and the position of the text content in the target cell, so as to obtain a data structure comprising the text content and the cell structure; reading the content of the data structure by rows, and converting the content of the data structure into the hypertext markup language description.
As an alternative example, the reading the content of the data structure by rows, converting the content of the data structure into the hypertext markup language description includes: calculating the number of rows or columns occupied by the merged cells when the merged cells exist in the data structure; in the process of sequentially reading the cells of each row, under the condition of reading the merged cells for the first time, converting the content of the data structure of the merged cells into the hypertext markup language description; when the merged cell is read again, the merged cell is skipped according to the number of rows or the number of columns.
As an optional example, parsing the hypertext markup language description to obtain the target table includes: creating an editable table file; and filling the content of the original table into the editable table file according to the hypertext markup language description to obtain the target table.
As an optional example, the filling the content of the original table into the editable table file according to the hypertext markup language description to obtain the target table includes: reading code content in the hypertext markup language description line by line; under the condition that a preset code is read, acquiring a target code after the preset code, wherein the target code records table information of the target table; and writing the target table in the editable table file according to the table information recorded in the target code.
According to another aspect of an embodiment of the present invention, there is provided an extraction apparatus for a form, including: the identification module is used for identifying an original table to be extracted in the image, and extracting a cell structure of the original table, wherein the content in the original table is not editable; the extraction module is used for extracting text attributes in the original table; the conversion module is used for converting the cell structure and the text attribute into a hypertext markup language description; and the analysis module is used for analyzing the hypertext markup language description to obtain a target table, wherein the content in the target table can be edited.
As an alternative example, the above-mentioned identification module includes: a first identifying unit for identifying each table grid line of the original table in the image; a combination unit, configured to combine the identified table grid lines to obtain the position of each cell in the original table and the relationships between the cells in the original table; and a first determining unit, configured to take the positions of the cells in the original table and the relationships between the cells as the cell structure of the original table.
As an alternative example, the first identifying unit includes: the identification subunit is used for identifying each straight line in the image to obtain a plurality of straight lines; and a determination subunit configured to select a rectangle with a largest area from the rectangles formed by the plurality of straight lines, and set a side of the rectangle with the largest area and a straight line surrounded by the rectangle with the largest area as the grid line.
As an optional example, the extracting module includes: a second recognition unit configured to recognize each character in the image, and in a case where the character is recognized and the character is located within the original form, to use the recognized character as text content of the original form; a second determining unit, configured to determine a target cell where each text content is located; a third recognition unit for recognizing a position of each of the text contents in the target cell; and a third determining unit, configured to use the text content, a target cell where the text content is located, and a position of the text content in the target cell as a text attribute of the original table.
As an optional example, the third identifying unit includes: the processing subunit is used for taking each text content as the current text content and executing the following operations on the current text content: identifying a distance between the current text content and a boundary of the target cell; and determining the position of the current text content in the target cell according to the distance between the current text content and the boundary of the target cell.
As an alternative example, the above-mentioned conversion module includes: an inserting unit, configured to insert the text content in the text attribute into the cell structure according to a target cell where the text content indicated by the text attribute is located and a position of the text content in the target cell, to obtain a data structure including the text content and the cell structure; and the reading unit is used for reading the content of the data structure according to the line and converting the content of the data structure into the hypertext markup language description.
As an alternative example, the reading unit includes: a calculating subunit, configured to calculate, when there is a merged cell in the data structure, a number of rows or columns occupied by the merged cell; a conversion subunit, configured to convert, in a process of sequentially reading cells in each row, contents of a data structure of the merged cell into the hypertext markup language description in a case of first reading the merged cell; when the merged cell is read again, the merged cell is skipped according to the number of rows or the number of columns.
As an optional example, the parsing module includes: a creation unit for creating an editable form file; and a filling unit, configured to obtain the target table by filling the content of the original table into the editable table file according to the hypertext markup language description.
As an optional example, the filling unit includes: a reading subunit, configured to read the code content in the hypertext markup language description line by line; an obtaining subunit, configured to obtain, when a preset code is read, a target code after the preset code, where the target code records table information of the target table; and a writing subunit configured to write the target table in the editable table file according to the table information recorded in the target code.
According to still another aspect of the embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program when executed by a processor performs the method of extracting a table as described above.
According to still another aspect of the embodiments of the present invention, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the above-described table extraction method by the above-described computer program.
In the embodiment of the invention, an original table to be extracted in an image is identified, and a cell structure of the original table is extracted, wherein the content in the original table is not editable; text attributes in the original table are extracted; the cell structure and the text attributes are converted into a hypertext markup language description; and the hypertext markup language description is parsed to obtain a target table, wherein the content in the target table is editable. In this method, when the original table in the image is identified, the cell structure of the original table is extracted first, the text attributes of the original table are then extracted, the cell structure and the text attributes are converted into the hypertext markup language description, and finally the hypertext markup language description is parsed to obtain the target table. The purpose of converting the table data in the image into editable table data is thereby achieved, which solves the technical problem that table data in an image cannot be converted into editable table data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of an alternative form extraction method according to an embodiment of the invention;
FIG. 2 is an original table schematic diagram of an alternative table extraction method according to an embodiment of the invention;
FIG. 3 is an extracted form line schematic diagram of an alternative form extraction method according to an embodiment of the invention;
FIG. 4 is an original table diagram of an alternative table extraction method according to an embodiment of the invention;
FIG. 5 is a schematic diagram of an HTML description in an alternative table extraction method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a target table of an alternative table extraction method according to an embodiment of the invention;
FIG. 7 is a schematic diagram of an alternative form extraction apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to a first aspect of an embodiment of the present invention, there is provided a method for extracting a table, optionally, as shown in fig. 1, the method includes:
S102, identifying an original table to be extracted in an image, and extracting a cell structure of the original table, wherein the content in the original table is not editable;
S104, extracting text attributes in the original table;
S106, converting the cell structure and the text attributes into a hypertext markup language description;
S108, parsing the hypertext markup language description to obtain a target table, wherein the content in the target table is editable.
Alternatively, the image in this embodiment may be a picture, a photograph, or a file in Portable Document Format (PDF), such as a scanned PDF file. The image contains an original table to be extracted, and the content in the original table is not editable. Note that "not editable" in this embodiment means that the content of the table in the image cannot be changed, i.e., it cannot be modified, added to, or deleted, whereas "editable" means that the content of the target table can be changed, i.e., it can be modified, added to, or deleted.
Alternatively, the present embodiment may be applied in a process of identifying a form in an image. One or more tables may be included in the image, and a portion of the table or the entire table may be included. The table in the image cannot be edited. By adopting the method, the table in the image is extracted as the editable target table.
In the embodiment of the invention, when the table in the image is identified, the cell structure of the table is extracted first; its purpose is to record the positions of the cells in the table and the relationships between them. The text attributes of the table are then extracted; their purpose is to record the text in the table and the relationship between the text and the cells. The cell structure and the text attributes are then converted into a hypertext markup language description, and the hypertext markup language description is parsed to obtain the editable target table. The target table may be stored as a table file or as a compressed package.
With this method, an editable target table is obtained by parsing the non-editable table in the image, and the user can edit the target table or use the data in it, thereby achieving the purpose of converting the table data in the image into editable table data.
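As an illustration, the cell structure and the text attributes described above can be represented by simple data structures such as the following Python sketch; the class and field names are illustrative assumptions rather than names used by the embodiment:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Cell:
    # Position of the cell in the image: (left, top, right, bottom) pixel coordinates.
    box: Tuple[int, int, int, int]
    # Relationship of the cell to the grid: its row/column index and, for merged
    # cells, the number of rows/columns it occupies.
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1

@dataclass
class TextAttribute:
    # Recognized text content.
    text: str
    # Index of the target cell the text belongs to.
    cell_index: int
    # Position of the text within the target cell, e.g. "left", "center", "right".
    position: str
```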
As an alternative example, identifying an original table to be extracted in an image, extracting a cell structure of the original table includes:
Identifying each form line of the original form in the image;
combining the identified table lines to obtain the position of each cell in the original table and the relationships between the cells in the original table;
and taking the positions of the cells in the original table and the relationships between the cells as the cell structure of the original table.
Alternatively, in this embodiment, when the original table in the image is identified, each table line in the original table may be identified. The identified table lines are connected, spliced and de-duplicated, and then combined to obtain a plurality of cells. The positions of the cells and the relationships between the cells are recorded, so that the cell structure of the original table is obtained. In this way, the positions of the cells and the relationships between the cells can be obtained by identifying the lines of the original table, and the target table is then obtained from them, thereby achieving the purpose of converting the table data in the image into editable table data.
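A minimal sketch of this combination step, under the simplifying assumption that the table lines have already been reduced to full-length horizontal and vertical segments (the function and field names are illustrative only):

```python
from typing import List

def build_cell_structure(h_lines: List[int], v_lines: List[int]) -> List[dict]:
    """Combine the y-coordinates of horizontal grid lines and the x-coordinates of
    vertical grid lines into cells, recording each cell's position and grid relation."""
    ys = sorted(set(h_lines))
    xs = sorted(set(v_lines))
    cells = []
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            cells.append({
                "row": r, "col": c,                           # relationship between cells
                "box": (xs[c], ys[r], xs[c + 1], ys[r + 1]),  # position of the cell
            })
    return cells

# Example: a 2x2 table whose grid lines sit at these pixel coordinates.
print(build_cell_structure([0, 50, 100], [0, 120, 240]))
```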
As an alternative example, identifying each table line of the original table in the image includes:
Identifying each straight line in the image to obtain a plurality of straight lines;
selecting the rectangle with the largest area from the rectangles formed by the plurality of straight lines, and taking the sides of the rectangle with the largest area and the straight lines enclosed by it as the table grid lines.
Optionally, in this embodiment, when the table grid lines of the original table in the image are identified, the curvature of every identifiable line may be measured. If the curvature is greater than a preset value, the line is treated as a curve and is not used as a table grid line of the original table. If the curvature is smaller than or equal to the preset value, the line is treated as a straight line; the sides of the rectangle with the largest area among all the straight lines are taken as table grid lines of the original table, and the straight lines enclosed by that rectangle are also taken as table grid lines of the original table.
In this process, if the grid lines of the table extend to the edge of the page in the image, the edge of the image is treated as one grid line of the original table. The edge here may be the boundary of a virtual sheet of the image, or a dividing line at a certain distance from that boundary; the dividing line separates the area of the virtual sheet that can be filled with content from the area that cannot.
In this embodiment, when the lines of the original table are identified, lines that do not belong to the original table are screened out according to the curvature of the lines and the largest rectangle enclosed by the straight lines, thereby improving the accuracy of identifying the original table.
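The embodiment does not prescribe a particular line-detection algorithm; as one possible sketch, OpenCV's probabilistic Hough transform can be used to find straight segments, with the "largest rectangle" approximated by the bounding box of all detected segments (cv2 and numpy are assumed available; all names here are illustrative):

```python
import cv2
import numpy as np

def detect_straight_lines(image_path: str, min_length: int = 50) -> list:
    """Detect straight line segments; curved strokes are unlikely to be returned
    because the Hough transform only fits straight segments."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=min_length, maxLineGap=5)
    return [] if lines is None else [tuple(l[0]) for l in lines]

def keep_lines_in_largest_rectangle(lines: list) -> tuple:
    """Approximate the largest-area rectangle as the bounding box of all detected
    segments, and keep only the segments that fall inside it."""
    xs = [x for x1, y1, x2, y2 in lines for x in (x1, x2)]
    ys = [y for x1, y1, x2, y2 in lines for y in (y1, y2)]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)
    inside = [l for l in lines
              if left <= min(l[0], l[2]) and max(l[0], l[2]) <= right
              and top <= min(l[1], l[3]) and max(l[1], l[3]) <= bottom]
    return (left, top, right, bottom), inside
```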
As an alternative example, extracting text attributes in the original form includes:
identifying each character in the image, and taking the identified character as the text content of the original table when the character is identified and the character is positioned in the original table;
determining a target cell where each text content is located;
identifying a location of each text content in the target cell;
and taking the text content, the target cell where the text content is located and the position of the text content in the target cell as the text attribute of the original table.
Alternatively, in this embodiment, when the text content in the image is recognized, the position of each character may be recognized. If a character lies outside the original table, it is not considered to belong to the original table; if it lies within the original table, it is considered to be text content of the original table.
A character may be recognized if it is located within the original table, and skipped if it is located outside the original table. Whether a character to be recognized (for example, a character, a letter or a symbol) lies inside or outside the original table can be determined from its positional relationship with the rectangle of largest area formed by the grid lines of the recognized original table: characters that are not enclosed by that rectangle are considered to be outside the original table.
In this embodiment, when recognizing the characters in the image, it may be determined whether the characters are located in the original table, so that the characters outside the original table may be filtered out, and the accuracy and efficiency of recognizing the characters in the original table are improved.
When characters are recognized, letters and symbols can be recognized, and multiple languages can be recognized. During recognition, all text content may be recognized in one pass by traversing line by line in order, or the content may be traversed by type: one type of text is recognized first, the cell in which that text is located is recorded, and the order relationship between the pieces of text is recorded; if text of other types lies between them, the lengths of the other types of text are recorded. For example, a first pass may recognize Chinese text, a second pass may recognize alphabetic text, and the remaining characters are then recognized; through multiple passes, all types of text content are recognized. The text content of the multiple types is then combined according to the recorded positions and the lengths of the other types of text, so as to obtain the recognized text content.
After the text content is identified, the location of each text content in the target cell is recorded, thereby determining the text attribute.
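As a simple sketch of the matching between recognized text and the cell structure, each piece of text can be assigned to the cell whose rectangle contains the centre of its bounding box; text that falls outside every cell lies outside the original table and is discarded (the data layout here is an assumption for illustration):

```python
def assign_text_to_cells(texts, cells):
    """texts: list of (string, (x1, y1, x2, y2)) bounding boxes from recognition;
    cells: list of (left, top, right, bottom) cell rectangles."""
    matched = []
    for text, (x1, y1, x2, y2) in texts:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2      # centre of the text box
        for idx, (left, top, right, bottom) in enumerate(cells):
            if left <= cx <= right and top <= cy <= bottom:
                matched.append({"text": text, "cell": idx, "box": (x1, y1, x2, y2)})
                break                               # a text box belongs to one cell
    return matched

# One cell plus two text boxes, one of which lies outside the table and is dropped.
cells = [(0, 0, 100, 40)]
texts = [("total", (10, 10, 40, 30)), ("page 1", (150, 10, 190, 30))]
print(assign_text_to_cells(texts, cells))
```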
In this embodiment, when text content is identified, every character, letter and symbol is identified, so that all text content can be identified and the completeness of the identified text content is ensured. During recognition, the content outside the original table is distinguished from the content inside the original table, and finally the position of the text content is recorded to obtain the text attributes, thereby ensuring the accuracy of the recognized characters and of the generated text attributes.
As an alternative example, identifying the location of each text content in the target cell includes:
taking each text content as the current text content, and executing the following operations on the current text content:
identifying a distance between the current text content and a boundary of the target cell;
and determining the position of the current text content in the target cell according to the distance between the current text content and the boundary of the target cell.
Optionally, in this embodiment, when identifying the position of the text content in the target cell, the distances between the text content and the four boundaries of the target cell may be identified, so as to accurately locate the position of the text content in the target cell. If the distance between the text content and the upper boundary is the upper distance, the distance between the text content and the lower boundary is the lower distance, the distance between the text content and the left boundary is the left distance, the distance between the text content and the right boundary is the right distance, and the position of the text content in the target cell is determined according to at least one of the upper distance, the lower distance, the left distance and the right distance.
Through the embodiment, the position of the text content in the target cell can be accurately identified, the accuracy of the identified position of the text content is ensured, and the accuracy of the determined text attribute is further ensured.
As an optional example, the identifying the distance between the current text content and the boundary of the target cell includes:
identifying the distances between the current text content and the upper, lower, left and right boundaries of the target cell to obtain an upper distance, a lower distance, a left distance and a right distance;
determining that the text content is top-aligned in the target cell if the upper distance is equal to a first threshold;
determining that the text content is bottom-aligned in the target cell if the lower distance is equal to the first threshold;
determining that the text content is left-aligned in the target cell if the left distance is equal to the first threshold;
determining that the text content is right-aligned in the target cell if the right distance is equal to the first threshold;
determining that the text content is horizontally centered in the target cell if the left distance is equal to the right distance;
and determining that the text content is vertically centered in the target cell if the upper distance is equal to the lower distance.
In this embodiment, when determining the position of the current text content in the target cell, the distance between the current text content and the boundary of the target cell may be checked, and then the position of the text content in the target cell may be determined according to the distance. The distance is the shortest distance, and for example, may be the distance from the leftmost edge of the text content to the left edge of the target cell. Then, according to the distance, the position of the text content in the target cell is determined and recorded.
In this embodiment, the position of the text content in the target cell is determined by the distance, so that when the number of text contents is large, the specific position of the text content in the target cell can be rapidly positioned, and the efficiency of positioning the position of the text content is improved.
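One possible sketch of this position determination, using the distances from a text box to the four cell boundaries and a small tolerance in place of the first threshold (the function name and threshold value are illustrative assumptions):

```python
def text_alignment(text_box, cell_box, threshold=2):
    """Determine the position of the text in its target cell from the distances
    between the text box and the upper, lower, left and right cell boundaries."""
    tx1, ty1, tx2, ty2 = text_box
    cx1, cy1, cx2, cy2 = cell_box
    left, right = tx1 - cx1, cx2 - tx2      # left distance, right distance
    top, bottom = ty1 - cy1, cy2 - ty2      # upper distance, lower distance

    horizontal = ("center" if abs(left - right) <= threshold
                  else "left" if left <= threshold
                  else "right" if right <= threshold
                  else "unknown")
    vertical = ("middle" if abs(top - bottom) <= threshold
                else "top" if top <= threshold
                else "bottom" if bottom <= threshold
                else "unknown")
    return horizontal, vertical

print(text_alignment((12, 5, 48, 15), (10, 0, 110, 20)))   # ('left', 'middle')
```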
As an alternative example, converting the cell structure and text attributes into a hypertext markup language description includes:
inserting the text content in the text attribute into the cell structure according to the target cell where the text content indicated by the text attribute is located and the position of the text content in the target cell, so as to obtain a data structure comprising the text content and the cell structure;
Reading the content of the data structure on a row-by-row basis, converting the content of the data structure into a hypertext markup language description.
In this embodiment, after the cell structure and the text attribute are obtained, the cell structure records the positions of the cells and the relationships between the cells, and the text attribute records the text content and the positions of the text content in the cells, so that the text content can be inserted into the cell structure according to the information to obtain the data structure. The original table is converted into a hypertext markup language description by reading the data structure line by line.
In this embodiment, the contents of the data structure are read according to the rows, and the contents of the data structure are converted into the hypertext markup language description, so that the contents of the data structure can be completely traversed without omission, and meanwhile, the contents of the data structure are converted into the hypertext markup language description, so that the conversion of the form in the image into the web page format can be realized, and the subsequent step of converting the non-editable original form into the editable target form can be assisted.
As an alternative example, reading the contents of the data structure on a row-by-row basis, converting the contents of the data structure to a hypertext markup language description includes:
Calculating the number of rows or columns occupied by the merged cells in the case that the merged cells exist in the data structure;
in the process of sequentially reading the cells of each row, under the condition of reading the merged cells for the first time, converting the content of the data structure of the merged cells into a hypertext markup language description;
when the merged cell is read again, the merged cell is skipped in terms of the number of rows or columns.
In this embodiment, while the content of the data structure is read line by line and converted into the hypertext markup language description, a cell obtained by merging a plurality of basic cells is read only once. That is, for a merged cell, the number of rows or columns it occupies is recorded. When the merged cell is read for the first time, it is converted into the hypertext markup language description; when it is encountered again, it is skipped according to the recorded number of rows or columns, so that the merged cell is not read repeatedly.
In this embodiment, the purpose is to avoid repeated reading of the text content of the merged cells, and achieve the effect of ensuring the accuracy of the read data.
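A compact sketch of this row-by-row conversion, where positions covered by an already-emitted merged cell are marked and skipped, and the merged cell itself is written once with its row/column span (the grid layout and names are assumptions for illustration):

```python
def grid_to_html(grid):
    """grid: row-major list of rows; each entry is either None (a position covered by a
    merged cell that was already emitted) or a dict with the text and its spans."""
    html = ["<table>"]
    for row in grid:
        html.append("  <tr>")
        for cell in row:
            if cell is None:                     # already covered by a merged cell
                continue
            rs, cs = cell.get("rowspan", 1), cell.get("colspan", 1)
            span = (f' rowspan="{rs}"' if rs > 1 else "") + \
                   (f' colspan="{cs}"' if cs > 1 else "")
            html.append(f"    <td{span}>{cell['text']}</td>")
        html.append("  </tr>")
    html.append("</table>")
    return "\n".join(html)

# A 2x2 table whose first column is merged vertically.
grid = [
    [{"text": "A", "rowspan": 2}, {"text": "B"}],
    [None,                        {"text": "C"}],
]
print(grid_to_html(grid))
```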
As an alternative example, parsing the hypertext markup language description to obtain a target form includes:
creating an editable form file;
and filling the content of the original table into the editable table file according to the hypertext markup language description to obtain the target table.
Alternatively, the editable table file in this embodiment may be a file in a table format, or a file in another format that contains an editable table. For example, the file in a table format may be a file in Excel format, and the file in another format may be a Word or PPT (PowerPoint) file into which a table is inserted, so as to obtain an editable table file.
After the hypertext markup language description is obtained, in this embodiment an editable table file may be created, and the hypertext markup language description is then filled into the table file to obtain the editable target table. The target table may be stored or compressed. In this way, the table in the web page format is converted into an editable target table, so that the target table can be edited or saved.
As an alternative example, filling the content of the original table into the editable table file according to the hypertext markup language description to obtain the target table includes:
Reading code content in the hypertext markup language description line by line;
under the condition that the preset codes are read, acquiring target codes after the preset codes, wherein the target codes record form information of a target form;
and writing the target table in the editable table file according to the table information recorded in the target code.
Alternatively, in this embodiment, the code content in the HyperText Markup Language (HTML) description may be read line by line. When the code content is read line by line, if a preset code (a code configured in advance) is read, the target code after the preset code is acquired (the code content after the preset code, of a set length or number of lines, is taken as the target code). After the target code is acquired, the table information recorded by the target code is obtained and filled into the editable table file to form the target table.
According to the method, the editable form file can be obtained according to the hypertext markup language description, and the purpose of converting an original form which cannot be edited into a target form which can be edited is achieved.
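As one possible sketch of this parsing-and-writing step — using Python's standard HTML parser instead of the line-by-line preset-code matching described above, and assuming the third-party openpyxl package for writing the editable .xlsx file (merged-cell spans and styles are ignored here for brevity):

```python
from html.parser import HTMLParser
from openpyxl import Workbook   # third-party package, assumed available

class TableHTMLParser(HTMLParser):
    """Collect rows of cell text from <tr>/<td> tags in the HTML description."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            self.rows.append(self._row)

def html_to_xlsx(html_text: str, path: str) -> None:
    parser = TableHTMLParser()
    parser.feed(html_text)
    wb = Workbook()
    ws = wb.active
    for r, row in enumerate(parser.rows, start=1):
        for c, value in enumerate(row, start=1):
            ws.cell(row=r, column=c, value=value)
    wb.save(path)

html_to_xlsx("<table><tr><td>A</td><td>B</td></tr><tr><td>C</td><td>D</td></tr></table>",
             "target_table.xlsx")
```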
In this embodiment, the table grid lines are segmented by a table grid line segmentation algorithm, and the table cell structure is then restored by rules. The text information detected and recognized in the table is then filled into the corresponding cells. The recognized table and characters are expressed as a hypertext markup language description (a description in HTML format), and the description in HTML format is output in a table format, for example the xlsx format, and written into a file. In this way, the original table that cannot be edited is converted into a target table that can be edited, so that the user can edit the target table; the complicated operations otherwise needed when the user wants the table data are avoided, and the efficiency with which the user obtains the table data is improved.
In this embodiment, taking a scanned PDF file as the image as an example, the table data in the PDF file may be obtained as follows.
When the cell structure is identified, the cell structure of the original table is obtained from the segmentation result of the table grid lines through a series of rules.
As shown in FIG. 2, FIG. 2 is a drawing in an original PDF file, that is, an original table, and FIG. 3 is the table line segmentation diagram obtained through table line detection; the structure of the table is extracted without recognizing the text.
In this embodiment, one cell may be represented by one rectangular box marked with a circle, that is, each cell is marked with a circle (not shown in FIG. 3 for clarity of the drawing). The information in the structural description of the table is: the position of each cell, and the relationships between cells (e.g., cells in the same row, adjacent cells).
After the cell structure is extracted, the text attributes are extracted. The text detection and recognition results are filled into the corresponding cells, the matching of cells and text is completed, and the structure and content description of the table (that is, the content filled in the cells) is obtained.
A text detection result is obtained from the original image through a text detection model, and the result is filled into the cell structure. The text is matched against each cell in the table structure in order to determine which cell a given piece of text belongs to and its position in the cell (e.g., centered, left-aligned or right-aligned).
The structure and content description of the table is then converted into an HTML description. The table structure and content description is a data structure in which cells are stored by rows; if merged cells exist, the number of rows or columns occupied by the merged cells is calculated. In addition, the text content in a cell comprises: Optical Character Recognition (OCR) information, the cell it belongs to, and its relative position with respect to the upper left corner of that cell (e.g., centered, left-aligned, right-aligned). From this, the restored table structure and content description can be converted into an HTML description. FIG. 4 is an exemplary original table, the content of which does not affect the implementation of the method. FIG. 5 is a schematic diagram of converting the original table into an HTML description.
Finally, the HTML description is converted into an xlsx format file. In this step, the HTML file may be parsed to obtain information such as text content, text format (e.g., text color, font size, font style), table structure, and table style (e.g., format of table grid lines, i.e., dotted lines or solid lines).
The content obtained by parsing is converted into a description file meeting the requirements of the xlsx format. During conversion, the xlsx description file, the resource files corresponding to the xlsx and the configuration files may be compressed, and the compressed file is given the extension of an xlsx document. An xlsx file is a compressed file in zip format: after its suffix is changed to zip and it is decompressed, a number of files can be seen. The configuration files control the style of the document (including fonts, borders and cell style information) and its summary information (such as title, author and editing time), and the resource files mainly store the data in the table, such as the images and the characters in the table. FIG. 6 is a schematic diagram of an alternative conversion result.
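A quick way to see this package structure, assuming an .xlsx file such as the one written by the sketch above, is to list the members of the zip archive:

```python
import zipfile

# An .xlsx workbook is a ZIP package; its members include the configuration
# and resource parts described above.
with zipfile.ZipFile("target_table.xlsx") as xlsx:
    for name in xlsx.namelist():
        print(name)   # e.g. [Content_Types].xml, xl/workbook.xml, xl/worksheets/sheet1.xml
```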
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
According to another aspect of the embodiments of the present application, there is further provided a table extracting apparatus, as shown in fig. 7, including:
the identifying module 702 is configured to identify an original table to be extracted in the image, and extract a cell structure of the original table, where content in the original table is not editable;
an extracting module 704, configured to extract text attributes in the original table;
a conversion module 706 for converting the cell structure and text attributes into a hypertext markup language description;
the parsing module 708 is configured to parse the hypertext markup language description to obtain a target table, where contents in the target table are editable.
Alternatively, the image in this embodiment may be a picture, a photograph, or a file in Portable Document Format (PDF). The image contains an original table to be extracted, and the content in the original table is not editable. Note that "not editable" in this embodiment means that the content of the table in the image cannot be changed, i.e., it cannot be modified, added to, or deleted.
Alternatively, the present embodiment may be applied in a process of identifying a form in an image. One or more tables may be included in the image, and a portion of the table or the entire table may be included. The table in the image cannot be edited. By adopting the method, the table in the image is extracted as the editable target table.
In the embodiment of the invention, when the table in the image is identified, the cell structure of the table is extracted first; its purpose is to record the positions of the cells in the table and the relationships between them. The text attributes of the table are then extracted; their purpose is to record the text in the table and the relationship between the text and the cells. The cell structure and the text attributes are then converted into a hypertext markup language description, and the hypertext markup language description is parsed to obtain the editable target table. The target table may be stored as a table file or as a compressed package.
With this apparatus, an editable target table is obtained by parsing the non-editable table in the image, and the user can edit the target table or use the data in it, thereby achieving the purpose of converting the table data in the image into editable table data.
For other examples of this embodiment, please refer to the above examples, and are not described herein.
Fig. 8 is a block diagram of an alternative electronic device, according to an embodiment of the present application, including a processor 802, a communication interface 804, a memory 806, and a communication bus 808, as shown in fig. 8, wherein the processor 802, the communication interface 804, and the memory 806 communicate with each other via the communication bus 808, wherein,
a memory 806 for storing a computer program;
the processor 802, when executing the computer program stored on the memory 806, performs the following steps:
identifying an original table to be extracted in the image, and extracting a cell structure of the original table, wherein the content in the original table is not editable;
extracting text attributes in the original form;
converting the cell structure and text attributes into a hypertext markup language description;
and analyzing the hypertext markup language description to obtain a target table, wherein the content in the target table can be edited.
Alternatively, in the present embodiment, the above-described communication bus may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in FIG. 8, but this does not mean that there is only one bus or one type of bus. The communication interface is used for communication between the electronic device and other devices.
The memory may include RAM or may include non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
As an example, the memory 806 may include, but is not limited to, the identification module 702, the extraction module 704, the conversion module 706 and the parsing module 708 of the table extraction apparatus described above. In addition, other module units of the table extraction apparatus may also be included, which are not described in detail in this example.
The processor may be a general purpose processor and may include, but is not limited to, a CPU (Central Processing Unit), an NP (Network Processor), etc.; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.
It will be understood by those skilled in the art that the structure shown in FIG. 8 is only schematic, and the device implementing the table extraction method may be a terminal device, such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID) or a PAD. FIG. 8 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., a network interface, a display device, etc.) than shown in FIG. 8, or have a different configuration from that shown in FIG. 8.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing a terminal device to execute in association with hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, etc.
According to a further aspect of embodiments of the present invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, performs the steps in the above-described table extraction method.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, such as the division of the units, is merely a logical function division, and may be implemented in another manner, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (10)

1. A method for extracting a table, comprising:
identifying an original table to be extracted in an image, and extracting a cell structure of the original table, wherein the content in the original table is not editable;
extracting text attributes in the original table;
converting the cell structure and the text attribute into a hypertext markup language description;
and analyzing the hypertext markup language description to obtain a target table, wherein the content in the target table can be edited.
2. The method of claim 1, wherein the identifying the original form in the image to be extracted, extracting the cell structure of the original form comprises:
identifying each table line of the original table in the image;
combining the identified table lines to obtain the position of each cell in the original table and the relationships between the cells in the original table;
and taking the positions of the cells in the original table and the relationships between the cells as the cell structure of the original table.
3. The method of claim 2, wherein the identifying each table line of the original table in the image comprises:
identifying each straight line in the image to obtain a plurality of straight lines;
and selecting a rectangle with the largest area from the rectangles formed by the plurality of straight lines, and taking the side of the rectangle with the largest area and the straight line surrounded by the rectangle with the largest area as the table line.
4. The method of claim 1, wherein the extracting text attributes in the original form comprises:
Identifying each character in the image, and taking the identified character as text content of the original table when the character is identified and the character is located in the original table;
determining a target cell where each text content is located;
identifying a location of each of the text content in the target cell;
and taking the text content, the target cell where the text content is located and the position of the text content in the target cell as the text attribute of the original table.
5. The method of claim 4, wherein said identifying the location of each of said text content in said target cell comprises:
taking each text content as the current text content, and executing the following operations on the current text content:
identifying a distance between the current text content and a boundary of the target cell;
and determining the position of the current text content in the target cell according to the distance between the current text content and the boundary of the target cell.
6. The method of claim 1, wherein said converting said cell structure and said text attributes into a hypertext markup language description comprises:
Inserting the text content in the text attribute into the cell structure according to a target cell where the text content indicated by the text attribute is located and the position of the text content in the target cell, so as to obtain a data structure comprising the text content and the cell structure;
reading the content of the data structure according to the line, and converting the content of the data structure into the hypertext markup language description.
7. The method of claim 6, wherein reading the contents of the data structure in rows, converting the contents of the data structure to the hypertext markup language description comprises:
calculating the number of rows or columns occupied by the merged cells under the condition that the merged cells exist in the data structure;
in the process of sequentially reading the cells of each row, under the condition of reading the merged cells for the first time, converting the content of the data structure of the merged cells into the hypertext markup language description;
and skipping the merged cell according to the number of rows or the number of columns when the merged cell is read again.
8. The method of any of claims 1 to 7, wherein said parsing the hypertext markup language description to obtain a target form comprises:
creating an editable form file;
and filling the content of the original table into the editable table file according to the hypertext markup language description to obtain the target table.
9. The method of claim 8, wherein the filling the content of the original table into the editable table file according to the hypertext markup language description to obtain the target table comprises:
reading code content in the hypertext markup language description line by line;
under the condition that a preset code is read, acquiring an object code after the preset code, wherein the object code records table information of the object table;
and writing the target table in the editable table file according to the table information recorded in the target code.
10. A form extraction device, comprising:
the identification module is used for identifying an original table to be extracted in the image, and extracting a cell structure of the original table, wherein the content in the original table is not editable;
The extraction module is used for extracting text attributes in the original form;
the conversion module is used for converting the cell structure and the text attribute into a hypertext markup language description;
and the analysis module is used for analyzing the hypertext markup language description to obtain a target table, wherein the content in the target table can be edited.
CN202210692042.0A 2022-06-17 2022-06-17 Table extraction method and apparatus Pending CN117291152A (en)

Priority Applications (1)

Application Number: CN202210692042.0A
Priority Date: 2022-06-17
Filing Date: 2022-06-17
Title: Table extraction method and apparatus

Applications Claiming Priority (1)

Application Number: CN202210692042.0A
Priority Date: 2022-06-17
Filing Date: 2022-06-17
Title: Table extraction method and apparatus

Publications (1)

Publication Number: CN117291152A
Publication Date: 2023-12-26

Family

ID=89250532

Family Applications (1)

Application Number: CN202210692042.0A
Title: Table extraction method and apparatus
Status: Pending
Priority Date: 2022-06-17
Filing Date: 2022-06-17

Country Status (1)

Country Link
CN (1) CN117291152A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination