CN113762158A - Borderless table recovery model training method, device, computer equipment and medium - Google Patents

Borderless table recovery model training method, device, computer equipment and medium

Info

Publication number
CN113762158A
CN113762158A CN202111049186.6A CN202111049186A CN113762158A CN 113762158 A CN113762158 A CN 113762158A CN 202111049186 A CN202111049186 A CN 202111049186A CN 113762158 A CN113762158 A CN 113762158A
Authority
CN
China
Prior art keywords
image
text
borderless
training data
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111049186.6A
Other languages
Chinese (zh)
Inventor
张可昕
高寒冰
李果夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Asset Management Co Ltd
Original Assignee
Ping An Asset Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Asset Management Co Ltd
Priority to CN202111049186.6A
Publication of CN113762158A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Character Input (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and in particular to a borderless table recovery model training method, apparatus, computer device, and medium. The method comprises: acquiring original training data, wherein the original training data comprises a table image of a framed table in text data and text structure information; identifying the table border lines in the table image, preprocessing each identified table border line, and generating a borderless table image corresponding to the table image of the framed table; generating target training data according to the borderless table image and the text structure information of the corresponding framed table; and training the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model. The method can improve the efficiency of recovering borderless tables. The application also relates to the field of blockchain technology, in that each item of data can be uploaded to a blockchain.

Description

Borderless table recovery model training method, device, computer equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a borderless table recovery model training method, apparatus, computer device, and medium.
Background
Table content restoration and extraction is a task that the financial management industry often faces during its technological transformation; the task is to restore and reproduce the tables in document files according to their original content and style. At present, the annual audit reports, collection specifications, and other reports disclosed by large enterprises contain a large amount of table data in a variety of formats, including framed tables and borderless tables.
In the traditional approach, framed tables follow common data type scenarios, their headers and structures are relatively uniform, and restoring them is comparatively simple. Borderless tables, by contrast, take different forms for different parties (such as audit companies), have no fixed header or structure, and have no unified standard for restoration. Restoring a borderless table is therefore more difficult, and the restoration processing is less efficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a borderless table recovery model training method, apparatus, computer device, and medium capable of improving the efficiency of borderless table recovery processing.
A borderless table recovery model training method, the method comprising:
acquiring original training data, wherein the original training data comprises a table image of a framed table in text data and text structure information;
identifying the table border lines in the table image, preprocessing each identified table border line, and generating a borderless table image corresponding to the table image of the framed table;
generating target training data according to the borderless table image and the text structure information of the corresponding framed table;
and training the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model.
In one embodiment, identifying the table border lines in the table image, preprocessing each identified table border line, and generating a borderless table image corresponding to the table image of the framed table includes:
identifying each pixel point in the table image, and determining each table border line in the table image;
determining the adjacent cells corresponding to each table border line;
obtaining the cell pixel values of the cell pixel points in each adjacent cell;
and replacing, based on the cell pixel values, the pixel values of the border line pixel points of each table border line to generate a corresponding borderless table image.
In one embodiment, after determining each table border line in the table image, the method further includes:
determining whether an extension relationship exists between the table border lines;
when an extension relationship exists between table border lines, determining that the at least two table border lines having the extension relationship are the same table border line;
wherein replacing, based on the cell pixel values, the pixel values of the border line pixel points of each table border line to generate a corresponding borderless table image comprises:
replacing the pixel values of the border line pixel points of the at least two table border lines determined to be the same table border line, based on the cell pixel values of the adjacent cells corresponding to any one of those table border lines, to generate a corresponding borderless table image.
In one embodiment, replacing, based on the cell pixel values, the pixel values of the border line pixel points of each table border line to generate a corresponding borderless table image includes:
randomly determining target border lines from the table border lines determined in the table image, and replacing the pixel values of the border line pixel points of each target border line based on the cell pixel values of the cell pixel points in the adjacent cells of that target border line, to obtain a borderless table image.
In one embodiment, obtaining the original training data comprises:
acquiring original text data, wherein the original text data comprises a framed table;
splitting each item of original text data by page to obtain text pages;
recognizing the text titles, text contents, and tables in each text page, and determining the coordinate positions corresponding to the text titles, text contents, and tables;
based on each coordinate position, establishing text structure information corresponding to the original text data, and cropping table images from each text page;
generating the original training data based on the table images and the text structure information.
In one embodiment, the original training data further includes the number of cells of the framed table in the table image and position index data between the cells;
generating target training data according to the borderless table image and the text structure information of the corresponding framed table includes:
generating target training data based on the borderless table image, the text structure information, the number of cells of the borderless table, and the position index data between the cells;
training the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model comprises:
training the constructed initial recovery model with the target training data to obtain a trained borderless table recovery model.
In one embodiment, the method further includes:
uploading at least one of the table image, the text structure information, the borderless table image, and the target training data to a blockchain node for storage.
A borderless table recovery model training apparatus, the apparatus comprising:
an original training data acquisition module, configured to acquire original training data, wherein the original training data comprises a table image of a framed table in text data and text structure information;
a borderless table image generation module, configured to identify the table border lines in the table image, preprocess each identified table border line, and generate a borderless table image corresponding to the table image of the framed table;
a target training data generation module, configured to generate target training data according to the borderless table image and the text structure information of the corresponding framed table;
and a training module, configured to train the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method of any of the above embodiments when the processor executes the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above embodiments.
In the borderless table recovery model training method, apparatus, computer device, and medium described above, framed tables are obtained as initial training data, and the table border lines in the table images of the framed tables are identified and preprocessed to generate corresponding borderless table images. In this way, framed tables of many types and in large volume can be used to enlarge the amount of borderless table data and diversify its table formats; that is, data-enhanced target training data is obtained. When the borderless table recovery model is trained on the target training data, the quantity of training data and the variety of table formats available for training are increased, which can improve the accuracy of the trained model and, in turn, the accuracy of subsequent borderless table restoration performed with it. In addition, the borderless table image is combined with the text structure information for training, so that the training process incorporates the text structure information of the text data, which can improve the accuracy of table recognition and of restoration.
Drawings
FIG. 1 is a diagram of an application scenario of a borderless table recovery model training method in one embodiment;
FIG. 2 is a flowchart of a borderless table recovery model training method in one embodiment;
FIG. 3 is a diagram of a framed table in one embodiment;
FIG. 4 is a diagram of a borderless table in one embodiment;
FIG. 5 is a block diagram of a borderless table recovery model training apparatus in one embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The borderless table recovery model training method provided by the application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may interact with the user and perform subsequent data processing based on the user's instructions. Specifically, after receiving an instruction from the terminal 102, the server 104 may obtain the original training data, which includes a table image of a framed table in text data and text structure information. The server 104 may then identify the table border lines in the table image and preprocess each identified table border line to generate a borderless table image corresponding to the table image of the framed table. Further, the server 104 may generate target training data according to the borderless table image and the text structure information of the corresponding framed table, and train the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a borderless table recovery model training method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
Step S202, obtaining original training data, wherein the original training data comprises a table image of a framed table in text data and text structure information.
In the financial field, statistical analysis of the financial status of large enterprises is generally performed on data provided by each enterprise or company. Specifically, the data provided by each company is restored and stored, that is, processed into standard data in a uniform format and stored to facilitate subsequent data analysis.
The data (e.g., text data) provided by each company usually includes a large amount of table data, including framed tables and borderless tables. A framed table can be restored using its header structure, so its processing is simple; a borderless table is considerably harder to process.
In this embodiment, the server may obtain the original training data from the data provided by each enterprise, that is, obtain from that data the table images of framed tables and the text structure information of the text data, such as the text titles, the text hierarchy, and the hierarchical relationship between tables and text content.
In this embodiment, because the table styles in the data provided by different enterprises may differ, the framed tables in the obtained table images may also differ, so the original training data obtained by the server may include table images of framed tables in a plurality of different styles together with the corresponding text structure information.
Step S204, identifying the table border lines in the table image, preprocessing each identified table border line, and generating a borderless table image corresponding to the table image of the framed table.
Specifically, after acquiring the original training data, the server may identify and detect the table border lines in each table image in the original training data.
Specifically, the server may identify each table border line in each table image through an OCR (Optical Character Recognition) algorithm or another algorithm.
In this embodiment, after determining each table border line in the table image, the server may preprocess the table border lines of the table image, for example by removing some or all of them, to generate a borderless table image.
Those skilled in the art will appreciate that in a borderless table image the border lines of the table may be partially or entirely absent; the term does not refer only to the absence of all border lines.
In this embodiment, because framed tables exist in large volume and in many types, borderless tables of large volume and many types can be generated from them, which increases the quantity and quality of the generated borderless tables and achieves data enhancement of the borderless table images.
Step S206, generating target training data according to the borderless table image and the text structure information of the corresponding framed table.
In this embodiment, after obtaining the borderless table image of the framed table in the original training data, the server may combine the borderless table image with the corresponding text structure information of the text data to generate the target training data, and use the target training data for the subsequent training of the initial recovery model.
Step S208, training the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model.
The initial recovery model may be a neural network model based on machine learning, for example, a Convolutional Neural Network (CNN), a Deep Convolutional Neural Network (DCNN), a Generative Adversarial Network (GAN), a Recurrent Neural Network (RNN), or a Long Short-Term Memory (LSTM) network, which is not limited in this application.
In this embodiment, the server may construct an initial recovery model based on the user instruction, and train the initial recovery model through the target training data.
Specifically, the server may set training parameters of the model, such as the number of training rounds, the learning rate, and the passing rate, and train the initial recovery model based on these training parameters.
In this embodiment, the server may label the target training data, for example by labeling text blocks, and train the initial recovery model based on the labeled target training data.
In this embodiment, the server may divide the target training data into a training data set and a test data set, where the training data set is used for training the model and the test data set is used for testing the model. When both the training and the testing pass, a trained borderless table recovery model is obtained.
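The division into a training data set and a test data set can be sketched as follows. This is a minimal illustration only: the function name `split_dataset`, the 20% test ratio, and the fixed seed are assumptions made for the example and are not specified in the application.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], test_ratio: float = 0.2,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them into (training set, test set).

    test_ratio and seed are illustrative defaults, not values from the source.
    """
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = split_dataset(range(10), test_ratio=0.2)
```

Every sample ends up in exactly one of the two sets, so the test data set stays unseen during training.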
In this embodiment, the server may use the trained borderless table recovery model to restore text data to be restored, where the text data to be restored may include a borderless table.
Specifically, the server may input the text data to be restored into the borderless table recovery model, identify the structure of each piece of content in the text data through the model, and generate corresponding data in a unified standard format based on the text content of the borderless table in the text data to be restored. The server may subsequently perform data analysis on the generated standard-format data and generate a corresponding report, such as an annual audit report or an annual assessment report.
In the borderless table recovery model training method described above, framed tables are obtained as initial training data, and the table border lines of the table images of the framed tables are identified and preprocessed to generate corresponding borderless table images, so that framed tables of many types and in large volume can be used to enlarge the amount of borderless table data and diversify its table formats, yielding data-enhanced target training data. In addition, the borderless table image is combined with the text structure information for training, so that the training process incorporates the text structure information of the text data, which can improve the accuracy of table recognition and of restoration.
In one embodiment, identifying the table border lines in the table image, preprocessing each identified table border line, and generating a borderless table image corresponding to the table image of the framed table may include: identifying each pixel point in the table image, and determining each table border line in the table image; determining the adjacent cells corresponding to each table border line; obtaining the cell pixel values of the cell pixel points in each adjacent cell; and replacing, based on the cell pixel values, the pixel values of the border line pixel points of each table border line to generate a corresponding borderless table image.
In this embodiment, after acquiring the table images, the server may traverse each pixel point in each table image and determine the table border lines in the table image from the pixel points.
Specifically, the server may determine whether pixel points belong to a table border line according to the relationship between pixel points with the same pixel value. For example, referring to FIG. 3, if the pixel values of a preset number (e.g., N) of pixel points in the horizontal or vertical direction are not 0, those pixel points may be determined to be the pixel points of a table border line, and the table border line is thereby determined. Alternatively, if the pixel points whose pixel value is not 0 form a plurality of rectangular areas in the horizontal or vertical direction, and those rectangular areas connect to form a closed loop, the pixel points whose pixel value is not 0 may be determined to be the pixel points of table border lines. For example, if rectangular areas 1, 2, 3, and 4, each made up of pixel points, connect end to end to form a closed loop, then rectangular areas 1, 2, 3, and 4 are each a table border line.
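The first rule above (a preset number N of non-zero pixel points in the horizontal or vertical direction) can be sketched as a run-length scan. The sketch assumes the N pixel points are consecutive and that the image is a grayscale grid in which 0 is background; the names `find_runs` and `detect_border_lines` and the tuple layout of a line are hypothetical and not taken from the application.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]              # grayscale rows; 0 is assumed to be background
Line = Tuple[str, int, int, int]     # (orientation, fixed index, start, end inclusive)


def find_runs(values: Sequence[int], min_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs of non-zero runs at least min_len long."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v != 0 and start is None:
            start = i                        # a run begins
        elif v == 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))  # a long enough run ends
            start = None
    if start is not None and len(values) - start >= min_len:
        runs.append((start, len(values) - 1))
    return runs


def detect_border_lines(image: Image, min_len: int) -> List[Line]:
    """Scan every row and column for runs of at least min_len non-zero pixels."""
    if not image:
        return []
    lines: List[Line] = []
    for y, row in enumerate(image):
        lines += [("h", y, s, e) for s, e in find_runs(row, min_len)]
    for x in range(len(image[0])):
        column = [row[x] for row in image]
        lines += [("v", x, s, e) for s, e in find_runs(column, min_len)]
    return lines


# A 5x5 image whose outer ring is a table border (non-zero pixels).
box = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
border_lines = detect_border_lines(box, min_len=4)
```

The threshold `min_len` plays the role of the preset number N: short non-zero runs, such as strokes of text inside a cell, fall below it and are not reported as border lines.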
In this embodiment, after determining each table border line, the server may determine the adjacent cells corresponding to each table border line. For example, for table border line 1 and table border line 2, the server may determine that the corresponding adjacent cell is A; for table border line 3, the corresponding adjacent cells are A and B; and for table border line 4, the corresponding adjacent cells are A and C.
In this embodiment, for table border line 1 and table border line 2, the server may obtain the cell pixel values of the cell pixel points of adjacent cell A and use them to replace the pixel values of the border line pixel points of table border line 1 and table border line 2. For table border line 3, the server may obtain the cell pixel values of the cell pixel points of adjacent cell A or adjacent cell B and use them for the replacement. Similarly, for table border line 4, the server may obtain the cell pixel values of the cell pixel points of adjacent cell A or adjacent cell C and use them for the replacement.
In this embodiment, the server may replace the pixel values of the pixel points of some or all of the border lines of the framed table to generate a borderless table corresponding to the framed table, that is, a borderless table image corresponding to the table image.
In the above embodiment, each pixel point is identified to determine the table border lines and their adjacent cells, and the pixel values of the border line pixel points of each table border line are then replaced with the cell pixel values of the cell pixel points of the adjacent cells. After a table border line is removed, the pixel values at the replaced positions are therefore consistent with the cell pixel values of the adjacent cells, which can improve the accuracy of the generated borderless table.
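The replacement step can be illustrated with a small sketch that copies the pixel value of one point inside an adjacent cell over every pixel of a border line. The function name `erase_line`, the `(orientation, fixed index, start, end)` line layout, and the idea of sampling a single cell point are assumptions made for the example; the application does not specify how the cell pixel value is sampled.

```python
from typing import List, Tuple

Image = List[List[int]]              # grayscale rows of pixel values
Line = Tuple[str, int, int, int]     # (orientation, fixed index, start, end inclusive)


def erase_line(image: Image, line: Line, cell_point: Tuple[int, int]) -> Image:
    """Return a copy of image with the line's pixels replaced by a cell pixel value.

    cell_point is the (row, column) of a pixel inside an adjacent cell.
    """
    orientation, fixed, start, end = line
    cell_row, cell_col = cell_point
    fill = image[cell_row][cell_col]     # cell pixel value used for replacement
    result = [row[:] for row in image]   # leave the framed original untouched
    for i in range(start, end + 1):
        if orientation == "h":
            result[fixed][i] = fill      # horizontal line: fixed row
        else:
            result[i][fixed] = fill      # vertical line: fixed column
    return result


framed = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
# Remove the top border line using the value of the cell's centre pixel.
without_top = erase_line(framed, ("h", 0, 0, 4), cell_point=(2, 2))
```

Working on a copy keeps the framed image available, so the same source image can yield several borderless variants.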
In one embodiment, after determining each table border line in the table image, the method may further include: determining whether an extension relationship exists between the table border lines; and when an extension relationship exists between table border lines, determining that the at least two table border lines having the extension relationship are the same table border line.
In this embodiment, the server may determine whether an extension relationship exists between table border lines based on their orientations and coordinate positions. For example, with continued reference to FIG. 3, table border line 1 and table border line 5 are both vertically oriented and their coordinate positions in the horizontal direction are consistent, so the server may determine that an extension relationship exists between them. Similarly, table border line 2, table border line 6, and table border line 7 have the same orientation and consistent coordinate positions in the vertical direction, so it can be determined that they are in an extension relationship. In contrast, table border line 2 and table border line 4, table border line 3 and table border line 1, and table border line 3 and table border line 5 have identical orientations in each pair, but their coordinate positions are not consistent, so no extension relationship exists within any of these pairs.
Further, the server may determine that the at least two table border lines having an extension relationship are the same table border line; that is, it may determine that table border line 1 and table border line 5 are the same table border line, and that table border line 2, table border line 6, and table border line 7 are the same table border line.
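One way to picture the extension check is to group lines that share both an orientation and a fixed coordinate. The sketch below assumes each line is stored as `(orientation, fixed index, start, end)` and that "consistent coordinate positions" means an identical fixed index; the function name and the sample coordinates are hypothetical and do not reproduce FIG. 3.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Line = Tuple[str, int, int, int]     # (orientation, fixed index, start, end inclusive)


def group_extended_lines(lines: List[Line]) -> List[List[Line]]:
    """Group lines with the same orientation and the same fixed coordinate.

    Each returned group is treated as one table border line.
    """
    groups: Dict[Tuple[str, int], List[Line]] = defaultdict(list)
    for line in lines:
        orientation, fixed, _start, _end = line
        groups[(orientation, fixed)].append(line)
    # Order the segments of each group along their direction of travel.
    return [sorted(group, key=lambda seg: seg[2]) for group in groups.values()]


segments = [
    ("v", 0, 0, 4),   # a vertical segment in column 0
    ("h", 0, 0, 9),   # a horizontal segment in row 0
    ("v", 0, 6, 9),   # continues the first segment further down column 0
    ("v", 5, 0, 4),   # vertical, but in a different column
]
merged = group_extended_lines(segments)
```

Here the two column-0 segments form one group, while the column-5 segment stays separate because its coordinate position differs even though its orientation matches.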
In this embodiment, replacing, based on the cell pixel values, the pixel values of the border line pixel points of each table border line to generate a corresponding borderless table image may include: replacing the pixel values of the border line pixel points of the at least two table border lines determined to be the same table border line, based on the cell pixel values of the adjacent cells corresponding to any one of those table border lines, to generate a corresponding borderless table image.
In this embodiment, the server may replace the pixel values of the border line pixel points of the at least two table border lines determined to be the same table border line according to the cell pixel values of the adjacent cells corresponding to any one of them. For example, with continued reference to FIG. 3, the server may replace table border line 1 and table border line 5 simultaneously, based on either the cell pixel values of the cell pixel points of adjacent cell A, which corresponds to table border line 1, or the cell pixel values of the cell pixel points of adjacent cell C, which corresponds to table border line 5.
In one embodiment, the server may decide which adjacent cell's pixel value to use for replacing the pixel values of a table border line by comparing the cell pixel values of the cell pixel points of the adjacent cells; for example, the server may determine that the smaller pixel value is used to replace the table border line.
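As a rough sketch of this selection rule, the function below erases every segment of a merged border line with the smallest of the candidate cell pixel values. The name `erase_line_group` and the way candidate values are passed in are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]              # grayscale rows of pixel values
Line = Tuple[str, int, int, int]     # (orientation, fixed index, start, end inclusive)


def erase_line_group(image: Image, group: Sequence[Line],
                     candidate_values: Sequence[int]) -> Image:
    """Replace all segments of one merged border line with the smallest candidate value."""
    fill = min(candidate_values)         # "smaller pixel value" rule from the text
    result = [row[:] for row in image]
    for orientation, fixed, start, end in group:
        for i in range(start, end + 1):
            if orientation == "h":
                result[fixed][i] = fill
            else:
                result[i][fixed] = fill
    return result


# Column 1 is a border line (value 9) split into two segments; the cells on
# either side have pixel values 0 and 2.
table = [[0, 9, 2], [0, 9, 2], [0, 9, 2], [0, 9, 2]]
cleaned = erase_line_group(table, [("v", 1, 0, 1), ("v", 1, 2, 3)],
                           candidate_values=[0, 2])
```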
In one embodiment, replacing, based on the cell pixel values, the pixel values of the border line pixel points of each table border line to generate a corresponding borderless table image may include: randomly determining target border lines from the table border lines determined in the table image, and replacing the pixel values of the border line pixel points of each target border line based on the cell pixel values of the cell pixel points in the adjacent cells of that target border line, to obtain a borderless table image.
Specifically, when replacing the pixel values of the table border lines based on the cell pixel values, the server may use a random algorithm to select a random number of the determined table border lines as target border lines, and replace the pixel values of the border line pixel points of each target border line according to the cell pixel values of the cell pixel points in its adjacent cells, so as to obtain the borderless table image.
For example, with continued reference to FIG. 3, the server may determine through the random algorithm to replace only table border line 3, or only table border line 1 and table border line 5, or all of the table border lines, and the resulting borderless table may be as shown in (a) or (b) of FIG. 4.
In this embodiment, randomly replacing some or all of the table border lines produces borderless tables, and hence borderless table images, of more varied types, which further achieves data enhancement of the borderless table images and can improve the accuracy of subsequent model training.
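The random selection of target border lines might look like the following sketch, which draws a random count and then a random subset of that size. The function name `pick_target_lines` and the use of a seeded `random.Random` instance are assumptions; the application only states that a random algorithm is used.

```python
import random
from typing import List, Tuple

Line = Tuple[str, int, int, int]     # (orientation, fixed index, start, end inclusive)


def pick_target_lines(lines: List[Line], rng: random.Random) -> List[Line]:
    """Pick a random number (at least one) of distinct border lines to erase."""
    if not lines:
        return []
    count = rng.randint(1, len(lines))   # randint includes both endpoints
    return rng.sample(lines, count)      # sample draws without repetition


all_lines = [("h", 0, 0, 4), ("h", 4, 0, 4), ("v", 0, 0, 4), ("v", 4, 0, 4)]
targets = pick_target_lines(all_lines, random.Random(42))
```

Calling the function repeatedly with different seeds on the same framed table gives different subsets, which is what produces several borderless variants from one source image.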
In one embodiment, obtaining original training data may include: acquiring original text data, wherein the original text data includes a framed table; splitting each piece of original text data by page to obtain text pages; recognizing the text titles, text content and text tables of each text page, and determining the coordinate position corresponding to each text title, text content and text table; based on the coordinate positions, establishing text structure information corresponding to the original text data, and cropping table images from the text pages; and generating original training data based on the table images and the text structure information.
The text data may be in any of various formats, for example a Word or PDF document.
In this embodiment, after the server obtains the text data, the server may split the text data according to the page number to generate corresponding text pages.
Further, the server may process each text page with an OCR algorithm to identify and locate each text title, text content, and table in the page.
In this embodiment, the text titles may include multi-level titles, such as a first-level title, a second-level title, a third-level title, and the like, which is not limited in this application. The text content refers to pure text content, such as a whole paragraph of text. The tables are framed tables, and may also include borderless tables.
In this embodiment, the server may establish text structure information based on the determined text information, for example how many title levels there are and under which title level each piece of text content and each table is located.
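As a non-limiting illustration, the text structure information could be derived from the recognized elements roughly as in the Python sketch below. It assumes the elements are already in reading order and that each title carries a numeric level; the `Element` record and the `build_structure` function are hypothetical names and are not part of this application.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    kind: str                     # "title", "content" or "table"
    text: str
    level: Optional[int] = None   # title level (1 = first-level title); None for non-titles


def build_structure(elements: List[Element]) -> List[Tuple[Tuple[str, ...], str, str]]:
    """Return (title_path, kind, text) for every content or table element, in reading order."""
    path: List[str] = []          # currently open titles, index 0 = first-level title
    records = []
    for el in elements:
        if el.kind == "title":
            # A level-n title closes any open title at level n or deeper.
            del path[el.level - 1:]
            path.append(el.text)
        else:
            records.append((tuple(path), el.kind, el.text))
    return records
```

Each record then states under which chain of titles a given paragraph or table sits, which is the kind of information this embodiment describes.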
In this embodiment, the server may crop the table image from the text page according to the coordinate position of the table. For example, the server may extend the extent of the table outward by a margin of a preset number of pixels and crop at that boundary to obtain the table image.
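As a non-limiting illustration, cropping with a preset pixel margin could be sketched in Python as follows. The page is assumed to be a grid of pixels stored as nested lists and the table position an inclusive `(left, top, right, bottom)` box; the function name `crop_with_margin` and the clamping at the page edge are assumptions of this sketch.

```python
from typing import List, Tuple

Image = List[List[int]]  # pixels indexed as page[row][col]


def crop_with_margin(page: Image, box: Tuple[int, int, int, int], margin: int) -> Image:
    """Crop `box` = (left, top, right, bottom), inclusive, widened by `margin` pixels on every side."""
    height, width = len(page), len(page[0])
    left, top, right, bottom = box
    # Clamp the widened box so it never leaves the page.
    x0 = max(left - margin, 0)
    y0 = max(top - margin, 0)
    x1 = min(right + margin, width - 1)
    y1 = min(bottom + margin, height - 1)
    return [row[x0:x1 + 1] for row in page[y0:y1 + 1]]
```

Keeping a small margin means the outer border lines of the table are fully contained in the cropped image rather than cut at the edge.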
In this embodiment, the server may store the table image in association with the text structure information to obtain the corresponding original training data.
In one embodiment, the raw training data may further include the number of cells of the framed table in the table image and position index data between the cells.
In this embodiment, the original training data acquired by the server may further include the number of cells in the framed table and the position index data between the cells. With continued reference to fig. 3, the original training data may indicate 5 cells, in which cell A, cell B, and cell D are horizontally adjacent to one another, cell C and cell E are horizontally adjacent to each other, cell A, cell B, and cell D are vertically adjacent to cell C and cell E, and cell C and cell E correspond to cell B.
In this embodiment, the position index data between the cells may also refer to relative coordinate position data, such as the upper left and lower right coordinates of each cell, or the center point coordinates of a cell together with its length, and the like, which is not limited in this application.
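As a non-limiting illustration, position index data of both kinds mentioned above (adjacency between cells and coordinate encodings) could be computed from cell bounding boxes as in the Python sketch below. It assumes that neighbouring cells share an edge coordinate exactly; the `Cell` record, the `adjacency` function, and the layout used in the usage note are hypothetical and do not reproduce fig. 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Cell:
    name: str
    left: int
    top: int
    right: int
    bottom: int

    def center_and_size(self) -> Tuple[float, float, int, int]:
        """Alternative encoding of the same box: (center_x, center_y, width, height)."""
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2,
                self.right - self.left, self.bottom - self.top)


def overlaps(a0: int, a1: int, b0: int, b1: int) -> bool:
    """True when intervals [a0, a1] and [b0, b1] share more than a single point."""
    return min(a1, b1) > max(a0, b0)


def adjacency(cells: List[Cell]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each cell name to (neighbour, direction) pairs, with direction "right" or "below"."""
    index: Dict[str, List[Tuple[str, str]]] = {c.name: [] for c in cells}
    for a in cells:
        for b in cells:
            if a is b:
                continue
            if a.right == b.left and overlaps(a.top, a.bottom, b.top, b.bottom):
                index[a.name].append((b.name, "right"))
            if a.bottom == b.top and overlaps(a.left, a.right, b.left, b.right):
                index[a.name].append((b.name, "below"))
    return index
```

For a hypothetical layout with cells A and B side by side above a wide cell C, `adjacency` reports B to the right of A and C below both A and B, while `center_and_size` gives the center-based encoding of any cell.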
In this embodiment, generating target training data according to the borderless form image and the text structure information of the corresponding framed form may include: target training data is generated based on the borderless table image, the text structure information, the number of cells of the borderless table, and the position index data between the cells.
In this embodiment, training the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model may include: training the constructed initial recovery model with the target training data to obtain the trained borderless table recovery model.
Further, the server may use the borderless table image, the text structure information, the number of cells of the borderless table, and the position index data between the cells as target training data and use them for training of the initial restoration model.
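As a non-limiting illustration, one item of target training data combining these four parts could be represented as in the Python sketch below. The field names, the use of an image file path, and the JSON serialization are assumptions of this sketch and are not specified by this application.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class TargetSample:
    borderless_image_path: str          # hypothetical: the borderless table image stored as a file
    title_path: List[str]               # text structure information for the table
    cell_count: int                     # number of cells
    cell_boxes: Dict[str, List[int]]    # position index data: name -> [left, top, right, bottom]


def make_sample(image_path: str, title_path: List[str],
                cell_boxes: Dict[str, List[int]]) -> TargetSample:
    """Bundle the borderless image with the labels taken from its framed source table."""
    return TargetSample(image_path, list(title_path), len(cell_boxes), dict(cell_boxes))


def to_json_line(sample: TargetSample) -> str:
    """Serialize one sample as a single JSON line, e.g. for a training manifest."""
    return json.dumps(asdict(sample), sort_keys=True)
```

Because the labels come from the framed table, the model sees a borderless image as input while the supervision still describes the true cell layout.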
In one embodiment, the method may further include: uploading at least one of the form image, the text structure information, the borderless form image and the target training data to a blockchain node for storage.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block.
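As a non-limiting illustration of how data blocks can be linked by a cryptographic method, the Python sketch below chains blocks by storing in each block the SHA-256 digest of its predecessor. It is a single-process toy with no consensus mechanism or network layer, and the names `Block`, `append_block` and `is_valid` are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    index: int
    payload: str     # e.g. a digest or identifier of a stored table image
    prev_hash: str   # digest of the previous block, which links the chain

    def digest(self) -> str:
        body = json.dumps([self.index, self.payload, self.prev_hash])
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain: List[Block], payload: str) -> None:
    """Append a block whose prev_hash commits to the current last block."""
    prev_hash = chain[-1].digest() if chain else "0" * 64
    chain.append(Block(len(chain), payload, prev_hash))


def is_valid(chain: List[Block]) -> bool:
    """Check that every block still matches the digest recorded by its successor."""
    return all(chain[i].prev_hash == chain[i - 1].digest() for i in range(1, len(chain)))
```

Altering the payload of any earlier block changes its digest, so the link recorded in the following block no longer matches and `is_valid` returns `False`.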
Specifically, the blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In this embodiment, the server may upload one or more of the table image, the text structure information, the borderless table image, and the target training data to a blockchain node for storage, so as to ensure the privacy and security of the data.
In the above embodiment, at least one of the table image, the text structure information, the borderless table image, and the target training data is uploaded to the blockchain and stored in a blockchain node, so that the privacy of the data stored in the blockchain node can be ensured and the security of the data can be improved.
In one embodiment, the data processing may acquire and process the relevant data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a borderless table recovery model training apparatus, including: an original training data obtaining module 100, a borderless table image generation module 200, a target training data generating module 300, and a training module 400, wherein:
The original training data obtaining module 100 is configured to obtain original training data, where the original training data includes a table image of a framed table in text data and text structure information.
The borderless table image generation module 200 is configured to identify the table border lines in the table image, and preprocess each identified table border line to generate a borderless table image corresponding to the table image of the framed table.
The target training data generating module 300 is configured to generate target training data according to the borderless table image and the text structure information of the corresponding framed table.
The training module 400 is configured to train the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model.
In one embodiment, the borderless table image generation module 200 may include:
A table border line determining submodule, configured to identify each pixel point in the table image and determine each table border line in the table image.
An adjacent cell determining submodule, configured to determine the adjacent cells corresponding to each table border line based on the table border lines.
A cell pixel value acquisition submodule, configured to acquire the cell pixel values of the cell pixel points in each adjacent cell.
A borderless table image generation submodule, configured to perform pixel value replacement on the border line pixel points of each table border line based on the cell pixel values to generate a corresponding borderless table image.
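As a non-limiting illustration of the first of these submodules only, table border lines could be located from pixel values as in the Python sketch below. It assumes a grayscale image in which border lines are dark and span nearly the full width or height of the table, which is a simplification that ignores partial lines; the function names and the `dark` and `coverage` thresholds are hypothetical.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, indexed as image[row][col]


def find_line_rows(image: Image, dark: int = 128, coverage: float = 0.9) -> List[int]:
    """Rows in which at least `coverage` of the pixels are darker than `dark`."""
    width = len(image[0])
    return [r for r, row in enumerate(image)
            if sum(1 for v in row if v < dark) >= coverage * width]


def find_line_cols(image: Image, dark: int = 128, coverage: float = 0.9) -> List[int]:
    """Columns in which at least `coverage` of the pixels are darker than `dark`."""
    height = len(image)
    return [c for c in range(len(image[0]))
            if sum(1 for row in image if row[c] < dark) >= coverage * height]
```

The returned row and column indices delimit the cells, so the cells adjacent to a given line are those lying immediately on either side of its index.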
In one embodiment, the apparatus may further include:
A judging module, configured to judge, after the table border lines in the table image are determined, whether an extension relationship exists between the table border lines, and, when an extension relationship is determined to exist, to determine that the at least two table border lines having the extension relationship are the same table border line.
In this embodiment, the borderless table image generation submodule is configured to, based on a cell pixel value of an adjacent cell corresponding to any one of at least two table frame lines having an extension relationship, perform pixel value replacement on frame line pixel points of at least two table frame lines determined as the same table frame line, and generate a corresponding borderless table image.
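As a non-limiting illustration, treating border lines that have an extension relationship as one line could be sketched in Python as follows. The sketch assumes that an extension relationship means two segments lie on the same row and touch or nearly touch end to end; that interpretation, the `gap` tolerance, and the names `Segment` and `merge_extensions` are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Horizontal segment on row `pos`, from column `start` to `end` (inclusive).

    Vertical segments can be handled the same way with rows and columns swapped.
    """
    pos: int
    start: int
    end: int


def merge_extensions(segments: List[Segment], gap: int = 1) -> List[Segment]:
    """Merge segments on the same row whose ends are at most `gap` pixels apart."""
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: (s.pos, s.start)):
        last = merged[-1] if merged else None
        if last is not None and last.pos == seg.pos and seg.start <= last.end + gap:
            # `seg` continues `last`: treat both as the same border line.
            merged[-1] = Segment(last.pos, last.start, max(last.end, seg.end))
        else:
            merged.append(seg)
    return merged
```

Once merged, a single fill value taken from a cell adjacent to any part of the line can be applied along its whole length, which keeps the erased line visually uniform.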
In one embodiment, the borderless table image generation submodule is configured to randomly determine a target frame line from the table frame lines determined by the table image, and replace pixel values of pixel points of each frame line of the target frame line based on cell pixel values of pixel points of cells in adjacent cells of the target frame line, so as to obtain the borderless table image.
In one embodiment, the raw training data obtaining module 100 may include:
An original text data acquisition submodule, configured to acquire original text data, where the original text data includes a framed table.
A text page generation submodule, configured to split each piece of original text data by page to obtain text pages.
A determining submodule, configured to identify the text titles, text content and text tables of each text page, and determine the coordinate position corresponding to each text title, text content and text table.
A cropping submodule, configured to establish text structure information corresponding to the original text data based on the coordinate positions, and crop table images from the text pages.
An original training data generation submodule, configured to generate original training data based on the table images and the text structure information.
In one embodiment, the raw training data may further include the number of cells of the framed table in the table image and position index data between the cells.
In this embodiment, the target training data generating module 300 is configured to generate target training data based on the borderless table image, the text structure information, the number of cells of the framed table, and the position index data between the cells.
In this embodiment, the training module 400 is configured to train the constructed initial recovery model through the target training data, so as to obtain a trained borderless table recovery model.
In one embodiment, the apparatus may further include:
A storage module, configured to upload at least one of the table image, the text structure information, the borderless table image and the target training data to a blockchain node for storage.
For specific limitations of the borderless table recovery model training apparatus, reference may be made to the above limitations of the borderless table recovery model training method, which are not repeated here. All or part of the modules in the borderless table recovery model training apparatus may be implemented in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as table images, text structure information, borderless table images, target training data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a borderless table recovery model training method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program: acquiring original training data, wherein the original training data comprises a form image with a frame form in text data and text structure information; identifying the table border lines in the table image, preprocessing each identified table border line, and generating a borderless table image of the table image corresponding to the table with the border; generating target training data according to the borderless form image and the corresponding text structure information of the framed form; and training the constructed initial recovery model based on the target training data to obtain a trained frameless table recovery model.
In one embodiment, the processor, when executing the computer program, implements identifying the frame lines in the table image, and pre-processes each identified frame line in the table image to generate a borderless table image corresponding to the table image of the framed table, and may include: identifying each pixel point in the table image, and determining each table frame line in the table image; determining adjacent cells corresponding to each table frame line based on each table frame line; obtaining cell pixel values of cell pixel points in each adjacent cell; and based on the pixel value of each cell, carrying out pixel value replacement on the frame line pixel point of each table frame line to generate a corresponding frameless table image.
In one embodiment, after determining the table border lines in the table image, the processor, when executing the computer program, may further implement the following steps: judging whether an extension relationship exists between the table border lines; and when an extension relationship is determined to exist between the table border lines, determining that the at least two table border lines having the extension relationship are the same table border line.
In this embodiment, when the processor executes the computer program, the implementing, based on the pixel value of each cell, the pixel value replacement of the frame line pixel point of each table frame line to generate the corresponding frameless table image may include: and carrying out pixel value replacement on the frame line pixel points of at least two table frame lines determined as the same table frame line based on the cell pixel values of the adjacent cells corresponding to any one of the at least two table frame lines with the extension relation, so as to generate a corresponding frameless table image.
In one embodiment, when the processor executes the computer program, the performing border line pixel value replacement on each table border line based on each cell pixel value to generate a corresponding borderless table image may include: and randomly determining a target frame line from the table frame lines determined by the table image, and replacing the pixel values of the pixel points of each frame line of the target frame line based on the cell pixel values of the pixel points of the cell in the adjacent cells of the target frame line to obtain a frameless table image.
In one embodiment, the obtaining of the raw training data when the computer program is executed by the processor may include: acquiring original text data, wherein the original text data comprises a frame table; splitting each original text data according to the page number of the file to obtain each text page; recognizing text titles, text contents and text tables of the text pages, and determining coordinate positions corresponding to the text titles, the text contents and the text tables; based on each coordinate position, establishing text structure information corresponding to the original text data, and cutting form images from each text page; based on the tabular images and the text structure information, original training data is generated.
In one embodiment, the raw training data may further include the number of cells of the framed table in the table image and position index data between the cells.
In this embodiment, when executing the computer program, the processor generates the target training data according to the borderless form image and the text structure information of the corresponding framed form, and the generating may include: target training data is generated based on the borderless table image, the text structure information, the number of cells of the borderless table, and the position index data between the cells.
In this embodiment, when executing the computer program, the processor performs training on the constructed initial recovery model based on the target training data to obtain a recovery model of the borderless table after training, which may include: and training the constructed initial recovery model through target training data to obtain a trained frameless table recovery model.
In one embodiment, the processor, when executing the computer program, may further implement the following steps: uploading at least one of the form image, the text structure information, the borderless form image and the target training data to a blockchain node for storage.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring original training data, wherein the original training data comprises a form image with a frame form in text data and text structure information; identifying the table border lines in the table image, preprocessing each identified table border line, and generating a borderless table image of the table image corresponding to the table with the border; generating target training data according to the borderless form image and the corresponding text structure information of the framed form; and training the constructed initial recovery model based on the target training data to obtain a trained frameless table recovery model.
In one embodiment, the computer program when executed by the processor for implementing the steps of identifying the border lines in the table image and preprocessing each identified border line in the table image to generate a borderless table image corresponding to the table image of the table with borders may include: identifying each pixel point in the table image, and determining each table frame line in the table image; determining adjacent cells corresponding to each table frame line based on each table frame line; obtaining cell pixel values of cell pixel points in each adjacent cell; and based on the pixel value of each cell, carrying out pixel value replacement on the frame line pixel point of each table frame line to generate a corresponding frameless table image.
In one embodiment, after determining the table border lines in the table image, the computer program, when executed by the processor, may further implement the following steps: judging whether an extension relationship exists between the table border lines; and when an extension relationship is determined to exist between the table border lines, determining that the at least two table border lines having the extension relationship are the same table border line.
In this embodiment, when executed by the processor, the implementing pixel value replacement for the frame line pixel point of each table frame line based on the pixel value of each cell to generate a corresponding borderless table image may include: and carrying out pixel value replacement on the frame line pixel points of at least two table frame lines determined as the same table frame line based on the cell pixel values of the adjacent cells corresponding to any one of the at least two table frame lines with the extension relation, so as to generate a corresponding frameless table image.
In one embodiment, when executed by a processor, the computer program implementing border line pixel value replacement for each table border line based on each cell pixel value to generate a corresponding borderless table image may include: and randomly determining a target frame line from the table frame lines determined by the table image, and replacing the pixel values of the pixel points of each frame line of the target frame line based on the cell pixel values of the pixel points of the cell in the adjacent cells of the target frame line to obtain a frameless table image.
In one embodiment, the computer program when executed by the processor to perform obtaining raw training data may include: acquiring original text data, wherein the original text data comprises a frame table; splitting each original text data according to the page number of the file to obtain each text page; recognizing text titles, text contents and text tables of the text pages, and determining coordinate positions corresponding to the text titles, the text contents and the text tables; based on each coordinate position, establishing text structure information corresponding to the original text data, and cutting form images from each text page; based on the tabular images and the text structure information, original training data is generated.
In one embodiment, the raw training data may further include the number of cells of the framed table in the table image and position index data between the cells.
In this embodiment, when executed by a processor, the implementing the generating the target training data according to the borderless form image and the text structure information of the corresponding framed form may include: target training data is generated based on the borderless table image, the text structure information, the number of cells of the borderless table, and the position index data between the cells.
In this embodiment, when executed by a processor, the training the constructed initial recovery model based on the target training data to obtain a trained borderless table recovery model may include: and training the constructed initial recovery model through target training data to obtain a trained frameless table recovery model.
In one embodiment, the computer program when executed by the processor may further implement the steps of: uploading at least one of the form image, the text structure information, the borderless form image and the target training data to a blockchain node for storage.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A borderless table recovery model training method is characterized by comprising the following steps:
acquiring original training data, wherein the original training data comprises a form image with a frame form in text data and text structure information;
identifying table border lines in the table image, and preprocessing each identified table border line to generate a borderless table image corresponding to the table image of the table with the border;
generating target training data according to the borderless form image and the text structure information of the corresponding framed form;
and training the constructed initial recovery model based on the target training data to obtain a trained frameless table recovery model.
2. The method of claim 1, wherein the identifying table border lines in the table image, and preprocessing each identified table border line to generate a borderless table image corresponding to the table image of the framed table, comprises:
identifying each pixel point in the form image, and determining each form frame line in the form image;
determining adjacent cells corresponding to each table border line based on each table border line;
obtaining cell pixel values of cell pixel points in each adjacent cell;
and carrying out pixel value replacement on the frame line pixel points of the frame lines of the forms based on the pixel values of the cells to generate corresponding frameless form images.
3. The method of claim 2, wherein after determining each table border line in the table image, further comprising:
determining whether an extension relationship exists between the table border lines;
when the extension relationship exists between the table border lines, determining that the at least two table border lines having the extension relationship are the same table border line;
the pixel value replacement of the frame line pixel points of the frame lines of the forms is performed based on the pixel values of the cells to generate corresponding frameless form images, and the method comprises the following steps:
and performing pixel value replacement on the frame line pixel points of at least two table frame lines determined as the same table frame line based on the cell pixel values of the adjacent cells corresponding to any one of the at least two table frame lines with the extension relationship to generate a corresponding frameless table image.
4. The method of claim 2, wherein the performing pixel value replacement on the border line pixel points of each table border line based on the pixel values of the cells to generate a corresponding borderless table image comprises:
and randomly determining a target frame line from the table frame lines determined by the table image, and replacing the pixel values of the pixel points of each frame line of the target frame line based on the cell pixel values of the pixel points of the cell in the adjacent cells of the target frame line to obtain a frameless table image.
5. The method of claim 1, wherein the obtaining raw training data comprises:
acquiring original text data, wherein the original text data comprises a frame table;
splitting each original text data according to the page number of the file to obtain each text page;
identifying a text title, a text content and a text table of each text page, and determining a coordinate position corresponding to each text title, text content and text table;
establishing text structure information corresponding to the original text data based on the coordinate positions, and cutting form images from the text pages;
generating original training data based on the form image and the text structure information.
6. The method of claim 1, wherein the raw training data further comprises a number of cells of the framed table in the table image and position index data between the cells;
generating target training data according to the borderless form image and the text structure information of the corresponding framed form, including:
generating target training data based on the borderless table image, the text structure information, the number of cells of the borderless table, and position index data between the cells;
the training of the constructed initial recovery model based on the target training data to obtain the trained frameless table recovery model comprises the following steps:
and training the constructed initial recovery model through the target training data to obtain a trained frameless table recovery model.
7. The method according to any one of claims 1 to 6, further comprising:
uploading at least one of the form image, the text structure information, the borderless form image, and the target training data to a blockchain node for storage.
8. A borderless table recovery model training apparatus, the apparatus comprising:
an original training data acquisition module, configured to acquire original training data, wherein the original training data comprises a table image of a framed table in text data and text structure information;
the borderless table image generation module is used for identifying the table border lines in the table image, preprocessing each identified table border line and generating a borderless table image corresponding to the table image of the table with the border;
the target training data generation module is used for generating target training data according to the borderless form image and the text structure information of the corresponding framed form;
and the training module is used for training the constructed initial recovery model based on the target training data to obtain a trained frameless table recovery model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
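
The data flow recited in claims 6 and 8 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the claims do not say how border lines are identified or preprocessed, so the sketch assumes a binary image in which a row or column is a border line when at least 90% of its pixels are dark, and it "preprocesses" a line by painting it with the background value. All names (`find_border_lines`, `erase_border_lines`, `TrainingSample`, `build_sample`, `min_ratio`) are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical representation: 1 = dark (ink) pixel, 0 = background.
Image = list[list[int]]


def find_border_lines(img: Image, min_ratio: float = 0.9) -> tuple[list[int], list[int]]:
    """Return the row and column indices that look like table border lines.

    Assumption: a row/column is a border line when at least `min_ratio`
    of its pixels are dark. Real documents would need a more robust detector.
    """
    height, width = len(img), len(img[0])
    rows = [y for y in range(height) if sum(img[y]) >= min_ratio * width]
    cols = [x for x in range(width)
            if sum(img[y][x] for y in range(height)) >= min_ratio * height]
    return rows, cols


def erase_border_lines(img: Image, rows: list[int], cols: list[int]) -> Image:
    """Copy the image and paint every detected line with the background value."""
    out = [row[:] for row in img]
    for y in rows:
        out[y] = [0] * len(out[y])
    for x in cols:
        for row in out:
            row[x] = 0
    return out


@dataclass
class TrainingSample:
    """One target training record, combining the items listed in claim 6."""
    borderless_image: Image
    text_structure: list[dict]
    cell_count: int
    cell_index: list[tuple[int, int]]  # (row, column) position of each cell


def build_sample(framed_img: Image, text_structure: list[dict],
                 cell_count: int, cell_index: list[tuple[int, int]]) -> TrainingSample:
    """Derive the borderless image and pair it with the framed table's labels."""
    rows, cols = find_border_lines(framed_img)
    borderless = erase_border_lines(framed_img, rows, cols)
    return TrainingSample(borderless, text_structure, cell_count, cell_index)


# Toy 2x2 framed table: full rows/columns are border lines, and the two
# isolated dark pixels stand in for cell text.
framed = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
sample = build_sample(
    framed,
    text_structure=[{"text": "A", "cell": (0, 0)}, {"text": "B", "cell": (1, 1)}],
    cell_count=4,
    cell_index=[(0, 0), (0, 1), (1, 0), (1, 1)],
)
```

Because the labels (text structure, cell count, position index) come from the framed original, each borderless image is annotated without manual labeling. A list of such `TrainingSample` records would then be fed to whatever initial recovery model is chosen.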
CN202111049186.6A 2021-09-08 2021-09-08 Borderless table recovery model training method, device, computer equipment and medium Pending CN113762158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111049186.6A CN113762158A (en) 2021-09-08 2021-09-08 Borderless table recovery model training method, device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN113762158A true CN113762158A (en) 2021-12-07

Family

ID=78793906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111049186.6A Pending CN113762158A (en) 2021-09-08 2021-09-08 Borderless table recovery model training method, device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN113762158A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565927A (en) * 2022-03-03 2022-05-31 上海恒生聚源数据服务有限公司 Table identification method and device, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163030A (en) * 2018-02-11 2019-08-23 鼎复数据科技(北京)有限公司 Method for extracting framed tables from PDFs based on image information
CN110334585A (en) * 2019-05-22 2019-10-15 平安科技(深圳)有限公司 Table recognition method, apparatus, computer equipment and storage medium
CN111832403A (en) * 2020-06-04 2020-10-27 北京百度网讯科技有限公司 Document structure recognition method, and model training method and device for document structure recognition
CN111860257A (en) * 2020-07-10 2020-10-30 上海交通大学 Table identification method and system fusing multiple text features and geometric information
CN112241730A (en) * 2020-11-21 2021-01-19 杭州投知信息技术有限公司 Form extraction method and system based on machine learning
US20210073326A1 (en) * 2019-09-06 2021-03-11 Wipro Limited System and method for extracting tabular data from a document
CN112528863A (en) * 2020-12-14 2021-03-19 中国平安人寿保险股份有限公司 Identification method and device of table structure, electronic equipment and storage medium
CN112949443A (en) * 2021-02-24 2021-06-11 平安科技(深圳)有限公司 Table structure identification method and device, electronic equipment and storage medium
CN113239818A (en) * 2021-05-18 2021-08-10 上海交通大学 Cross-modal information extraction method of tabular image based on segmentation and graph convolution neural network

Similar Documents

Publication Publication Date Title
CN109492643B (en) Certificate identification method and device based on OCR, computer equipment and storage medium
CN110334585B (en) Table identification method, apparatus, computer device and storage medium
CN111401371B (en) Text detection and identification method and system and computer equipment
CN109635838B (en) Face sample picture labeling method and device, computer equipment and storage medium
CN111626123A (en) Video data processing method and device, computer equipment and storage medium
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
CN110866491A (en) Target retrieval method, device, computer readable storage medium and computer equipment
CN110647885B (en) Test paper splitting method, device, equipment and medium based on picture identification
US11010543B1 (en) Systems and methods for table extraction in documents
CN111666931B (en) Mixed convolution text image recognition method, device, equipment and storage medium
CN110866457A (en) Electronic insurance policy obtaining method and device, computer equipment and storage medium
CN113837151A (en) Table image processing method and device, computer equipment and readable storage medium
CN112232336A (en) Certificate identification method, device, equipment and storage medium
CN113159013A (en) Paragraph identification method and device based on machine learning, computer equipment and medium
CN113806613B (en) Training image set generation method, training image set generation device, computer equipment and storage medium
CN110889341A (en) Form image recognition method and device based on AI (Artificial Intelligence), computer equipment and storage medium
CN113762158A (en) Borderless table recovery model training method, device, computer equipment and medium
CN111159167A (en) Labeling quality detection device and method
CN113901768A (en) Standard file generation method, device, equipment and storage medium
CN112464660B (en) Text classification model construction method and text data processing method
CN113868419A (en) Text classification method, device, equipment and medium based on artificial intelligence
CN116384344A (en) Document conversion method, device and storage medium
CN112528599B (en) XML-based multi-page document processing method, device, computer equipment and medium
CN112463791A (en) Nuclear power station document data acquisition method and device, computer equipment and storage medium
CN115223183A (en) Information extraction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211207