CN117807972A

CN117807972A - Method, device, equipment and medium for extracting form information in long document

Info

Publication number: CN117807972A
Application number: CN202410187438.9A
Authority: CN
Inventors: 李宽; 岳小龙; 章逸骋; 胡嘉杰; 纪传俊
Original assignee: Daguan Data Co ltd
Current assignee: Daguan Data Co ltd
Priority date: 2024-02-20
Filing date: 2024-02-20
Publication date: 2024-04-02

Abstract

The invention discloses a method, a device, equipment and a medium for extracting table information from a long document. The method comprises: acquiring a target long document from which information is to be extracted, and performing a document preprocessing operation on the target long document to obtain rich text information of the target long document; inputting the rich text information of the target long document into a pre-trained table classification model for recognition to obtain at least one target table; performing a feature extraction operation on each target cell of each target table through a preset cell row-and-column vector extraction method to obtain a target row characterization vector and a target column characterization vector corresponding to each target cell; and inputting the target row characterization vector and the target column characterization vector of each target cell into a pre-trained cell classification model for classification to obtain a table information extraction result, and feeding the table information extraction result back to a user. This solves the problem that table information in a long document cannot be extracted accurately, and improves the accuracy and efficiency of table information extraction from long documents.

Description

Method, device, equipment and medium for extracting form information in long document

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a medium for extracting table information in a long document.

Background

Long documents often contain a large amount of information, including structured or semi-structured data presented in tabular form. Such information may include important business, scientific or statistical data. However, the complexity of long documents and the diversity of table layouts make it increasingly difficult to extract the required table information from them efficiently.

In the course of implementing the present invention, the inventors found the following drawbacks in the prior art. Conventional document processing methods are limited when handling table information in long documents: manual extraction is time-consuming and laborious, and automated processing algorithms often cannot adapt to the diversity and complexity of long documents. One of the main problems facing current technology is how to extract the required table information accurately and efficiently from large-scale long documents. Because a long document may contain tables of various formats and structures, conventional approaches often cannot flexibly accommodate this diversity. In addition, the arrangement of a table, the cell structure within it and its surrounding context all vary, which makes it harder to extract information from the document correctly.

Disclosure of Invention

The invention provides a method, a device, equipment and a medium for extracting table information from a long document, so as to improve the accuracy and efficiency of table information extraction from long documents.

According to an aspect of the present invention, there is provided a method for extracting table information in a long document, including:

acquiring a target long document from which information is to be extracted, and performing a document preprocessing operation on the target long document to obtain rich text information of the target long document;

inputting the rich text information of the target long document into a pre-trained table classification model for recognition to obtain at least one target table;

performing a feature extraction operation on each target cell of each target table through a preset cell row-and-column vector extraction method to obtain a target row characterization vector and a target column characterization vector corresponding to each target cell;

and respectively inputting the target row characterization vector and the target column characterization vector corresponding to each target cell into a pre-trained cell classification model for classification processing to obtain a table information extraction result, and feeding back the table information extraction result to a user.

According to another aspect of the present invention, there is provided an apparatus for extracting table information in a long document, including:

a target long document rich text information determining module, configured to acquire a target long document from which information is to be extracted, and perform a document preprocessing operation on the target long document to obtain rich text information of the target long document;

a target table determining module, configured to input the rich text information of the target long document into a pre-trained table classification model for recognition to obtain at least one target table;

a target row characterization vector and target column characterization vector determining module, configured to perform a feature extraction operation on each target cell of each target table through a preset cell row-and-column vector extraction method to obtain a target row characterization vector and a target column characterization vector corresponding to each target cell;

and a table information extraction result determining module, configured to input the target row characterization vector and the target column characterization vector corresponding to each target cell into a pre-trained cell classification model for classification to obtain a table information extraction result, and feed the table information extraction result back to a user.

According to another aspect of the present invention, there is provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements a method for extracting table information in a long document according to any embodiment of the present invention when executing the computer program.

According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement a method for extracting table information in a long document according to any one of the embodiments of the present invention when executed.

According to the technical scheme of the embodiments of the present invention, a target long document from which information is to be extracted is acquired, and a document preprocessing operation is performed on it to obtain rich text information of the target long document; the rich text information of the target long document is input into a pre-trained table classification model for recognition to obtain at least one target table; a feature extraction operation is performed on each target cell of each target table through a preset cell row-and-column vector extraction method to obtain a target row characterization vector and a target column characterization vector corresponding to each target cell; and the target row characterization vector and the target column characterization vector of each target cell are input into a pre-trained cell classification model for classification to obtain a table information extraction result, which is fed back to a user. This solves the problem that table information in a long document cannot be extracted accurately, and improves the accuracy and efficiency of table information extraction from long documents.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a method for extracting table information in a long document according to a first embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a table information extraction device in a long document according to a second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "target," "current," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

Fig. 1 is a flowchart of a method for extracting table information in a long document according to a first embodiment of the present invention. The method may be performed by a table information extraction device in a long document and, in particular, by a new open platform; the table information extraction device in a long document may be implemented in hardware and/or software.

Accordingly, as shown in fig. 1, the method includes:

S110, acquiring a target long document from which information is to be extracted, and performing a document preprocessing operation on the target long document to obtain rich text information of the target long document.

The target long document may be a long document from which table information needs to be extracted.

Specifically, the rich text information of the target long document may include at least one of the following kinds of description information, each corresponding to a different document structure: title, paragraph, table, header, footer, picture and table of contents.

In this embodiment, the document preprocessing operation may use technologies such as optical character recognition, natural language processing, layout analysis, text analysis and table analysis to generate the rich text information of the target long document.
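As a minimal sketch, the rich text information produced by preprocessing could be held as an ordered list of typed blocks. The class names `Block` and `RichText`, the field names and the string labels below are assumptions made for illustration; the source does not prescribe a data layout, and the OCR and layout analysis steps themselves are not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Kinds of description information named in the text; the labels are illustrative.
BLOCK_KINDS = {"title", "paragraph", "table", "header", "footer", "picture", "toc"}


@dataclass
class Block:
    kind: str                                 # one of BLOCK_KINDS
    text: str = ""                            # text content, empty for tables and pictures
    level: int = 0                            # heading level for titles, 0 otherwise
    cells: Optional[List[List[str]]] = None   # cell texts row by row, tables only

    def __post_init__(self) -> None:
        if self.kind not in BLOCK_KINDS:
            raise ValueError(f"unknown block kind: {self.kind!r}")


@dataclass
class RichText:
    blocks: List[Block] = field(default_factory=list)

    def tables(self) -> List[int]:
        """Return the indices of all table blocks, in document order."""
        return [i for i, block in enumerate(self.blocks) if block.kind == "table"]


# Hypothetical preprocessing output for a short excerpt of a long document.
document = RichText([
    Block("title", "3 Financial data", level=1),
    Block("paragraph", "Revenue by year is listed below."),
    Block("table", cells=[["Item", "2022", "2023"], ["Revenue", "10", "12"]]),
    Block("footer", "Page 12"),
])
```

Keeping the blocks in document order preserves the context around each table (titles and neighbouring paragraphs), which the table classification step relies on.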

S120, inputting the rich text information of the target long document into a pre-trained table classification model for recognition to obtain at least one target table.

The table classification model may be a model capable of classifying tables. The target table may be one or more tables identified in the long document.

Optionally, inputting the rich text information of the target long document into a pre-trained table classification model for recognition to obtain at least one target table includes: extracting table features from each piece of description information in the rich text information of the target long document according to preset table classification feature information to obtain at least one table feature characterization vector, wherein the table classification feature information comprises at least one of the following: multi-level table titles, the paragraph text above the table, the paragraph text below the table, the row header text of the table, the column header text of the table, and the table content itself; and inputting each table feature characterization vector into the pre-trained table classification model for recognition to obtain at least one target table.

The table classification feature information may be feature information used for table classification; specifically, it may include multi-level table titles, the text of the paragraph preceding the table, the text of the paragraph following the table, the row header text of the table, the column header text of the table, and the content of the table itself. The table feature characterization vector may be a vector that characterizes table features, obtained by extracting features from the rich text information of the target long document.

In this embodiment, the characterization vectors of the different table features may be extracted by technologies such as word embedding, sentence embedding, LSTM (Long Short-Term Memory network) and CNN (Convolutional Neural Network), and which table features are used may be selected flexibly according to the document type. One or more target tables may then be identified by the table classification model.
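The six kinds of table classification feature information can be gathered from the blocks around a table before any embedding is applied. The sketch below is illustrative only: the function name `table_context_features`, the dict-based block layout, and the assumptions that the first row holds column headers and the first column holds row headers are not taken from the source. Turning the collected texts into characterization vectors (word embedding, LSTM, CNN and so on) is left out.

```python
def table_context_features(blocks, table_index):
    """Collect the textual features listed above for the table at blocks[table_index].

    `blocks` is a list of dicts with a "kind" key plus "text", "level" or "cells"
    as appropriate (a hypothetical layout, not defined by the source).
    """
    cells = blocks[table_index]["cells"]

    # Multi-level titles: walk backwards and keep the nearest title of each higher level.
    titles, lowest_level = [], None
    for block in reversed(blocks[:table_index]):
        if block["kind"] == "title" and (lowest_level is None or block["level"] < lowest_level):
            titles.insert(0, block["text"])
            lowest_level = block["level"]

    def nearest_paragraph(candidates):
        for block in candidates:
            if block["kind"] == "paragraph":
                return block["text"]
            if block["kind"] in ("title", "table"):
                break  # assumption: do not look past a section or table boundary
        return ""

    return {
        "titles": titles,
        "paragraph_above": nearest_paragraph(reversed(blocks[:table_index])),
        "paragraph_below": nearest_paragraph(blocks[table_index + 1:]),
        "row_headers": [row[0] for row in cells if row],     # assumes first column = row headers
        "column_headers": list(cells[0]) if cells else [],   # assumes first row = column headers
        "table_text": " ".join(text for row in cells for text in row),
    }


blocks = [
    {"kind": "title", "text": "3 Financial data", "level": 1},
    {"kind": "title", "text": "3.1 Revenue", "level": 2},
    {"kind": "paragraph", "text": "Revenue by year is listed below."},
    {"kind": "table", "cells": [["Item", "2022", "2023"], ["Revenue", "10", "12"]]},
    {"kind": "paragraph", "text": "Figures are in millions."},
]
features = table_context_features(blocks, 3)
```

Each entry of the returned dict would be encoded separately, so a document type that lacks, say, table titles can simply omit that feature.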

S130, performing a feature extraction operation on each target cell of each target table through a preset cell row-and-column vector extraction method to obtain a target row characterization vector and a target column characterization vector corresponding to each target cell.

The cell row-and-column vector extraction method may be a method of extracting row and column vectors for each cell in the target table. The target row characterization vector may be a vector obtained by extracting the row features of the target cell. The target column characterization vector may be a vector obtained by extracting the column features of the target cell.

In this embodiment, row-and-column vector extraction is performed on each cell in the target table to obtain the target row characterization vector and the target column characterization vector.

Specifically, for each target cell in the target table, the text sequence of the cells in the entire row containing the target cell, together with the position of the target cell within that row, is taken as the row feature of the target cell; the text sequence of the cells in the entire column containing the target cell, together with the position of the target cell within that column, is taken as the column feature of the target cell. The row characterization vector and the column characterization vector of the target cell are then extracted by technologies such as word embedding, sentence embedding and LSTM (Long Short-Term Memory network).
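The row and column features described above can be sketched as follows. The function name `cell_row_column_features` and the rectangular list-of-lists table are assumptions for illustration; the subsequent encoding of each text sequence into a characterization vector is not shown.

```python
def cell_row_column_features(table, r, c):
    """Return the row feature and column feature of the cell at row r, column c.

    Each feature pairs a text sequence (every cell in the row or column)
    with the position of the target cell inside that sequence.
    Assumes a rectangular table given as a list of rows of cell texts.
    """
    row_texts = list(table[r])                # all cell texts in the target cell's row
    column_texts = [row[c] for row in table]  # all cell texts in the target cell's column
    return {
        "row": (row_texts, c),        # position within the row is the column index
        "column": (column_texts, r),  # position within the column is the row index
    }


table = [
    ["Item", "2022", "2023"],
    ["Revenue", "10", "12"],
    ["Cost", "4", "5"],
]
features = cell_row_column_features(table, 1, 2)  # the cell containing "12"
```

Because the whole row and the whole column are kept, the header texts ("Revenue" and "2023" here) travel with the cell and give it the context needed for classification.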

S140, inputting the target row characterization vector and the target column characterization vector corresponding to each target cell into a pre-trained cell classification model for classification to obtain a table information extraction result, and feeding the table information extraction result back to a user.

The cell classification model may be a model that classifies cells. The table information extraction result may be the table information extracted from the long document.

In this embodiment, each target cell of the long document is classified by the cell classification model. Classifying on the basis of the target row characterization vector and the target column characterization vector of each target cell makes the resulting table information extraction result more accurate, and the result is then fed back to the user.
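One way to picture this step is to classify every cell and group the kept cell texts by predicted label. In the sketch below the trained cell classification model is replaced by a rule-based stand-in (`toy_classifier`), and raw row and column features stand in for the characterization vectors; the label names, the `"other"` label for cells that are not extracted, and the function names are all hypothetical.

```python
def extract_table_information(table, classify_cell, ignore_label="other"):
    """Classify every cell and group the kept cell texts by predicted label."""
    result = {}
    for r, row in enumerate(table):
        for c, text in enumerate(row):
            row_feature = (list(row), c)                       # row texts + position in row
            column_feature = ([line[c] for line in table], r)  # column texts + position in column
            label = classify_cell(row_feature, column_feature)
            if label != ignore_label:
                result.setdefault(label, []).append(text)
    return result


def toy_classifier(row_feature, column_feature):
    """Rule-based stand-in for the trained model: label data cells by their headers."""
    row_texts, c = row_feature
    column_texts, r = column_feature
    if r == 0 or c == 0:
        return "other"  # header cells are not extracted
    if row_texts[0] == "Revenue":
        return "revenue_" + column_texts[0]
    return "other"


table = [
    ["Item", "2022", "2023"],
    ["Revenue", "10", "12"],
    ["Cost", "4", "5"],
]
extracted = extract_table_information(table, toy_classifier)
```

The grouped dictionary is one possible shape for the table information extraction result that is fed back to the user.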

Before this, the table classification model and the cell classification model also need to be trained. The training processes of the two models are described below.

For the training process of the table classification model, before acquiring the target long document from which information is to be extracted and performing the document preprocessing operation on the target long document to obtain the rich text information of the target long document, the method further comprises: acquiring a plurality of history long documents and at least one history target table corresponding to each history long document; performing a document preprocessing operation on each history long document to obtain rich text information of the history long document; inputting the rich text information of the history long document into an initial table classification model for recognition to obtain a model output target table; comparing the model output target table with the history target table to obtain a model accuracy comparison result; and if the model accuracy comparison result is a model training completion result, determining that training of the table classification model is completed.

The history long document may be a long document obtained at a historical time. The history target table may be one or more tables identified in a history long document. The rich text information of the history long document may be the rich text information obtained by performing document preprocessing on the history long document. The model output target table may be the at least one target table obtained by processing the history long document with the initial table classification model. The model accuracy comparison result may be either a model training completion result or a model training incomplete result.

In this embodiment, once a history long document has been acquired, the history target table corresponding to it can also be acquired. At the same time, the initial table classification model can identify the target tables in the history long document to produce the model output target table. The model accuracy comparison result is then obtained by comparing the history target table with the model output target table, and this result determines whether training of the table classification model is completed.

Optionally, after comparing the model output target table with the history target table to obtain the model accuracy comparison result, the method further includes: if the model accuracy comparison result is not the model training completion result, returning to the operation of acquiring a plurality of history long documents, until the model accuracy comparison result is the model training completion result, and then determining that training of the table classification model is completed.

In this embodiment, if the model accuracy comparison result is a model training completion result, training of the table classification model is determined to be completed. Otherwise training is not yet completed, so the method returns to acquiring history long documents and retrains the table classification model until a trained table classification model is obtained.

Optionally, comparing the model output target table with the history target table to obtain the model accuracy comparison result includes: comparing the model output target table with the history target table and calculating a model output accuracy; acquiring a preset table classification model accuracy threshold and judging whether the model output accuracy meets the threshold; if so, determining that the model accuracy comparison result is a model training completion result; if not, determining that the model accuracy comparison result is a model training incomplete result.

The model output accuracy may be the accuracy of the target tables output by the table classification model. The table classification model accuracy threshold may be a preset threshold for the accuracy of the table classification model.

In this embodiment, the model output accuracy is obtained by comparing the target tables output by the table classification model with the history target tables corresponding to the history long document.

For example, suppose the calculated model output accuracy is a and the table classification model accuracy threshold is b. If a is greater than (or equal to) b, the model accuracy comparison result is determined to be a model training completion result; if a is smaller than b, it is determined to be a model training incomplete result.
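The comparison of a against b can be sketched as below. The source does not say how the model output accuracy is computed, so the per-table agreement measure used here is an assumption, as are the function names and the result strings; equality is treated as meeting the threshold.

```python
def model_output_accuracy(predicted_targets, annotated_targets, candidate_tables):
    """Share of candidate tables on which model and annotation agree (target or not).

    Assumed definition: the source only states that an accuracy is calculated.
    """
    if not candidate_tables:
        raise ValueError("no candidate tables to compare")
    agreements = sum(
        (table_id in predicted_targets) == (table_id in annotated_targets)
        for table_id in candidate_tables
    )
    return agreements / len(candidate_tables)


def accuracy_comparison_result(a, b):
    """Compare model output accuracy a with threshold b; a == b counts as meeting it."""
    return "training complete" if a >= b else "training incomplete"


# Hypothetical history long document with four candidate tables.
candidates = ["t1", "t2", "t3", "t4"]
model_output = {"t1", "t3"}     # tables the initial model marked as targets
history_targets = {"t1", "t4"}  # annotated history target tables
a = model_output_accuracy(model_output, history_targets, candidates)
result = accuracy_comparison_result(a, b=0.9)
```

Here the model agrees with the annotation on t1 and t2 only, so a falls below b and another round of training would follow.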

For the training process of the cell classification model, before acquiring the target long document from which information is to be extracted and performing the document preprocessing operation on the target long document, the method further comprises: acquiring a plurality of history long documents, at least one history target table corresponding to each history long document, and a standard table information extraction result corresponding to each history target table in each history long document; performing a feature extraction operation on each history target cell of each history target table through the preset cell row-and-column vector extraction method to obtain a history target row characterization vector and a history target column characterization vector corresponding to each history target cell; inputting the history target row characterization vector and the history target column characterization vector corresponding to each history target cell into an initial cell classification model for classification to obtain a model table information extraction result; calculating a model classification processing accuracy according to the standard table information extraction result and the model table information extraction result; and determining whether training of the cell classification model is completed according to the model classification processing accuracy and a preset classification processing accuracy threshold.

The standard table information extraction result may be the table information extraction result corresponding to the history long document; it may be stored in advance and may be obtained through manual analysis. The standard table information extraction result is obtained so that it can be compared with the table information extraction result output by the model in order to calculate the model classification processing accuracy.

The cell row-and-column vector extraction method may be a method of extracting row and column vectors for cells. The history target row characterization vector may be a vector describing the row features of a history target cell in a history target table of a history long document. The history target column characterization vector may be a vector describing the column features of such a cell. The model table information extraction result may be the result output by the initial cell classification model. The model classification processing accuracy may be a value describing how accurately the initial cell classification model outputs table information extraction results. The classification processing accuracy threshold may be the accuracy that the model is required to reach.

In this embodiment, feature extraction may be performed on the cells of the history target table to obtain the history target row characterization vectors and history target column characterization vectors, which are then classified by the initial cell classification model to obtain the model table information extraction result.

Further, the model classification accuracy is calculated by comparing the standard table information extraction result with the model table information extraction result. Suppose the model classification accuracy is c; it must then be compared with the classification accuracy threshold d to judge whether training of the cell classification model is completed.

Optionally, determining whether training of the cell classification model is completed according to the preset classification processing accuracy threshold includes: judging whether the model classification accuracy reaches the classification accuracy threshold; if so, determining that training of the cell classification model is completed; if not, returning to the operation of acquiring a plurality of history long documents until the model classification accuracy reaches the classification accuracy threshold, and then determining that training of the cell classification model is completed.

Continuing the previous example, if the model classification accuracy c is greater than or equal to the classification accuracy threshold d, the accuracy requirement for cell classification is met, and training of the cell classification model is determined to be completed.

Further, if the model classification accuracy c is smaller than the classification accuracy threshold d, the accuracy requirement for cell classification is not met, so more history long documents need to be acquired and the cell classification model needs to be retrained.
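The acquire, train, evaluate and compare cycle can be sketched as a loop. Everything named below is hypothetical: `train_until_threshold`, its callbacks, and the dictionary that memorises header pairs as a stand-in for a real cell classification model. The `max_rounds` limit is an added safeguard that the source does not mention, so that the loop always terminates.

```python
def train_until_threshold(fetch_samples, fit, evaluate, d, max_rounds=10):
    """Repeat acquire -> train -> evaluate until accuracy c reaches threshold d.

    Returns (rounds_used, last_accuracy, completed).
    """
    accuracy = 0.0
    for round_number in range(1, max_rounds + 1):
        fit(fetch_samples(round_number))  # acquire history samples, then (re)train
        accuracy = evaluate()             # model classification accuracy c
        if accuracy >= d:
            return round_number, accuracy, True
    return max_rounds, accuracy, False


# Toy stand-in for a cell classifier: memorises (row header, column header) -> label.
memory = {}
# Standard (manually prepared) extraction results used for evaluation.
standard = {("Revenue", "2023"): "revenue", ("Cost", "2023"): "cost"}
# History samples that become available in each round.
batches = {
    1: [(("Revenue", "2023"), "revenue")],
    2: [(("Cost", "2023"), "cost")],
}


def fetch_samples(round_number):
    return batches.get(round_number, [])


def fit(samples):
    memory.update(samples)  # each sample is a (key, label) pair


def evaluate():
    correct = sum(memory.get(key) == label for key, label in standard.items())
    return correct / len(standard)


rounds, c, completed = train_until_threshold(fetch_samples, fit, evaluate, d=0.9)
```

In this toy run the first round reaches c = 0.5, which is below d = 0.9, so a second batch of history samples is acquired before training is judged complete.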

Example 2

Fig. 2 is a schematic structural diagram of a table information extraction device in a long document according to a second embodiment of the present invention. The table information extraction device in a long document provided in this embodiment can be implemented through software and/or hardware, and can be configured in a terminal device or a server to implement the table information extraction method in a long document of the embodiments of the invention. As shown in fig. 2, the apparatus includes: a target long document rich text information determining module 210, a target table determining module 220, a target row characterization vector and target column characterization vector determining module 230, and a table information extraction result determining module 240.

The target long document rich text information determining module 210 is configured to acquire a target long document from which information is to be extracted, and perform a document preprocessing operation on the target long document to obtain rich text information of the target long document;

the target table determining module 220 is configured to input the rich text information of the target long document into a pre-trained table classification model for recognition to obtain at least one target table;

the target row characterization vector and target column characterization vector determining module 230 is configured to perform a feature extraction operation on each target cell of each target table through a preset cell row-and-column vector extraction method to obtain a target row characterization vector and a target column characterization vector corresponding to each target cell;

the table information extraction result determining module 240 is configured to input the target row characterization vector and the target column characterization vector corresponding to each target cell into a pre-trained cell classification model for classification to obtain a table information extraction result, and feed the table information extraction result back to a user.

On the basis of the above embodiments, the rich text information of the target long document includes at least one of the following kinds of description information: title, paragraph, table, header, footer, picture and table of contents.

On the basis of the above embodiments, the target table determining module 220 may be specifically configured to: extract table features from each piece of description information in the rich text information of the target long document according to preset table classification feature information to obtain at least one table feature characterization vector, wherein the table classification feature information comprises at least one of the following: multi-level table titles, the paragraph text above the table, the paragraph text below the table, the row header text of the table, the column header text of the table, and the table content itself; and input each table feature characterization vector into the pre-trained table classification model for recognition to obtain at least one target table.

Based on the above embodiments, the table classification model training module may be specifically configured to: before the target long document from which information is to be extracted is acquired and the document preprocessing operation is performed on it to obtain the rich text information of the target long document, acquire a plurality of history long documents and at least one history target table corresponding to each history long document; perform a document preprocessing operation on each history long document to obtain rich text information of the history long document; input the rich text information of the history long document into an initial table classification model for recognition to obtain a model output target table; compare the model output target table with the history target table to obtain a model accuracy comparison result; and if the model accuracy comparison result is a model training completion result, determine that training of the table classification model is completed.

Based on the above embodiments, the table classification model training module may be further specifically configured to: after the model output target table is compared with the history target table to obtain the model accuracy comparison result, if the model accuracy comparison result is not the model training completion result, return to the operation of acquiring a plurality of history long documents until the model accuracy comparison result is the model training completion result, and then determine that training of the table classification model is completed.

Based on the above embodiments, the table classification model training module may be further specifically configured to: compare the model output target table with the history target table and calculate a model output accuracy; acquire a preset table classification model accuracy threshold and judge whether the model output accuracy meets the threshold; if so, determine that the model accuracy comparison result is a model training completion result; if not, determine that the model accuracy comparison result is a model training incomplete result.

Based on the above embodiments, the cell classification model training module may be specifically configured to: before the target long document from which information is to be extracted is acquired and the document preprocessing operation is performed on it to obtain the rich text information of the target long document, acquire a plurality of historical long documents, at least one historical target table corresponding to each historical long document, and a standard table information extraction result corresponding to each historical target table in each historical long document; perform a feature extraction operation on each historical target cell of each historical target table through a preset cell row-column vector extraction method to obtain a historical target row characterization vector and a historical target column characterization vector corresponding to each historical target cell; input the historical target row characterization vector and the historical target column characterization vector corresponding to each historical target cell into an initial cell classification model for classification processing to obtain a model table information extraction result; calculate the model classification processing accuracy from the standard table information extraction result and the model table information extraction result; and acquire a preset classification processing accuracy threshold and determine, according to that threshold, whether training of the cell classification model is completed.

Based on the above embodiments, the cell classification model training module may be further specifically configured to: judge whether the model classification processing accuracy reaches the classification processing accuracy threshold; if so, determine that training of the cell classification model is completed; if not, return to the operation of acquiring a plurality of historical long documents and repeat until the model classification processing accuracy reaches the threshold, at which point training of the cell classification model is determined to be completed.
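The repeat-until-threshold behaviour can be sketched as a simple loop. The callables `acquire_batch` and `train_and_evaluate` are hypothetical placeholders for acquiring historical long documents and for one round of training plus accuracy calculation. The `max_rounds` guard is an assumption added here so that the loop always terminates; the embodiment itself describes no such limit.

```python
from typing import Callable, List, Tuple


def train_until_threshold(
    acquire_batch: Callable[[], object],
    train_and_evaluate: Callable[[object], float],
    threshold: float,
    max_rounds: int = 100,  # assumption: safety limit not described in the embodiment
) -> Tuple[bool, List[float]]:
    """Repeat acquisition and training until the accuracy reaches the threshold.

    Returns (completed, accuracy_history).
    """
    history: List[float] = []
    for _ in range(max_rounds):
        batch = acquire_batch()                 # acquire historical long documents
        accuracy = train_and_evaluate(batch)    # train, then measure accuracy
        history.append(accuracy)
        if accuracy >= threshold:               # threshold reached: training completed
            return True, history
    return False, history                       # stopped by the safety limit
```

For example, with per-round accuracies 0.5, 0.7 and 0.92 and a threshold of 0.9, the loop stops after the third round and reports training as completed.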

The table information extraction device in the long document provided by the embodiment of the invention can execute the table information extraction method in the long document provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example III

Fig. 3 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement a third embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 3, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.

Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the table information extraction method in a long document.

In some embodiments, a method of table information extraction in long documents may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the table information extraction method in the long document described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform a form information extraction method in a long document in any other suitable manner (e.g., by means of firmware).

The method comprises the following steps: acquiring a target long document from which information is to be extracted, and performing a document preprocessing operation on the target long document to obtain rich text information of the target long document; inputting the rich text information of the target long document into a pre-trained table classification model for recognition to obtain at least one target table; performing a feature extraction operation on each target cell of each target table through a preset cell row-column vector extraction method to obtain a target row characterization vector and a target column characterization vector corresponding to each target cell; and inputting the target row characterization vector and the target column characterization vector corresponding to each target cell into a pre-trained cell classification model for classification processing to obtain a table information extraction result, and feeding back the table information extraction result to a user.
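The flow of these steps can be sketched as follows, starting after preprocessing, i.e. with tables already parsed out of the rich text information. Everything here is an illustrative assumption rather than the embodiment's implementation: the `Table` structure, the convention that the first row holds column headers and the first column holds row headers, the derivation of the row and column characterization vectors from those header texts, and the three pluggable callables standing in for the trained models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Table:
    title: str
    rows: List[List[str]]  # assumption: rows[0] holds column headers, row[0] the row header


def extract_table_information(
    tables: List[Table],
    is_target_table: Callable[[Table], bool],      # stands in for the table classification model
    embed: Callable[[str], Vector],                # stands in for the vector extraction method
    classify_cell: Callable[[Vector, Vector], str],  # stands in for the cell classification model
) -> List[Dict[str, object]]:
    """Classify every data cell of every target table from its row and column vectors."""
    results: List[Dict[str, object]] = []
    for table in tables:
        if not table.rows or not is_target_table(table):
            continue                                # skip empty and non-target tables
        header = table.rows[0]
        for r, row in enumerate(table.rows[1:], start=1):
            if not row:
                continue
            row_vec = embed(row[0])                 # row characterization vector
            for c in range(1, min(len(row), len(header))):
                col_vec = embed(header[c])          # column characterization vector
                results.append({
                    "table": table.title, "row": r, "col": c,
                    "value": row[c], "label": classify_cell(row_vec, col_vec),
                })
    return results
```

Passing the models as callables keeps the sketch independent of any particular model library; in practice each callable would wrap a trained model.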

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS services.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Example IV

A fourth embodiment of the present invention also provides a computer-readable storage medium containing computer-readable instructions which, when executed by a computer processor, perform a method of table information extraction in a long document, the method comprising: acquiring a target long document from which information is to be extracted, and performing a document preprocessing operation on the target long document to obtain rich text information of the target long document; inputting the rich text information of the target long document into a pre-trained table classification model for recognition to obtain at least one target table; performing a feature extraction operation on each target cell of each target table through a preset cell row-column vector extraction method to obtain a target row characterization vector and a target column characterization vector corresponding to each target cell; and inputting the target row characterization vector and the target column characterization vector corresponding to each target cell into a pre-trained cell classification model for classification processing to obtain a table information extraction result, and feeding back the table information extraction result to a user.

Of course, in the computer-readable storage medium provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above, and may also perform the related operations in the table information extraction method in the long document provided in any embodiment of the present invention.

From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and the necessary general-purpose hardware, and of course also by means of hardware alone, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory, a hard disk or an optical disk of a computer, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method according to the embodiments of the present invention.

It should be noted that, in the embodiment of the table information extraction device in a long document, each unit and module included in the device are only divided according to the functional logic, but not limited to the above division, so long as the corresponding function can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.

Claims

1. A method for extracting form information in a long document, comprising:

2. The method of claim 1, wherein the target long document rich text information includes at least one of the following descriptive information: headings, paragraphs, tables, headers, footers, pictures, and directories;

wherein inputting the rich text information of the target long document into a pre-trained form classification model for recognition to obtain at least one target form comprises:

acquiring and extracting form features of each description information in the rich text information of the target long document according to preset form classification feature information to obtain at least one form feature characterization vector;

wherein the form classification feature information comprises at least one of the following: multi-level form titles, paragraph text above the form, paragraph text below the form, row header text of the form, column header text of the form, and the form content itself;

and inputting each form characteristic representation vector into a pre-trained form classification model for recognition to obtain at least one target form.

3. The method according to claim 2, wherein before acquiring the target long document from which information is to be extracted and performing the document preprocessing operation on the target long document to obtain the rich text information of the target long document, the method further comprises:

acquiring a plurality of history long documents and at least one history target table corresponding to each history long document respectively;

performing document preprocessing operation on each history long document to obtain rich text information of the history long document;

inputting the rich text information of the history long document into an initial form classification model for recognition to obtain a model output target form;

comparing the model output target table with the historical target table to obtain a model accuracy comparison result;

and if the model accuracy comparison result is a model training completion result, determining that the training of the form classification model is completed.

4. A method according to claim 3, further comprising, after comparing the model output target table with the history target table to obtain a model accuracy comparison result:

and if the model accuracy rate comparison result is not the model training completion result, returning to execute the operation of acquiring a plurality of history long documents until the model accuracy rate comparison result is the model training completion result, and determining that the training of the form classification model is completed.

5. The method of claim 4, wherein comparing the model output target table with the historical target table to obtain a model accuracy comparison result comprises:

comparing the model output target table with the history target table, and calculating to obtain model output accuracy;

acquiring a preset accuracy threshold of a form classification model, judging whether the model output accuracy meets the accuracy threshold of the form classification model, if so, determining that a model accuracy comparison result is a model training completion result;

if not, determining the model accuracy comparison result as a model training incomplete result.

6. The method according to claim 1, wherein before acquiring the target long document from which information is to be extracted and performing the document preprocessing operation on the target long document to obtain the rich text information of the target long document, the method further comprises:

acquiring a plurality of history long documents, at least one history target table corresponding to each history long document respectively, and a standard table information extraction result corresponding to each history target table in each history long document;

performing feature extraction operation on each historical target cell corresponding to each historical target table through a preset cell row-column vector extraction method to obtain a historical target row characterization vector and a historical target column characterization vector corresponding to each historical target cell;

respectively inputting the historical target row characterization vector and the historical target column characterization vector corresponding to each historical target cell into an initial cell classification model for classification processing to obtain a model table information extraction result;

calculating the accuracy of model classification processing according to the standard form information extraction result and the model form information extraction result to obtain the accuracy of model classification processing;

and acquiring and determining whether training of the cell classification model is completed according to a preset classification processing accuracy threshold.

7. The method of claim 6, wherein the acquiring and determining whether training of the cell classification model is complete based on a pre-set classification process accuracy threshold comprises:

judging whether the model classification accuracy reaches the classification accuracy threshold, if so, determining that training is completed on the cell classification model;

and if not, returning to execute the operation of acquiring a plurality of history long documents until the model classification accuracy reaches the classification accuracy threshold, and then determining that training of the cell classification model is completed.

8. A form information extraction apparatus in a long document, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a method for extracting form information in a long document according to any one of claims 1-7 when executing the computer program.

10. A computer readable storage medium storing computer instructions for causing a processor to implement a method of extracting table information in a long document according to any one of claims 1-7 when executed.