CN114328536A - Table processing method and system - Google Patents


Publication number: CN114328536A
Authority: CN (China)
Prior art keywords: row, processed, header, cell, column
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202111659254.0A
Other languages: Chinese (zh)
Inventors: 徐阿龙, 陶志伟
Current Assignee: Hithink Royalflush Information Network Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Hithink Royalflush Information Network Co Ltd
Application filed by Hithink Royalflush Information Network Co Ltd
Priority to CN202111659254.0A
Publication of CN114328536A

Abstract

The embodiments of this specification provide a table processing method and system. The method comprises the following steps: acquiring a table to be processed; processing the table to be processed based on a header detection model, and determining a row header and/or a column header of the table to be processed; processing the row header based on a header classification model to determine classification results of the columns in the table to be processed, and/or processing the column header based on the header classification model to determine classification results of the rows in the table to be processed; and extracting from the table to be processed based on the classification results of the columns and/or rows, and determining a first extraction result.

Description

Table processing method and system
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a method and system for processing a table.
Background
A table can store various information (e.g., a company's financial data) in a structured manner. A user can extract the information they need from a table through limiting conditions, natural-language query conditions, and the like. However, when there are many tables or the tables are large, the user cannot precisely locate the desired table in a short time by these methods, and the efficiency of extracting the needed information is low.
Accordingly, it is desirable to provide a table processing method and system that can accurately locate a desired table and improve the efficiency of table extraction.
Disclosure of Invention
One of the embodiments of the present specification provides a table processing method, including: acquiring a table to be processed; processing the table to be processed based on a header detection model, and determining a row header and/or a column header of the table to be processed; processing the row header based on a header classification model to determine classification results of the columns in the table to be processed, and/or processing the column header based on the header classification model to determine classification results of the rows in the table to be processed; and extracting from the table to be processed based on the classification results of the columns and/or rows, and determining a first extraction result.
One of the embodiments of the present specification provides a table processing system, including: an acquisition module configured to acquire a table to be processed; a first determining module configured to process at least one row and/or at least one column of the table to be processed based on a header detection model, and determine a row header and/or a column header of the table to be processed; a second determining module configured to process the row header based on a header classification model to determine classification results of the columns in the table to be processed, and/or process the column header based on the header classification model to determine classification results of the rows in the table to be processed; and an extraction module configured to extract from the table to be processed based on the classification results of the columns and/or rows, and determine a first extraction result.
One of the embodiments of the present specification provides a table processing apparatus, which includes at least one processor and at least one memory; the at least one memory is for storing computer instructions; the at least one processor is configured to execute at least a portion of the computer instructions to implement the table processing method described in any of the above embodiments.
One of the embodiments of the present specification provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement a table processing method as described in any one of the above embodiments.
Drawings
The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:
FIG. 1 is an exemplary block diagram of a table processing system according to some embodiments of the present description;
FIG. 2 is an exemplary flow diagram of a table processing method according to some embodiments of the present description;
FIG. 3A is a schematic diagram of a table to be processed according to some embodiments of the present description;
FIG. 3B is yet another schematic diagram of a table to be processed according to some embodiments of the present description;
FIG. 4 is yet another exemplary flow diagram of a table processing method according to some embodiments of the present description;
FIG. 5 is an exemplary flow diagram illustrating the determination of a table to be processed according to some embodiments of the present description;
FIG. 6 is a schematic diagram illustrating detection of a row header according to some embodiments of the present description;
FIG. 7 is a schematic diagram illustrating detection of a column header according to some embodiments of the present description;
FIG. 8 is a schematic diagram illustrating determination of a classification result for a column according to some embodiments of the present description;
FIG. 9 is a schematic diagram illustrating determination of a classification result for a row according to some embodiments of the present description;
FIG. 10A is a schematic diagram of a mask matrix according to some embodiments of the present description;
FIG. 10B is yet another schematic diagram of a mask matrix according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "apparatus", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, portions or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a," "an," and "the" may also include plural forms unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.
FIG. 1 is an exemplary block diagram of a table processing system 100 according to some embodiments of the present description.

In some embodiments, the table processing system 100 may include an acquisition module 110, a first determination module 120, a second determination module 130, and an extraction module 140.
The obtaining module 110 may be configured to obtain a table to be processed. For more details of the table to be processed, refer to fig. 2 and the related description thereof, which are not repeated herein.
The first determining module 120 may be configured to process at least one row and/or at least one column in the table to be processed based on the header detection model, and determine a row header and/or a column header of the table to be processed. For more details on the header detection model, the row header, and the column header, refer to fig. 2 and its related description, which are not repeated herein.
The second determining module 130 may be configured to process the row header and/or the column header based on the header classification model, and determine a classification result for each column and/or each row in the table to be processed. For more details on the header classification model and the classification results, refer to fig. 2 and its related description, which are not repeated herein.
The extraction module 140 may be configured to extract the table to be processed based on the classification result of each column and/or each row in the table to be processed, and determine a first extraction result. For more details of the first extraction result, refer to fig. 2 and the related description thereof, which are not repeated herein.
In some embodiments, the table processing system 100 may also include a third determination module 150 and a cell extraction module 160.
The third determining module 150 may process the words of the cells in the table to be processed based on the text classification model, and determine the types of the words of the cells in the table to be processed. For more details on the text classification model, the words of the cells, and the types of the words of the cells, reference is made to fig. 4 and the related description thereof, which are not repeated herein.
The cell extraction module 160 may extract the table to be processed based on the type of the word of the cell in the table to be processed, and determine a second extraction result. For more details of the second extraction result, refer to fig. 4 and its related description, which are not repeated herein.
It should be understood that the system and its modules shown in FIG. 1 may be implemented in a variety of ways.
It should be noted that the above description of the modules is only for convenience of description and does not limit the present disclosure to the scope of the illustrated embodiments. It will be appreciated that, having understood the principle of the system, those skilled in the art may combine the modules arbitrarily or form a subsystem connected with other modules without departing from this principle. In some embodiments, the modules disclosed in fig. 1 may be different modules in one system, or one module may implement the functions of two or more of the modules described above. For example, the modules may share one storage module, or each module may have its own storage module. Such variations are within the scope of the present disclosure.
FIG. 2 is an exemplary flow diagram of a table processing method according to some embodiments of the present description. As shown in fig. 2, the process 200 includes the following steps.
Step 210, obtaining a table to be processed. In some embodiments, step 210 may be performed by acquisition module 110.
The table to be processed may refer to a table that needs to be extracted. There may be information in the table to be processed that is of interest to the user. In some embodiments, the table to be processed may be obtained in a variety of ways. For example, the table to be processed may be a table generated by the user and stored in the processor, and may be directly obtained from the processor. As another example, the table to be processed may also be obtained from the network.
In some embodiments, the table to be processed may also be obtained based on the initial table and its title. For more details on obtaining the table to be processed based on the initial table and the title thereof, refer to fig. 5 and the related description thereof, which are not repeated herein.
In some embodiments, an initial table may also be obtained and then cleaned: redundant rows and/or columns in the initial table are deleted, and the table to be processed is determined. Redundant rows and/or columns refer to blank rows and/or columns. For example, the column numbers of all non-empty cells may be collected and de-duplicated to obtain the minimum subset of column numbers containing non-empty cells; the columns containing only empty cells are determined based on this subset and removed; a rearrangement index is obtained; the remaining columns are rearranged to obtain the cleaned table; and the cleaned table is used as the table to be processed. For more about the initial table, refer to fig. 5 and its related description; for more about column numbers, refer to fig. 8 and its related description, which are not repeated herein.
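The cleaning step described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the patent's implementation: the function name is an assumption, and a table is represented as a list of rows of cell strings.

```python
def clean_table(table):
    """Remove redundant (blank) rows and columns, as in the cleaning step above."""
    # Keep only rows that contain at least one non-empty cell.
    rows = [row for row in table if any(cell.strip() for cell in row)]
    if not rows:
        return []
    # Collect the de-duplicated set of column numbers that hold a non-empty
    # cell anywhere in the table (the "minimum subset" of column numbers).
    keep = sorted({j for row in rows for j, cell in enumerate(row) if cell.strip()})
    # Rearrange each row so only the kept columns remain.
    return [[row[j] for j in keep] for row in rows]
```

Applied to a table with a blank middle row and blank middle column, this yields the compacted table that would serve as the table to be processed.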
As shown in fig. 3B, the table may be cleaned by deleting its blank second row and second column, which yields the cleaned table shown in fig. 3A; the cleaned table may then be used as the table to be processed.
Step 220, processing the table to be processed based on the header detection model, and determining the row header and/or column header of the table to be processed. In some embodiments, step 220 may be performed by the first determination module 120.
A header refers to the leading part of a table and can be used to categorize the table's contents. Since a table is composed of rows and columns, headers can correspondingly be divided into row headers and column headers. A row header is a header located on one or more rows of the table; each cell in the row header categorizes the column that cell belongs to. A column header is a header located on one or more columns of the table; each cell in the column header categorizes the row that cell belongs to.
It should be understood that a table contains a row header and/or a column header, where the row header may span one or more rows of the table and the column header may span one or more columns. As shown in fig. 3A, the table includes a two-row row header and a two-column column header, where "category", "subject", "2030", "2031", "first year", and "second year" are contents of the row header, and "income", "expense", "sales income", "other income", "tax", and "payroll" are contents of the column header.
In some embodiments, the table to be processed may be input into the header detection model, which outputs the row header and/or column header of the table to be processed.
In some embodiments, the header detection model may include a row header detection model and a column header detection model.
In some embodiments, the row header of the table to be processed may be determined by processing the rows of the table based on the row header detection model. In some embodiments, the cells of a row in the table to be processed are spliced based on a row splicing rule to obtain a first splicing result, and the first splicing result is processed based on the row header detection model to determine the row header of the table to be processed. For more details on this embodiment, refer to fig. 6 and its related description, which are not repeated herein. In some embodiments, the row header of the table to be processed may also be determined based on the row header detection model in other ways. For example, the table to be processed may be directly input into the row header detection model, which outputs the row header.
In some embodiments, the columns of the table to be processed may be processed based on the column header detection model to determine the column header of the table to be processed. In some embodiments, the cells of a column in the table to be processed are spliced based on a column splicing rule to obtain a second splicing result, and the second splicing result is processed based on the column header detection model to determine the column header of the table to be processed. For more on this embodiment, refer to fig. 7 and its related description, which are not repeated herein. In some embodiments, the column header of the table to be processed may also be determined based on the column header detection model in other ways. For example, the table to be processed may be directly input into the column header detection model, which outputs the column header.
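The row and column splicing rules above can be sketched as follows. This is a hypothetical illustration: the `[SEP]` separator and the stubbed detection predicate are assumptions, since the patent does not fix the model's input format or architecture at this point.

```python
SEP = "[SEP]"  # assumed separator token between spliced cells

def splice_row(row):
    """Join the cells of one row into a single string (the first splicing result)."""
    return SEP.join(cell.strip() for cell in row)

def splice_column(table, j):
    """Join the cells of column j into a single string (the second splicing result)."""
    return SEP.join(row[j].strip() for row in table)

def detect_headers(texts, is_header):
    """Return the indices of spliced strings that the detection model flags as headers."""
    return [i for i, text in enumerate(texts) if is_header(text)]
```

A trained row header detection model would score each spliced row string; here `is_header` merely stands in for that model.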
In some embodiments, the row header and/or column header may also be determined by other means. For example, according to a preset rule, the first row of the table to be processed is directly used as the row header, and the first column as the column header.
Step 230, processing the row header based on the header classification model to determine the classification results of the columns in the table to be processed, and/or processing the column header based on the header classification model to determine the classification results of the rows in the table to be processed. In some embodiments, step 230 may be performed by the second determination module 130.
The classification result of a row refers to the result of classifying the content represented by that row in the table to be processed. Correspondingly, the classification result of a column refers to the result of classifying the content represented by that column. In some embodiments, the classification results of the rows and/or columns may contain only the categories the user needs. As shown in fig. 3A, the first and second rows are determined to be the row header of the table; if the only category the user needs is time, the categories of the first and second columns of the table may be determined to be other, and the categories of the third to fifth columns to be time.
In some embodiments, the second determining module may input the row header of the table to be processed into the header classification model, which outputs the classification results of the columns. In some embodiments, the second determining module 130 may likewise input the column header of the table to be processed into the header classification model, which outputs the classification results of the rows.
In some embodiments, before the row header and/or column header is input into the header classification model, the second determining module may further determine whether the character length of the row header and/or column header is greater than a preset threshold. If it is, the row header and/or column header is segmented to determine multiple groups of sub-row headers and/or sub-column headers, and each group of sub-row headers and/or sub-column headers is used as input to the header classification model. In other words, classification is performed based on the header classification model with one group of sub-row headers and/or sub-column headers input at a time, and all of the row headers and column headers are processed through multiple executions.
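The length check and segmentation described above can be sketched as a simple greedy grouping of header cells. The threshold value and the grouping strategy are illustrative assumptions; the patent only requires that each group fit the classification model's input limit.

```python
def segment_header(cells, max_chars=512):
    """Split a long header's cells into groups whose total character
    length stays within max_chars, one group per model invocation."""
    groups, current, length = [], [], 0
    for cell in cells:
        # Start a new group if adding this cell would exceed the threshold.
        if current and length + len(cell) > max_chars:
            groups.append(current)
            current, length = [], 0
        current.append(cell)
        length += len(cell)
    if current:
        groups.append(current)
    return groups
```

Each returned group of sub-header cells would then be fed to the header classification model in turn.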
In some embodiments, the header classification model may include a row header classification model and a column header classification model.
In some embodiments, the second determining module may process the row header based on the row header classification model, and determine the classification results of the columns in the table to be processed. In some embodiments, the cells in the row header may be spliced based on the column splicing rule to obtain a third splicing result; the third splicing result, together with the cell numbers and column numbers of the cells in the row header, is then processed based on the row header classification model to determine the classification results of the columns in the table to be processed. For more details on this embodiment, refer to fig. 8 and its related description, which are not repeated herein. In some embodiments, the classification results of the columns may also be determined based on the row header classification model in other ways. For example, the row header of the table to be processed may be directly input into the row header classification model, which outputs the classification results of the columns.
In some embodiments, the column header may be processed based on the column header classification model, and the classification results of the rows in the table to be processed may be determined. In some embodiments, the cells in the column header may be spliced based on the row splicing rule to obtain a fourth splicing result; the fourth splicing result, together with the cell numbers and column numbers of the cells in the column header, is then processed based on the column header classification model to determine the classification results of the rows in the table to be processed. For more details on this embodiment, refer to fig. 9 and its related description, which are not repeated herein. In some embodiments, the classification results of the rows may also be determined based on the column header classification model in other ways. For example, the column header of the table to be processed may be directly input into the column header classification model, which outputs the classification results of the rows.
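The inputs named above (a spliced header string plus each header cell's cell number and column number) can be assembled as in this hypothetical sketch. The `[SEP]` separator and the running cell-number convention are assumptions; the patent does not specify how the cell numbers are assigned.

```python
def build_row_header_input(header_rows):
    """Flatten the row header into (spliced text, cell numbers, column numbers),
    the three inputs fed to the row header classification model above."""
    cells, numbers, columns = [], [], []
    n = 0
    for row in header_rows:
        for j, cell in enumerate(row):
            cells.append(cell)
            numbers.append(n)   # running cell number across the header
            columns.append(j)   # column number of the cell
            n += 1
    return "[SEP]".join(cells), numbers, columns
```

The column numbers let the model attribute each header cell's category back to the column it governs, even when the header spans several rows.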
Step 240, extracting the table to be processed based on the classification results of the columns and/or rows in the table to be processed, and determining a first extraction result. In some embodiments, step 240 may be performed by the extraction module 140.
The first extraction result refers to the information of the cells extracted from the table to be processed based on the classification results of its columns and/or rows. In some embodiments, the extraction module 140 may extract from the table to be processed according to the user's requirement based on the classification results of the columns and/or rows, and determine the first extraction result. For example, when the user requires time-related table data, extracting from the table shown in fig. 3A based on the classification results of the columns and/or rows may yield the table data of the third to fifth columns.
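The first extraction step can be sketched as a column filter over the per-column classification results. The category names are illustrative assumptions; in practice the categories would come from the header classification model.

```python
def extract_columns(table, column_categories, wanted):
    """Keep only the columns whose classification result matches the
    category the user requires (the first extraction result)."""
    keep = [j for j, cat in enumerate(column_categories) if cat == wanted]
    return [[row[j] for j in keep] for row in table]
```

With the fig. 3A example, classifying the third to fifth columns as "time" and the rest as "other" would leave exactly the time-related columns after extraction.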
In some embodiments, the table to be processed may also be extracted in other manners; for example, it may be extracted based on the types of the words in the text of its cells. For more about extracting the table to be processed based on word types, refer to fig. 4 and its related description, which are not repeated herein.
Some embodiments of the present description determine the row header or column header of a table to be processed through a header detection model, classify that row header or column header through a header classification model, and then extract from the table to be processed and determine an extraction result. This accurately locates the information the user needs in the table to be processed and improves the efficiency of table extraction.
FIG. 4 is yet another exemplary flow diagram of a table processing method according to some embodiments of the present description. In some embodiments, the flow 400 may be performed by the third determination module 150 and the cell extraction module 160. As shown in fig. 4, the process 400 may include the following steps:
Step 410, processing the text of the cells in the table to be processed based on the text classification model, and determining the types of the words in the text of those cells. In some embodiments, step 410 may be performed by the third determination module 150.
The types of words may be preset as needed. As shown in fig. 3A, the type of "category" in the cell in the first row and first column is category, the type of "subject" in the cell in the first row and second column is category, and the type of "2030 year" in the cell in the first row and third column is time. In some embodiments, types the user is not interested in may be set to other. For example, when the only type the user needs is category, "2030 year" in the cell in the first row and third column of the table shown in fig. 3A may be set to the type other. In some embodiments, the types of words may be set according to actual needs.
In some embodiments, the types of words may also be divided into multiple levels, e.g., primary category, secondary category, and other. A secondary category may be a subcategory under a primary category, and other may be a general term for information the user does not need.
In some embodiments, the text of a cell in the table to be processed may be purely numeric (as shown in fig. 3A, the text of a cell may be "8000"), and such text carries no meaning on its own when it exists only in numeric form. Thus, the purely numeric portion of a cell may be omitted when determining the types of words in the cell's text.
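The numeric-cell check above can be sketched as a small predicate run before word-type classification. Treating a comma-grouped number as numeric is an illustrative assumption.

```python
def is_pure_number(text):
    """Return True if the cell text is purely numeric and can be
    skipped when determining word types, as described above."""
    stripped = text.strip().replace(",", "")  # tolerate digit grouping
    try:
        float(stripped)
        return True
    except ValueError:
        return False
```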
In some embodiments, the input to the text classification model may be the text of a cell, and the output is the type of each word in that text. For example, if the text content of a certain cell in the table to be processed is "the number of supermarkets is counted to be 5", the type of the word "supermarkets" is name, the type of "counted" is other, the type of "number" is attribute, and the type of "5" is number.
In some embodiments, the text classification model may include an encoding layer and a third classification layer. The coding layer may code characters in the text of the cells in the table to be processed, and determine coding vectors of the characters. The input to the encoding layer may be the text of a cell and the output may include an encoded vector of characters in the text. The third classification layer may analyze the encoded vectors of the characters to determine the type of words in the text of the cell. The input of the third classification layer can be the coding vector of characters in the text, and the output can be the type of words in the text. In some embodiments, the encoding layer in the text classification model and the encoding layer in the header classification model may be the same (i.e., common) or different.
In some embodiments, the text classification model may label the type of each character in the input cell text via the BIO (Begin, Inside, Outside) labeling scheme, where B marks the beginning character of a word, I marks a character inside a word, and O marks characters outside any word.
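Decoding per-character BIO tags back into typed words can be sketched as below. The tag format ("B-name", "I-name", "O", etc.) and the type names are assumptions about the label set, which the patent does not spell out.

```python
def decode_bio(chars, tags):
    """Group characters tagged B-*/I-* into (word, type) spans; O characters
    close any open span and are skipped."""
    spans, word, kind = [], "", None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if word:                      # close the previous span
                spans.append((word, kind))
            word, kind = ch, tag[2:]      # start a new span of this type
        elif tag.startswith("I-") and word:
            word += ch                    # continue the open span
        else:                             # "O" (or a stray I-) ends the span
            if word:
                spans.append((word, kind))
            word, kind = "", None
    if word:
        spans.append((word, kind))
    return spans
```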
In some embodiments, the text classification model may be a Bidirectional Encoder Representations from Transformers based Named Entity Recognition (BERT-NER) model.
In some embodiments, a text classification model may be obtained by training on historical table data. When the text classification model is a BERT-NER model, its encoding layer can be obtained pre-trained in a number of ways (for example, from the network), so during training only the third classification layer of the text classification model needs to be trained on a number of training samples; once the third classification layer is trained, the trained text classification model is obtained. The text of cells in the historical table data may be input into the encoding layer of the initial text classification model to obtain the encoding vectors of the characters in that text, and these encoding vectors are used as training samples. The labels of the training samples may be the types of the words in the text of the cells in the historical table data. The training samples may also come from other corpora rather than from tables. A labeled training sample is input into the third classification layer of the initial text classification model, the parameters of the third classification layer are updated through training, and the training ends when the trained third classification layer meets a preset condition, yielding the trained text classification model.
Step 420, extracting the table to be processed based on the types of the words in the text of its cells, and determining a second extraction result.
The second extraction result refers to the table content obtained by extracting the cells of the table to be processed based on the types of their words. In some embodiments, the second extraction result may be determined by extracting from the table to be processed, according to the user's requirement, based on the types of the words of its cells. For example, if the content of a certain cell in the table is "the number of supermarkets is counted to be 5" and the type of word the user requires is name, the words of that type can be extracted from the cell, and the extracted content is "supermarket".
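The second extraction can be sketched as a filter over the (word, type) pairs that the text classification model would produce for a cell. The pairs and type names here are illustrative assumptions matching the supermarket example above.

```python
def extract_words(typed_words, wanted):
    """Keep only the words whose predicted type matches the type the
    user requires (the second extraction result for one cell)."""
    return [word for word, word_type in typed_words if word_type == wanted]
```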
In some embodiments, the table to be processed may be extracted based on both the types of the words in the text of its cells and the classification results of its columns and/or rows, so as to obtain a third extraction result. The third extraction result is the table content obtained by extracting the table to be processed based on both the word types and the column and/or row classification results. For example, the cells of the table obtained in the first extraction result may be processed to determine the types of the words in their text, and the contents of those cells may then be extracted based on those types to determine the third extraction result.
In some embodiments, the extraction mode may be selected as desired or according to the type of table. For example, if all cells in the table except the header are numerical values, the cells may be extracted based on only the method corresponding to fig. 2. If the cells in the table have both numeric values and characters or letters, the cells can be extracted based on the method shown in fig. 2 and 3.
For more contents extracted from the table based on the word type in the text of the cell, refer to steps 410 to 420 and related descriptions thereof, and for more contents extracted from the table to be processed based on the classification result of the column and/or row, refer to steps 210 to 240 and related descriptions thereof, which are not described herein again.
In some embodiments of the present disclosure, the types of the words of a cell may be determined by a text classification model, and the second extraction result may then be determined based on those types. With this arrangement, the content of the cells in the table to be processed can be extracted, the extraction result determined, and the content required by the user obtained. In some embodiments of the present description, the table to be processed may be extracted based on both the types of the words of the cells and the classification result of the columns and/or rows to obtain a third extraction result; extracting the table to be processed twice in this way makes the obtained table content more refined and more convenient for a user to view. Meanwhile, in some embodiments of the present description, the text classification model is a BERT-NER model; since the word coding layer of the BERT-NER model is pre-trained and can be obtained in various ways, only the word classification layer of the BERT-NER model needs to be trained, thereby reducing the number of training samples and the training difficulty.
FIG. 5 is an exemplary flow diagram illustrating the determination of a table to be processed according to some embodiments of the present description. In some embodiments, flow 500 may be performed by acquisition module 110. As shown in fig. 5, the process 500 may include the following steps:
Step 510, obtaining the initial table and its title.
The initial table may refer to a table for which it must still be determined whether it contains information of interest to the user. The initial table may contain various information, which may or may not be of interest to the user. For example, the initial tables may include financial statements, personnel statements, annual statements, and the like. The financial statements may include balance sheets, profit sheets, cash flow sheets, owner equity change sheets, financial statement notes, and the like. In some embodiments, the initial table may be obtained in a variety of ways. For example, the initial table may be a table generated by the user and stored in the processor, from which it can be obtained directly. As another example, the initial table may be obtained from a network.
The title may be the file name under which the initial table was stored. For example, if the file name of an initial table is "2031-year operation status statistics", then "2031-year operation status statistics" is the title of that initial table. In some embodiments, when an initial table is retrieved, its title may be retrieved at the same time.
Step 520, processing the initial form and the text in the title thereof based on the form classification model, and determining the classification result of the initial form.
The classification result of the initial table may refer to a result of classifying the initial table. The classification of the initial table may be determined according to user requirements, for example, the initial table may be classified into finance, personnel, and the like according to the user requirements based on the fields involved in the initial table, and further, for example, the initial table may be classified into 2030 table, 2031 table, 2032 table, and the like according to the user requirements based on the time involved in the initial table.
In some embodiments, the classification result of the initial table may include a positive sample table and a negative sample table, where the positive sample table may refer to a table satisfying the user requirement and the negative sample table may refer to a table not satisfying it. For example, in a certain business scenario, the tables the user requires are the three major financial statements, so the positive sample tables among the initial tables may be the balance sheet, the profit sheet and the cash flow sheet, and the negative sample tables may be all tables other than those three.
In some embodiments, the title of the initial table and the text in the initial table may be preprocessed to obtain a preprocessed text. For example, the title of the initial table and the text in the initial table may be spliced to obtain a splicing result, and the splicing result is used as the preprocessed text. A special identifier "[SEP]" may be added between the title of the initial table and the text in the initial table during splicing, to distinguish the contents of each part of the splicing result. As shown in fig. 3A, the name of the table is "2031-year operation status statistics", and splicing the title of the table with the table gives the splicing result "2031-year operation status statistics [SEP] category; the first half year; the next half year; subject of the study; income; sales revenue; other revenues; paying out; tax payment; payroll".
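The splicing described above can be sketched as follows; the exact separators (";" between cells, spaces around "[SEP]") are assumptions made for illustration:

```python
def splice_title_and_table(title, cell_texts):
    """Splice the title and the cell texts into one preprocessed string,
    with "[SEP]" separating the title from the table content and ";"
    separating the cells."""
    return title + " [SEP] " + "; ".join(cell_texts)
```

Applied to the fig. 3A example, the title goes first and every non-numeric cell follows after the separator.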
It should be understood that a numerical value alone cannot express its specific meaning, so when the text in the initial table is preprocessed, the numerical content of each cell may be omitted.
In some embodiments, the preprocessed text may be input to a form classification model, the output of which may include the classification results of the initial form. For example, a table is a positive sample table or a negative sample table.
In some embodiments, the number of words of the text input to the table classification model should be less than a first threshold. The first threshold may refer to the number of words the table classification model allows in the input text; for example, the first threshold may be 100, meaning that the number of words in the splicing result input to the table classification model needs to be less than 100.
In some embodiments, the number of words in the input text that are taken from the title should be less than a second threshold. The second threshold may refer to the number of words from the title that the table classification model allows in the input text; for example, the second threshold may be 20, meaning that the number of words from the title in the splicing result input to the table classification model needs to be less than 20.
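A sketch of enforcing both thresholds when building the model input, assuming each limit is strict ("less than") and that the title and table text already come split into words; the helper name is hypothetical:

```python
def build_model_input(title_words, table_words,
                      first_threshold=100, second_threshold=20):
    """Truncate so that the words taken from the title stay strictly below
    the second threshold and the whole spliced input stays strictly below
    the first threshold."""
    title_part = title_words[:second_threshold - 1]      # at most 19 title words
    room_left = first_threshold - 1 - len(title_part)    # at most 99 words total
    return title_part + table_words[:room_left]
```

With a 30-word title and 200 words of table text, the result keeps 19 title words and fills up to 99 words in total.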
In some embodiments, the table classification model may be a Bidirectional Encoder Representations from Transformers (BERT) model. Correspondingly, when the table classification model is a BERT model, the first threshold may be the maximum number of characters the BERT model allows as input. For example, when the table classification model is a BERT model that allows at most 512 input characters, the first threshold is correspondingly 512.
In some embodiments, when the table classification model is a BERT model, the table classification model may be obtained by training on historical table data. The BERT model may include a vectorization part and a task processing part, where the vectorization part may be obtained pre-trained. Therefore, during training, only the task processing part of the table classification model needs to be trained on the training samples; once the task processing part has been trained, the trained table classification model is obtained. The historical table data can be preprocessed to obtain a preprocessed text, the preprocessed text is input into the vectorization part of the initial table classification model to obtain vectorized historical table data, and the vectorized historical table data is used as the training samples. The labels of the training samples can be the classification results corresponding to the historical table data, which can be obtained by manual labeling. A labeled training sample is input into the task processing part of the initial table classification model, the parameters of the task processing part are updated through training, and the training is finished when the task processing part of the table classification model satisfies a preset condition, yielding the trained table classification model.
In some embodiments, the table classification model may also be other types of models, for example, the table classification model may include, but is not limited to, a support vector machine model, a Logistic regression model, a naive bayes classification model, a gaussian distributed bayes classification model, a decision tree model, a random forest model, a KNN classification model, a neural network model, and the like.
Step 530, when the classification result of the initial table meets a first preset condition, determining the initial table as a table to be processed.
The first preset condition may represent the condition under which an initial table satisfies the user requirement. In some embodiments, the first preset condition may be that the classification result of the initial table is a positive sample table. For example, the tables required by the user are the three major financial statements; when the initial table contains the contents of the balance sheet, the profit sheet or the cash flow sheet, its classification result may be determined to be a positive sample table, confirming that the classification result of the initial table satisfies the first preset condition. The obtaining module 110 may determine an initial table meeting the first preset condition as a table to be processed.
In some embodiments, after the classification result of the initial table meets the first preset condition, the initial table may be further cleaned, redundant rows and/or columns in the initial table are deleted, and the cleaned initial table is determined as the table to be processed. For more details on the cleaning of the initial table, refer to fig. 2 and the related description thereof, which are not repeated herein.
In some embodiments, when the classification result of the initial table does not satisfy the first preset condition, the initial table is filtered out and not processed further. For example, the tables required by the user are the three major financial statements, initial table A is an annual report, and initial table B is an owner equity change table; the contents of both tables are unrelated to the three major financial statements, so the classification results of initial table A and initial table B are negative sample tables and do not satisfy the first preset condition. The obtaining module 110 may filter out initial table A and initial table B, which do not satisfy the first preset condition, without performing subsequent processing on them.
Some embodiments of the present description may determine the classification result of the initial table through a table classification model, and then determine the positive sample tables as the tables to be processed, so as to accurately locate the tables to be processed and filter out redundant negative sample tables the user does not care about. In addition, in some embodiments of the present description, the table classification model is a BERT model; since the vectorization part of the BERT model is pre-trained and can be obtained in multiple ways, only the task processing part of the BERT model needs to be trained, which reduces the number of training samples and the training difficulty. Meanwhile, processing the initial table with the BERT model allows the meaning of the input text to be learned more accurately, so that the initial table is classified accurately and the table to be processed is determined.
FIG. 6 is a schematic diagram illustrating detection of a row header in accordance with some embodiments of the present description. In some embodiments, the flow 600 may be performed by the first determination module 120. As shown in fig. 6, the process 600 includes the following steps:
Step 610, splicing the cells of the rows in the table to be processed based on the row splicing rule to obtain a first splicing result.
In some embodiments, the first determining module 120 may stitch cells of a row in the table to be processed based on the row stitching rule to obtain a first stitching result.
The row splicing rule may refer to a rule for splicing cells based on the order of the rows. In some embodiments, the row splicing rule may include splicing the cells of different rows based on the order of the rows and, within each row, splicing the cells of that row based on the order of the cells in the row. For example, the row splicing rule may splice each row from left to right and then join the splicing results of the different rows from top to bottom. In some embodiments, the row splicing rule further limits the number of times a merged cell is spliced. For example, a merged cell is spliced only once.
For example, as shown in fig. 3A, the first 3 rows of the table are spliced based on the row splicing rule, and the obtained splicing result is "category; subject of the study; 2030; 2031 year; the first half year; the next half year; income; sales revenue; 15000; 8000; 9000".
The first splicing result may refer to the result of splicing the cells of the rows in the table to be processed based on the row splicing rule.
In some embodiments, the first determining module 120 may splice the first m rows of the table to be processed based on the row splicing rule, where the table to be processed contains n rows, and m is less than or equal to n. In some embodiments, m may be preset (e.g., m is 2, etc.). For example, as shown in fig. 3A, the first three rows are spliced, and the first splicing result may be a "category; subject of the study; 2030; 2031 year; the first half year; the next half year; income; sales revenue; 15000; 8000; 9000".
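The row splicing rule for the first m rows can be sketched as below. Representing a merged cell as the same object occupying every grid position it covers is an assumption made for illustration; it lets the rule "splice a merged cell only once" be enforced by object identity.

```python
def splice_rows(table, m):
    """Splice the first m rows left-to-right, then top-to-bottom; a merged
    cell (the same dict placed in every grid position it covers) is
    spliced only once."""
    parts, seen = [], set()
    for row in table[:m]:
        for cell in row:
            if id(cell) not in seen:
                seen.add(id(cell))
                parts.append(cell["text"])
    return "; ".join(parts)

# A grid shaped like the header of fig. 3A:
cat, subj = {"text": "category"}, {"text": "subject of the study"}
y2030, y2031 = {"text": "2030"}, {"text": "2031 year"}
grid = [
    [cat, subj, y2030, y2031, y2031],                 # "2031 year" spans 2 columns
    [cat, subj, y2030, {"text": "the first half year"},
     {"text": "the next half year"}],                 # left cells span 2 rows
    [{"text": "income"}, {"text": "sales revenue"},
     {"text": "15000"}, {"text": "8000"}, {"text": "9000"}],
]
```

Splicing the first three rows of this grid reproduces the first splicing result quoted above.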
And step 620, processing the first splicing result based on the row header detection model, and determining the row header of the table to be processed.
In some embodiments, the first splicing result corresponding to the table to be processed may be input into the row header detection model, and the output of the row header detection model may be the probability that each input row belongs to the row header. Further, the first determination module 120 may determine the row header based on the output of the row header detection model. For example, the rows whose probability is greater than a probability threshold are taken as the row header. For another example, the rows whose probability is greater than the probability threshold determine the starting row of the row header, from which the row header is further determined. It is understood that the output of the row header detection model may also take other forms, such as directly outputting the row numbers of the row header.
As shown in fig. 3A, when the first 3 rows of the table are spliced based on the row splicing rule, the corresponding first splicing result is "category; subject of the study; 2030; 2031 year; the first half year; the next half year; income; sales revenue; 15000; 8000; 9000". The first splicing result is input into the row header detection model, and the output of the row header detection model may be [0,1], representing that the first to second rows of the table are the row header. It should be understood that when the processor processes a table the row index starts at 0, so [0,1] represents the first to second rows of the table.
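When the model output is interpreted as per-row probabilities, the 0-based row header indices noted above might be recovered as follows (threshold value is an assumption):

```python
def header_rows(probabilities, threshold=0.5):
    """Map the model's per-row probabilities to 0-based row header indices."""
    return [i for i, p in enumerate(probabilities) if p > threshold]
```

For three spliced rows where only the first two score highly, this yields [0, 1], matching the example output.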
In some embodiments, the row header detection model may include a feature embedding layer, a first sequence layer, a first fusion layer, a second sequence layer, and a first classification layer connected in sequence. Correspondingly, step 620 may include the steps of:
step 621, determine a first feature vector of a cell in a row of the table to be processed based on the first splicing result.
In some embodiments, the feature embedding layer may be configured to determine a first feature vector of a cell in the table to be processed based on a result of the stitching of the cell in the table to be processed. For example, the input of the feature embedding layer is the first splicing result, and the output is the first feature vector of the cell of the row in the table to be processed.
The first feature may include a location feature and a text feature of the cell. Correspondingly, the first feature vector may be vector information characterizing the location feature and the text feature of the cell. The location feature of a cell may characterize the position of the cell in the table to be processed. In some embodiments, the location feature of a cell may be [a, b, c, d], where a represents the row of the cell in the table to be processed, b represents the column, c represents whether the cell spans rows, and d represents whether the cell spans columns. For example, a location feature vector of [0, 0, 1, 1] means the cell is in the first row and first column of the table to be processed, spans rows, and spans columns. The text feature may characterize the text content of the cells in the table to be processed.
In some embodiments, the types of feature Embedding layers may include an Embedding layer (Embedding layer), a Linear transformation layer (Linear layer), and the like. The first determining module 120 may determine a first feature vector of a cell in the table to be processed through the location feature and the text feature of the cell based on the embedding layer and the linear transformation layer. For example, in the Embedding layer, 4 position information in the position features of the cells are represented by 1 Embedding vector respectively, and 4 Embedding vectors corresponding to the 4 position information obtain feature vectors representing the position information of the cells. For another example, the text in the cell is vector-represented in the embedding layer, and a feature vector representing the text information is obtained. And performing linear transformation on the feature vector representing the position information of the cell and the feature vector representing the text information through a linear transformation layer to ensure that the vector dimensions of the two vectors are the same. And adding the feature vector representing the position information of the cell and the feature vector representing the text information in the same dimension based on vector matrix addition to obtain a first feature vector of the cell.
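A sketch of building the location feature [a, b, c, d] and fusing it with a text feature vector by same-dimension addition; the projection of both vectors to equal length by the linear transformation layer is assumed to have happened already:

```python
def location_feature(row, col, rowspan, colspan):
    """Build [a, b, c, d]: row index, column index, spans-rows flag,
    spans-columns flag."""
    return [row, col, int(rowspan > 1), int(colspan > 1)]

def fuse(position_vec, text_vec):
    """Add the position-feature vector and the text-feature vector
    element-wise in the same dimension (vector matrix addition)."""
    return [p + t for p, t in zip(position_vec, text_vec)]
```

A cell in the first row and first column that spans two rows and two columns yields [0, 0, 1, 1], matching the example above.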
Step 622, determining a second feature vector of the cells of the rows in the table to be processed based on the first feature vector of the cells of the rows in the table to be processed.
In some embodiments, the first sequence layer may be configured to determine a second feature vector for a cell in the table to be processed based on the first feature vector for the cell in the table to be processed. For example, the input of the first sequence layer is the first feature vector of the cell of the row in the table to be processed, and the output is the second feature vector of the cell of the row in the table to be processed.
The second feature vector refers to a feature vector that represents the relationship information between a cell and the cells before and after it in the input, i.e., a feature vector that fuses the relationship information between neighboring cells, such as the semantic information between cells.
In some embodiments, the first sequence layer may be implemented based on a Long Short-Term Memory network (LSTM). The first sequence layer may also be other sequence models, e.g., RNN, etc.
Step 623, determining a third feature vector of the row in the table to be processed based on the second feature vector of the cell of the row in the table to be processed.
In some embodiments, the first fusion layer may be configured to determine a third feature vector of a row or a column in the table to be processed based on the second feature vector of the cell in the table to be processed. For example, the input of the first fusion layer is the second feature vector of the cell of the row in the table to be processed, and the output is the third feature vector of the row.
The third feature vector refers to a feature vector that can characterize a certain row or column in the table to be processed; in other words, the third feature vector fuses the feature information of all cells in that row or column. For example, when the first row of a table to be processed includes the second feature vectors of 4 cells, the third feature vector of the first row can correspondingly represent the information of those 4 cells. For another example, when the first column of a table to be processed includes the second feature vectors of 3 cells, the third feature vector of the first column can represent the information of the second feature vectors of those 3 cells.
In some embodiments, the type of the first fusion layer may be a pooling layer. The pooling layer may pool the cells in each row or each column, and obtain the third feature vector in each row or each column by maximal pooling or average pooling, etc. For example, the second feature vectors of the cells of a certain row are merged into a third feature vector of the row by the pooling layer.
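The first fusion layer's pooling can be sketched as plain average or maximal pooling over the second feature vectors of the cells in one row or column:

```python
def average_pool(cell_vectors):
    """Fuse the second feature vectors of one row (or column) into a
    single third feature vector by average pooling per dimension."""
    n = len(cell_vectors)
    return [sum(dim) / n for dim in zip(*cell_vectors)]

def max_pool(cell_vectors):
    """Alternative fusion by maximal pooling per dimension."""
    return [max(dim) for dim in zip(*cell_vectors)]
```

Either variant collapses any number of cell vectors into one fixed-length row (or column) vector.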
Step 624, determine the fourth feature vector of the row in the table to be processed based on the third feature vector of the row in the table to be processed.
In some embodiments, the second sequence layer may be configured to determine a fourth feature vector for a row in the table to be processed based on the third feature vector for the row in the table to be processed. The second sequence layer may be to determine a fourth feature vector of a column in the table to be processed based on the third feature vector of the column in the table to be processed. For example, the input of the second sequence layer may be a third feature vector of rows and the output may be a fourth feature vector of rows.
The fourth feature vector refers to a feature vector that can represent the relationship information between adjacent rows or adjacent columns; in other words, the fourth feature vector fuses the information about the correlation between the preceding and following rows or columns. For example, the relationship information may include semantic information and the like.
In some embodiments, the second sequence layer may be implemented based on LSTM. The second sequence layer may also be other sequence models, such as RNN, etc.
Step 625, determining the row header of the table to be processed based on the fourth feature vector of the row in the table to be processed.
In some embodiments, the first classification layer may be configured to determine a row header of the table to be processed based on a fourth feature vector of a row in the table to be processed. For example, the input of the first classification layer is the fourth eigenvector of the row, and the output is the row header. For more forms of output, see the associated description of step 620.
In some embodiments, the first classification layer may include, but is not limited to, a support vector machine model, a Logistic regression model, a naive bayes classification model, a gaussian distributed bayes classification model, and the like.
In some embodiments, the parameters of the feature embedding layer, the first sequence layer, the first fusion layer, the second sequence layer, and the first classification layer in the row header detection model may be obtained by joint training. The historical table data can be processed based on the row splicing rule to obtain first splicing results of a plurality of samples corresponding to the historical table data, and these first splicing results are used as the training samples. The labels of the training samples are the row headers of the historical table data. In some embodiments, a training sample may be input into the feature embedding layer of the initial row header detection model to obtain the output of the feature embedding layer, the output of the feature embedding layer is input into the first sequence layer to obtain the output of the first sequence layer, the output of the first sequence layer is input into the first fusion layer to obtain the output of the first fusion layer, the output of the first fusion layer is input into the second sequence layer to obtain the output of the second sequence layer, and the output of the second sequence layer is input into the first classification layer to obtain the output of the first classification layer. A loss function is constructed based on the output of the first classification layer and the label, and the parameters of every layer of the initial row header detection model are iteratively updated simultaneously based on the loss function until a preset condition is met and training is completed, yielding the trained row header detection model.
The parameters of each layer in the line header detection model are obtained through the training mode, and the problem that labels are difficult to obtain when each layer in the line header detection model is trained independently is solved.
FIG. 7 is a schematic diagram illustrating detection of a list header in accordance with some embodiments of the present description. In some embodiments, the flow 700 may be performed by the first determination module 120. As shown in fig. 7, the process 700 may include the following steps:
Step 710, splicing the cells of the columns in the table to be processed based on the column splicing rule to obtain a second splicing result.
The column splicing rule may refer to a rule for splicing cells based on the order of the columns. In some embodiments, the column splicing rule may include splicing the cells of different columns based on the order of the columns and, within each column, splicing the cells of that column based on the order of the cells in the column. For example, the column splicing rule may splice each column from top to bottom and then join the splicing results of the different columns from left to right. In some embodiments, the column splicing rule further limits the number of times a merged cell is spliced. For example, a merged cell is spliced only once. For example, as shown in fig. 3A, the first 2 columns of the table are spliced based on the column splicing rule, and the obtained splicing result is "category; income; paying out; subject of the study; sales revenue; other revenues; tax payment; payroll".
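The column splicing rule can be sketched by mirroring the row version: cells top-to-bottom within each column, then columns left-to-right, with merged cells (modelled as shared objects, an illustrative assumption) spliced only once.

```python
def splice_columns(table, k):
    """Splice the first k columns top-to-bottom, then left-to-right; a
    merged cell (the same dict in every grid position it covers) is
    spliced only once."""
    parts, seen = [], set()
    for c in range(k):
        for row in table:
            if c < len(row) and id(row[c]) not in seen:
                seen.add(id(row[c]))
                parts.append(row[c]["text"])
    return "; ".join(parts)

# The first two columns of a fig. 3A-like grid:
cat, subj = {"text": "category"}, {"text": "subject of the study"}
inc, pay = {"text": "income"}, {"text": "paying out"}
grid = [
    [cat, subj], [cat, subj],              # header cells span 2 rows
    [inc, {"text": "sales revenue"}], [inc, {"text": "other revenues"}],
    [pay, {"text": "tax payment"}], [pay, {"text": "payroll"}],
]
```

Splicing the first two columns of this grid reproduces the splicing result quoted above.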
The second splicing result may refer to a result of splicing cells of a column in the to-be-processed table based on the column splicing rule.
The content of step 710 is similar to that of step 610, except that step 710 processes the cells in the columns, and step 610 processes the cells in the rows, so more about step 710 refers to step 610, and is not described herein again.
Step 720, processing the second splicing result based on the list head detection model, and determining the list head of the table to be processed.
In some embodiments, the second concatenation result corresponding to the table to be processed may be input into the list head detection model, and the output of the list head detection model may be the list head of the table to be processed. In some embodiments, the list head detection model may also include a feature embedding layer, a first sequence layer, a first fusion layer, a second sequence layer, and a first classification layer connected in sequence. Correspondingly, step 720 may also include the following steps:
step 721, determining a first feature vector of a cell in the list to be processed based on the second splicing result. In some embodiments, the input of the feature embedding layer is the second stitching result, and the output is the first feature vector of the cell of the column in the table to be processed.
Step 722, determining a second feature vector of the cell in the column of the table to be processed based on the first feature vector of the cell in the column of the table to be processed. In some embodiments, the input of the first sequence layer is a first feature vector of a cell of a column in the table to be processed, and the output is a second feature vector of a cell of a column in the table to be processed.
Step 723, determining a third feature vector of the column in the table to be processed based on the second feature vector of the cell of the column in the table to be processed. In some embodiments, the input of the first fusion layer may be a second feature vector of a cell of a column in the table to be processed, and the output may be a third feature vector of the column.
Step 724, determining a fourth feature vector of the column in the table to be processed based on the third feature vector of the column in the table to be processed. In some embodiments, the input of the second sequence layer may be a third eigenvector of the column and the output is a fourth eigenvector of the column.
Step 725, determining the list head of the table to be processed based on the fourth feature vector of the column in the table to be processed. In some embodiments, the first classification layer may be configured to determine a column header of the table to be processed based on a fourth feature vector of a column in the table to be processed. For example, the input of the first classification layer may be the fourth feature vector of the column, and the output is the head of the list.
The content of step 720 is similar to the content (model content, execution mode, training process) of step 620, and the difference is only that step 720 processes the columns in the table to be processed and determines the list header, and step 620 processes the rows in the table to be processed and determines the row header, so more about step 720 refer to step 620, which is not described herein again.
In some embodiments of the present description, the row header of a table to be processed may be determined by a row header detection model, and the list head of the table to be processed may be determined by a list head detection model. The row header detection model and the list head detection model each comprise a feature embedding layer, a first sequence layer, a first fusion layer, a second sequence layer and a first classification layer connected in sequence. Through these models, the information of the different cells of the table to be processed and the relations among the cells can be obtained, as well as the relations between different rows or columns and between preceding and following rows or columns; the row header and the list head are then determined from this information. In this way the row header and list head of the table to be processed can be located accurately, which determines the accuracy of the extraction result.
FIG. 8 is a diagram illustrating a determination of a classification result for a column according to some embodiments of the present description. In some embodiments, the flow 800 may be performed by the second determination module 130. As shown in fig. 8, the process 800 may include the following steps:
Step 810, splicing the cells in the row header based on the column splicing rule to obtain a third splicing result.
The third splicing result refers to the result of splicing the cells in the row header based on the column splicing rule. As shown in fig. 3A, the row header of the table consists of the first and second rows; splicing the cells in these rows based on the column splicing rule yields the third splicing result "category; subject; 2030; 2031; first half year; second half year". For more details on the column splicing rule, refer to fig. 7 and the related description, which are not repeated here.
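As a minimal illustration of the splicing described above, the cell texts of the row header can be concatenated into one string. The function name and the "; " separator are assumptions for illustration, not taken from the patent:

```python
def splice_header_cells(cell_texts, separator="; "):
    # Concatenate the header cell texts into one splicing result string;
    # the "; " separator is an illustrative assumption.
    return separator.join(cell_texts)

# Cell texts of the Fig. 3A row header (first and second rows):
cells = ["category", "subject", "2030", "2031", "first half year", "second half year"]
third_splicing_result = splice_header_cells(cells)
print(third_splicing_result)
```

The splicing result is what the row header classification model consumes in step 820 below.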
Step 820, processing the third splicing result and the cell numbers and column numbers of the cells in the row header based on the row header classification model, and determining the classification result of the columns in the table to be processed.
The cell number refers to a number that represents the position of each cell in the table to be processed. The column number is a number that represents the position of each column in the table to be processed, and correspondingly, the row number is a number that represents the position of each row.
In some embodiments, the rows and columns in the table to be processed may be numbered by a preset rule. For example, the rows in the table to be processed may be numbered A, B, C, … from top to bottom, and the columns may be numbered 0,1, 2, … from left to right.
In some embodiments, the cell number corresponding to a cell may be determined based on the row and column to which the cell corresponds. For example, when the rows in the table to be processed are numbered A, B, C, … from top to bottom, the columns are numbered 0, 1, 2, … from left to right, and the table to be processed is the table shown in fig. 3A, the number corresponding to "category" is A0 to B0, the number corresponding to "subject" is A1 to B1, the number corresponding to "payout" is E0 to F0, the number corresponding to "first half year" is B3, and the number corresponding to "payroll" is F1. In some embodiments, when the cell is a merged cell, the row number corresponding to the cell may also be the number corresponding to the first row among the merged rows, or the column number corresponding to the cell may be the number corresponding to the first column among the merged columns. For example, when the table to be processed is the table shown in fig. 3A, the number corresponding to "year 2031" may be A3, and the number corresponding to "income" may be C0. In some embodiments, the cells may also be numbered in other manners, for example, sequentially starting from the cell in the first row and first column: that cell is numbered 1, numbering proceeds from left to right, so the cell in the first row and second column is numbered 2, the cell in the first row and third column is numbered 3, and so on; when the first row is finished, numbering continues with the cell in the second row and first column, until all cells in the table to be processed are numbered.
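The letter-plus-digit numbering scheme above can be sketched as follows; the helper names are hypothetical, and the merged-cell convention follows the first-row/first-column rule described in the text:

```python
import string

def cell_number(row_index, col_index):
    # Rows are lettered A, B, C, ... from top to bottom;
    # columns are numbered 0, 1, 2, ... from left to right.
    return f"{string.ascii_uppercase[row_index]}{col_index}"

def merged_cell_number(first_row, first_col):
    # A merged cell takes the number of the first row and the first
    # column that it spans, per the convention described above.
    return cell_number(first_row, first_col)

print(cell_number(0, 3))         # row A, column 3 -> "A3" (e.g. "year 2031")
print(merged_cell_number(2, 0))  # a merge starting at row C, column 0 -> "C0"
```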
In some embodiments, the third splicing result and the cell numbers and column numbers of the cells in the row header may be input into the row header classification model, and the output of the row header classification model may be the classification result of the columns in the table to be processed.
In some embodiments, the row header classification model may include an encoding layer, a second fusion layer, an attention layer, a third fusion layer, and a second classification layer connected in sequence. Correspondingly, step 820 may include the steps of:
Step 821, encoding the characters of the cells in the row header based on the third splicing result to obtain the encoding vectors of the characters.
A character may be any text in a cell, such as a word, a numeric value, etc.
In some embodiments, the encoding layer may be configured to encode the characters of the cells in the row header or the column header based on the splicing result of the cells in that header to obtain the encoding vectors of the characters. For example, the input of the encoding layer may be the third splicing result, and the output may be the encoding vectors of the characters in the row header.
In some embodiments, the encoding layer may be a BERT model. In some embodiments, the coding layer may be pre-trained.
Step 822, determining the coding vector of the cell based on the cell number of the cell in the row header and the coding vector of each character in the cell.
In some embodiments, the second fusion layer may be used to determine the encoding vector of a cell in the row header based on the cell number of the cell and the encoding vectors of the individual characters in the cell. For example, the input of the second fusion layer is the cell number of the cell in the row header and the encoding vectors of the characters in the cell, and the output is the encoding vector of the cell.
The code vector of a cell refers to a code vector that can represent text information in the cell.
The second fusion layer can fuse the encoding vectors of the characters in the same cell based on the cell number of the cell to obtain the encoding vector of the cell. In some embodiments, the second fusion layer may be a pooling layer, and accordingly the fusion processing is a pooling processing, for example maximum pooling or average pooling.
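A minimal sketch of this fusion step, assuming average pooling and plain Python lists in place of real tensors (all names are illustrative):

```python
def pool_characters_into_cells(char_vectors, char_cell_numbers):
    # Group the character encoding vectors by the cell number each
    # character belongs to, then average-pool every group into one
    # cell encoding vector.
    groups = {}
    for vector, number in zip(char_vectors, char_cell_numbers):
        groups.setdefault(number, []).append(vector)
    return {
        number: [sum(v[d] for v in vectors) / len(vectors)
                 for d in range(len(vectors[0]))]
        for number, vectors in groups.items()
    }

# Three character vectors; the first two belong to cell "A0".
chars = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cells = ["A0", "A0", "A1"]
print(pool_characters_into_cells(chars, cells))
```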
Step 823, determining the cell fusion feature vectors of the cells in the row header based on the encoding vectors of the cells in the row header and the column numbers of the cells.
In some embodiments, the attention layer may be used to determine a cell fusion feature vector for a cell in the row header based on the coding vector for the cell and the column number of the cell in the row header. For example, the attention layer inputs the coding vector of the cell in the row header and the column number of the cell, and outputs the cell fusion feature vector of the cell.
A cell fusion feature vector for a cell in the row header may refer to a feature vector that contains relationships with other cells in the same column in the row header. For example, the relationship may be relevance or attention information, or the like.
In some embodiments, when the row header is processed, the attention layer may obtain the cell fusion feature vector of each cell based on the attention between cells located in the same column of the row header. The cell fusion feature vector contains the relations with the other cells located in the same column as the cell; that is, it fuses the information of those cells.
As shown in fig. 10A, fig. 10A is the row header of a table, and the row header includes 4 cells: the cell number of "2030" is 1, the cell number of "2031" is 2, the cell number of "first half year" is 3, and the cell number of "second half year" is 4. Fig. 10B shows the mask matrix corresponding to this row header, where "1", "2", "3" and "4" represent the different cell numbers. In the mask matrix, a black circle represents a cell that is visible with respect to a given cell, and a white circle represents a cell that is invisible with respect to it. When the row header is classified, the cells located in the same column are the visible cells. As shown in fig. 10B, for the row header shown in fig. 10A, the only cell visible to the cell with cell number 1 is itself, that is, the cell with cell number 1; the cells visible to the cell with cell number 2 are the cells with cell numbers 2, 3, and 4; the cells visible to the cell with cell number 3 are the cells with cell numbers 2 and 3; and the cells visible to the cell with cell number 4 are the cells with cell numbers 2 and 4. When the row header is classified, the attention layer can determine the attention between cells located in the same column by means of this mask matrix, fuse the information of the visible cells, calculate the attention between the cells, and thereby determine the cell fusion feature vectors in the row header.
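The same-column visibility rule behind fig. 10B can be sketched as a mask matrix. Here each cell is described by the set of column indices it spans (a merged cell spans several), and two cells are mutually visible when those sets intersect; the column assignments are read off the fig. 10A example, and the function name is an assumption:

```python
def visibility_mask(cell_columns):
    # cell_columns maps a cell number to the set of column indices the
    # cell covers; mask[i][j] is 1 when cell j is visible from cell i,
    # i.e. when the two cells share at least one column.
    numbers = sorted(cell_columns)
    return [[1 if cell_columns[i] & cell_columns[j] else 0
             for j in numbers]
            for i in numbers]

# Fig. 10A: "2030" (cell 1) covers column 0, the merged "2031" (cell 2)
# covers columns 1-2, "first half year" (cell 3) covers column 1, and
# "second half year" (cell 4) covers column 2.
columns = {1: {0}, 2: {1, 2}, 3: {1}, 4: {2}}
for row in visibility_mask(columns):
    print(row)
```

The four printed rows reproduce the visibility pattern described in the text: cell 1 sees only itself, cell 2 sees cells 2, 3, and 4, cell 3 sees cells 2 and 3, and cell 4 sees cells 2 and 4.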
Step 824, determining a column fusion feature vector of a column corresponding to the row header based on the cell fusion feature vector of the cell in the row header.
In some embodiments, the third fusion layer may be configured to determine a column fusion feature vector for a column corresponding to the row header based on the cell fusion feature vectors for the cells in the row header. For example, the third fusion layer inputs cell fusion feature vectors of cells in the row header and outputs column fusion feature vectors of columns corresponding to the row header.
It should be understood that the cells in the row header of the table to be processed are used for classifying the columns of the table, and the cells in the column header are used for classifying the rows. Therefore, when the row header is classified, the column fusion feature vector of the column corresponding to the row header is determined, and when the column header is classified, the row fusion feature vector of the row corresponding to the column header is determined.
A column fusion feature vector refers to a feature vector that can represent cells of the same column in a row header.
When the row header is classified, the third fusion layer may perform fusion processing on the feature vectors of the cells in the same column in the row header to obtain a column fusion feature vector of the column corresponding to the row header. In some embodiments, the third fused layer may be a pooling layer. Accordingly, the fusion process may be a pooling process, e.g., by maximal pooling or average pooling, etc.
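A sketch of the third fusion layer under the maximum-pooling option mentioned above; the names and data are illustrative:

```python
def fuse_cells_into_columns(cell_vectors, cell_column_numbers):
    # Group the cell fusion feature vectors by column number, then
    # max-pool each group into one column fusion feature vector.
    groups = {}
    for vector, column in zip(cell_vectors, cell_column_numbers):
        groups.setdefault(column, []).append(vector)
    return {
        column: [max(v[d] for v in vectors) for d in range(len(vectors[0]))]
        for column, vectors in groups.items()
    }

# Two row-header cells sit in column 0 and one in column 1.
vectors = [[0.2, 0.9], [0.7, 0.1], [0.5, 0.5]]
columns = [0, 0, 1]
print(fuse_cells_into_columns(vectors, columns))
```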
Step 825, determining a classification result of the column based on the column fusion feature vector of the column corresponding to the row header.
In some embodiments, the second classification layer may be configured to determine a classification result for a column based on the column fusion feature vector of the column corresponding to the row header. For example, the second classification layer may input a column fusion feature vector of a column corresponding to the row header and output a classification result of the column.
In some embodiments, the second classification layer may include, but is not limited to, a support vector machine model, a Logistic regression model, a naive bayes classification model, a gaussian distributed bayes classification model, and the like.
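As one concrete possibility for the second classification layer, a logistic-regression head over a column fusion feature vector could look like the sketch below. The weights, bias, and the two class names are pure assumptions for illustration, not trained parameters or categories from the patent:

```python
import math

def classify_column(column_vector, weights, bias):
    # Score the column fusion feature vector with a logistic unit and
    # map the probability to one of two illustrative column classes.
    z = sum(w * x for w, x in zip(weights, column_vector)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    return ("data column" if p >= 0.5 else "header-like column", p)

label, p = classify_column([0.7, 0.9], weights=[1.0, 1.0], bias=-1.0)
print(label, round(p, 3))
```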
In some embodiments, in the row header classification model, the parameters of the encoding layer, the second fusion layer, the attention layer, the third fusion layer, and the second classification layer may be obtained by joint training. The row headers in historical table data can be processed based on the column splicing rule to obtain the third splicing results of a plurality of samples corresponding to the historical table data, and these third splicing results are used as training samples. The labels of the training samples are the classification results of the columns in the historical table data. In some embodiments, a training sample may be input to the encoding layer of the initial row header classification model to obtain the output of the encoding layer; the output of the encoding layer is input to the second fusion layer to obtain the output of the second fusion layer; the output of the second fusion layer is input to the attention layer to obtain the output of the attention layer; the output of the attention layer is input to the third fusion layer to obtain the output of the third fusion layer; and the output of the third fusion layer is input to the second classification layer to obtain the output of the second classification layer. A loss function is constructed based on the output of the second classification layer and the label, and the parameters of all layers in the initial row header classification model are iteratively updated at the same time based on the loss function until a preset condition is met, whereupon training is completed and the trained row header classification model is obtained.
Obtaining the parameters of all layers in the row header classification model through this joint training avoids the difficulty of acquiring labels when each layer is trained separately.
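Putting the five layers together, the forward pass of the row header classification model described in steps 821-825 can be sketched as a simple composition. The layer functions below are toy stand-ins that only trace the pipeline shape, not the trained networks:

```python
def row_header_classify(splicing_result, cell_numbers, column_numbers, layers):
    # Apply, in sequence: encoding layer, second fusion layer,
    # attention layer, third fusion layer, second classification layer.
    encode, fuse_characters, attend, fuse_columns, classify = layers
    char_vectors = encode(splicing_result)
    cell_vectors = fuse_characters(char_vectors, cell_numbers)
    fused_cells = attend(cell_vectors, column_numbers)
    column_vectors = fuse_columns(fused_cells, column_numbers)
    return classify(column_vectors)

# Toy stand-in layers (identity-like) just to show the data flow:
layers = (
    lambda text: list(text),                     # encoding layer
    lambda chars, nums: chars,                   # second fusion layer
    lambda cells, cols: cells,                   # attention layer
    lambda cells, cols: cells,                   # third fusion layer
    lambda cols: ["data column" for _ in cols],  # second classification layer
)
print(row_header_classify("ab", ["A0", "A1"], [0, 1], layers))
```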
FIG. 9 is a schematic diagram of determining classification results for rows in a table to be processed according to some embodiments of the present description. In some embodiments, the flow 900 may be performed by the second determination module 130. As shown in fig. 9, the process 900 may include the following steps:
Step 910, splicing the cells in the column header based on the row splicing rule to obtain a fourth splicing result.
The fourth splicing result refers to the result of splicing the cells in the column header based on the row splicing rule. For more details on the row splicing rule, refer to fig. 6 and the related description, which are not repeated here.
Step 920, processing the fourth splicing result and the cell numbers and row numbers of the cells in the column header based on the column header classification model, and determining the classification result of the rows in the table to be processed.
In some embodiments, the fourth splicing result and the cell numbers and row numbers of the cells in the column header may be input into the column header classification model, whose output is the classification result of the rows in the table to be processed. In some embodiments, the column header classification model also comprises an encoding layer, a second fusion layer, an attention layer, a third fusion layer, and a second classification layer connected in sequence. Correspondingly, step 920 may also include the following steps:
Step 921, encoding the characters of the cells in the column header based on the fourth splicing result to obtain the encoding vectors of the characters. In some embodiments, the input of the encoding layer may be the fourth splicing result, and the output may be the encoding vectors of the characters in the column header.
Step 922, determining the encoding vector of the cell based on the cell number of the cell in the column header and the encoding vectors of the characters in the cell. In some embodiments, the second fusion layer may be used to determine the encoding vector of a cell in the column header based on the cell number of the cell and the encoding vectors of the individual characters in the cell. For example, the input of the second fusion layer may be the cell number of the cell in the column header and the encoding vectors of the characters in the cell, and the output may be the encoding vector of the cell.
Step 923, determining the cell fusion feature vectors of the cells in the column header based on the encoding vectors of the cells and the row numbers of the cells. In some embodiments, the attention layer may be used to determine the cell fusion feature vector of a cell in the column header based on the encoding vector of the cell and the row number of the cell. For example, the input of the attention layer may be the encoding vector of the cell in the column header and the row number of the cell, and the output may be the cell fusion feature vector of the cell. The cell fusion feature vector of a cell in the column header refers to a feature vector containing the relations with the other cells located in the same row of the column header.
Step 924, determining the row fusion feature vector of the row corresponding to the column header based on the cell fusion feature vectors of the cells in the column header. In some embodiments, the third fusion layer may be configured to determine the row fusion feature vector of the row corresponding to the column header based on the cell fusion feature vectors of the cells in the column header. For example, the input of the third fusion layer may be the cell fusion feature vectors of the cells in the column header, and the output may be the row fusion feature vector of the corresponding row. The row fusion feature vector refers to a feature vector that can represent the cells of the same row in the column header.
Step 925, determining the classification result of the row based on the row fusion feature vector of the row corresponding to the column header. In some embodiments, the second classification layer may be configured to determine the classification result of a row based on the row fusion feature vector of the row corresponding to the column header. For example, the input of the second classification layer may be the row fusion feature vector of the row corresponding to the column header, and the output may be the classification result of the row.
The content of step 920 is similar to that of step 820 (model content, execution mode, and training process); the only difference is that step 920 determines the classification result of the rows in the table to be processed, while step 820 determines the classification result of the columns. For more details on step 920, refer to step 820, which are not repeated here.
In some embodiments of the present description, the classification result of the columns in the table to be processed may be determined by the row header classification model, and the classification result of the rows may be determined by the column header classification model. Each of the two models comprises an encoding layer, a second fusion layer, an attention layer, a third fusion layer, and a second classification layer connected in sequence. Through this multilayer structure, the fusion features of the cells in the same column of the row header, or of the cells in the same row of the column header, can be obtained, and the classification results of the rows and columns can then be determined. The rows or columns in the table to be processed can thus be accurately classified, which improves the accuracy of the extraction of the table to be processed and ensures the accuracy of the extraction result.
Some embodiments of the present specification also disclose a table processing apparatus comprising at least one processor and at least one memory; the at least one memory is for storing computer instructions; the at least one processor is configured to execute at least a portion of the computer instructions to implement the form processing method as described in any of the preceding embodiments.
Some embodiments of the present specification also disclose a computer-readable storage medium that may store computer instructions that, when executed by a processor, implement a table processing method as described in any of the preceding embodiments.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not to be interpreted as implying that the claimed subject matter requires more features than are expressly recited in each claim. Indeed, embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
Numerals describing quantities of components, attributes, and the like are used in some embodiments; it should be understood that such numerals used in the description of the embodiments are modified in some instances by the terms "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and employ a general digit-preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments are approximations, in the specific examples such numerical values are set forth as precisely as practicable.
For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents of each are hereby incorporated by reference into this specification, except for any application history document that is inconsistent with or conflicts with the contents of this specification, and except for any document that limits the broadest scope of the claims of this specification (whether currently appended or later appended). It is to be understood that if the descriptions, definitions, and/or uses of terms in the material accompanying this specification are inconsistent or in conflict with those of this specification, the descriptions, definitions, and/or uses of terms in this specification shall control.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims (13)

1. A method of form processing, the method comprising:
acquiring a table to be processed;
processing the table to be processed based on a header detection model, and determining a row header and/or a column header of the table to be processed;
processing the row header based on a header classification model to determine a classification result of a column in the table to be processed, and/or processing the column header based on the header classification model to determine a classification result of a row in the table to be processed;
and extracting the table to be processed based on the classification result of the columns and/or rows in the table to be processed, and determining a first extraction result.
2. The method of claim 1, wherein the method further comprises:
processing the text of the cells in the table to be processed based on a text classification model, and determining the types of words in the text of the cells;
and extracting the table to be processed based on the type of the words in the text of the cell, and determining a second extraction result.
3. The method of claim 1, wherein the obtaining the table to be processed comprises:
acquiring an initial form and a title thereof;
processing the initial form and the text in the title thereof based on a form classification model, and determining the classification result of the initial form;
and when the classification result of the initial table meets a first preset condition, determining the table to be processed based on the initial table.
4. The method of claim 1, wherein the header detection model comprises a row header detection model and a column header detection model, and wherein the processing the table to be processed based on the header detection model and determining the row header and/or the column header of the table to be processed comprises:
processing rows in the table to be processed based on the row header detection model, and determining the row header of the table to be processed; and/or
processing columns in the table to be processed based on the column header detection model, and determining the column header of the table to be processed.
5. The method of claim 4, wherein the processing the rows in the table to be processed based on the row header detection model and determining the row header of the table to be processed comprises:
based on a row splicing rule, splicing the cells of the row in the table to be processed to obtain a first splicing result;
processing the first splicing result based on the row header detection model, and determining the row header of the table to be processed;
the processing the columns in the table to be processed based on the column header detection model and determining the column header of the table to be processed comprises:
based on the column splicing rule, splicing the cells of the columns in the table to be processed to obtain a second splicing result;
and processing the second splicing result based on the column header detection model, and determining the column header of the table to be processed.
6. The method of claim 5, wherein the header detection model comprises a feature embedding layer, a first sequence layer, a first fusion layer, a second sequence layer, and a first classification layer connected in sequence, wherein:
the characteristic embedding layer is used for determining a first characteristic vector of a cell in the table to be processed based on the splicing result of the cell in the table to be processed;
the first sequence layer is used for determining a second feature vector of a cell in the table to be processed based on a first feature vector of the cell in the table to be processed;
the first fusion layer is used for determining a third feature vector of a row or a column in the table to be processed based on the second feature vector of the cell in the table to be processed;
the second sequence layer is used for determining a fourth feature vector of a row in the table to be processed based on a third feature vector of the row in the table to be processed, or determining a fourth feature vector of a column in the table to be processed based on a third feature vector of the column in the table to be processed;
the first classification layer is used for determining the row header of the table to be processed based on a fourth feature vector of a row in the table to be processed, or determining the column header of the table to be processed based on a fourth feature vector of a column in the table to be processed.
7. The method of claim 6, wherein the first features comprise location features and text features of cells in the table to be processed.
8. The method of claim 1, wherein the header classification model comprises a row header classification model and a column header classification model, and the processing the row header and/or the column header based on the header classification model and determining the classification result of the columns and/or rows in the table to be processed comprises:
processing the row header based on the row header classification model, and determining the classification result of the columns in the table to be processed; and/or
processing the column header based on the column header classification model, and determining the classification result of the rows in the table to be processed.
9. The method of claim 8, wherein the processing the row header based on the row header classification model, and wherein determining the classification result for the column in the table to be processed comprises:
based on a column splicing rule, splicing the cells in the row header to obtain a third splicing result;
processing the third splicing result, the cell number and the column number of the cell in the row header based on the row header classification model, and determining a classification result of the column in the table to be processed;
the processing the column header based on the column header classification model and determining the classification result of the rows in the table to be processed comprises:
based on the row splicing rule, splicing the cells in the column header to obtain a fourth splicing result;
and processing the fourth splicing result and the cell numbers and row numbers of the cells in the column header based on the column header classification model, and determining the classification result of the rows in the table to be processed.
10. The method of claim 9, wherein the header classification model comprises an encoding layer, a second fusion layer, an attention layer, a third fusion layer, and a second classification layer connected in sequence; wherein:
the encoding layer is used for encoding the characters of the cells in the row header or the column header to obtain the encoding vectors of the characters;
the second fusion layer is used for determining the encoding vector of the cell based on the encoding vectors of the characters and the cell number of the cell in the column header or the row header;
the attention layer is used for determining a cell fusion feature vector of a cell in the column header based on the encoding vector of the cell in the column header and the row number of the cell, or determining a cell fusion feature vector of a cell in the row header based on the encoding vector of the cell in the row header and the column number of the cell, wherein the cell fusion feature vector of the cell in the column header contains attention information of the other cells located in the same row as the cell, and the cell fusion feature vector of the cell in the row header contains attention information of the other cells located in the same column as the cell;
the third fusion layer is used for determining a row fusion feature vector of the row corresponding to the column header based on the cell fusion feature vectors of the cells in the column header, or determining a column fusion feature vector of the column corresponding to the row header based on the cell fusion feature vectors of the cells in the row header;
the second classification layer is used for determining a classification result of a row based on the row fusion feature vector of the row corresponding to the column header, or determining a classification result of a column based on the column fusion feature vector of the column corresponding to the row header.
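The layered architecture of claim 10 can be sketched in miniature as follows. This is purely illustrative: the embedding width, the positional encoding, the dot-product attention, the mean pooling, and the random linear classifier are all assumptions standing in for the patent's trained layers.

```python
import numpy as np

D = 8  # embedding width (illustrative choice)

def embed_char(ch):
    # Deterministic stand-in for a learned character embedding.
    rng = np.random.default_rng(ord(ch))
    return rng.standard_normal(D)

def encode_cell(text, cell_index):
    """Coding + second fusion layers: average the character vectors of a
    cell, then add a simple positional encoding for its cell number."""
    chars = np.mean([embed_char(c) for c in text], axis=0) if text else np.zeros(D)
    pos = np.sin(cell_index / (10.0 ** (np.arange(D) / D)))
    return chars + pos

def attention_fuse(cell_vecs):
    """Attention layer: each header cell attends to the other cells in the
    same row (column header) or the same column (row header)."""
    X = np.stack(cell_vecs)                     # (n_cells, D)
    scores = X @ X.T / np.sqrt(D)               # dot-product attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X                          # cell fusion feature vectors

def classify_line(header_cells, W, b):
    """Third fusion + classification layers: pool the fused cell vectors
    into one row/column fusion vector and apply a linear classifier."""
    fused = attention_fuse([encode_cell(t, i) for i, t in enumerate(header_cells)])
    line_vec = fused.mean(axis=0)               # row/column fusion feature vector
    return int(np.argmax(W @ line_vec + b))

# Toy usage with a random 3-class classifier head.
rng = np.random.default_rng(1)
W, b = rng.standard_normal((3, D)), rng.standard_normal(3)
label = classify_line(["Revenue", "2021"], W, b)
assert label in (0, 1, 2)
```

In an actual implementation these layers would be trained end to end; the sketch only shows how the information flow described in the claim fits together.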
11. A table processing system, comprising:
an acquisition module for acquiring a table to be processed;
a first determining module for processing at least one row and/or at least one column of the table to be processed based on a header detection model to determine a row header and/or a column header of the table to be processed;
a second determining module for processing the row header based on a header classification model to determine a classification result of the columns in the table to be processed, and/or processing the column header based on the header classification model to determine a classification result of the rows in the table to be processed; and
an extraction module for extracting from the table to be processed, based on the classification result of each column and/or each row, a first extraction result.
12. A table processing apparatus, comprising at least one processor and at least one memory, wherein:
the at least one memory is configured to store computer instructions; and
the at least one processor is configured to execute at least some of the computer instructions to implement the method of any one of claims 1 to 10.
13. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 10.
CN202111659254.0A 2021-12-30 2021-12-30 Table processing method and system Pending CN114328536A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111659254.0A CN114328536A (en) 2021-12-30 2021-12-30 Table processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111659254.0A CN114328536A (en) 2021-12-30 2021-12-30 Table processing method and system

Publications (1)

Publication Number Publication Date
CN114328536A true CN114328536A (en) 2022-04-12

Family

ID=81019375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111659254.0A Pending CN114328536A (en) 2021-12-30 2021-12-30 Table processing method and system

Country Status (1)

Country Link
CN (1) CN114328536A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452707A (en) * 2023-06-20 2023-07-18 城云科技(中国)有限公司 Text generation method and device based on table and application of text generation method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452707A (en) * 2023-06-20 2023-07-18 城云科技(中国)有限公司 Text generation method and device based on table and application of text generation method and device
CN116452707B (en) * 2023-06-20 2023-09-12 城云科技(中国)有限公司 Text generation method and device based on table and application of text generation method and device

Similar Documents

Publication Publication Date Title
CN112434535B (en) Element extraction method, device, equipment and storage medium based on multiple models
CN113535963B (en) Long text event extraction method and device, computer equipment and storage medium
US11610271B1 (en) Transaction data processing systems and methods
CN112149387A (en) Visualization method and device for financial data, computer equipment and storage medium
CN114528418B (en) Text processing method, system and storage medium
US20230028664A1 (en) System and method for automatically tagging documents
CN114328536A (en) Table processing method and system
CN113836269B (en) Chapter-level core event extraction method based on question-answering system
CN109635289A (en) Entry classification method and audit information abstracting method
CN115640378A (en) Work order retrieval method, server, medium and product
CN111046934B (en) SWIFT message soft clause recognition method and device
CN113705201A (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN114064902A (en) Financial innovation patent classification method based on BERT model
CN116578613B (en) Data mining system for big data analysis
Wei A Hybrid Intelligent System for Company Financial Risk Detection Based on Tree-Based Model and Deep Neural Network
CN115952259B (en) Intelligent generation method of enterprise portrait tag
KR102663632B1 (en) Device and method for artwork trend data prediction using artificial intelligence
Kavitha et al. Screening and Ranking resume’s using Stacked Model
KR102386926B1 (en) Method and server for recommending user particitipated hs code
CN117994055A (en) Enterprise financial fraud identification method, device, equipment, medium and program product
CN115935945A (en) Financial file analysis method, system, device and medium
Chakraborty Three essays on using text analytic techniques for accounting research
Kolesnikova Classification of incoming payments
CN112561682A (en) Bank credit granting risk assessment method and system for small micro-enterprise
Patel Justification Mining: Developing a novel machine learning method for identifying representative sentences and summarising sentiment in financial text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination