CN113761202A

CN113761202A - Optimization system for mapping unstructured financial Excel table to database

Info

Publication number: CN113761202A
Application number: CN202111000985.4A
Authority: CN
Inventors: 景泳霖; 邹鸿岳; 冉小瑜; 周靖宇
Original assignee: Shanghai Kuaique Information Technology Co ltd
Current assignee: Shanghai Kuaique Information Technology Co ltd
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2021-12-07

Abstract

An optimization system for mapping unstructured financial Excel tables to a database relates to the technical field of data processing. The method comprises the following specific processes: providing a standard excel file input service interface, and receiving excel table data; judging that a plurality of sheet files exist in the excel, and performing the following operation on each sheet file in sequence; sequentially operating the single sheet file; constructing a set of neural network model, integrating the column names and contents of each column, and predicting the mapping relation between the information of no column and the column names in the database through the neural network model; the results are written to the structure database. The invention has the beneficial effects that: the standardization of the non-structural information in the Excel table is realized, the data in the table is mapped to the corresponding columns of the database, the accuracy rate of 96% -98% in financial services is achieved, and the commercial requirements are basically met.

Description

Optimization system for mapping unstructured financial Excel table to database

Technical Field

The invention relates to the technical field of data processing, in particular to an optimization system for mapping an unstructured financial Excel form to a database.

Background

In data analysis processing, complex tabular data is generally required to be arranged into a standard data format. The complex table usually includes various forms such as "merge cell, split cell" and the like on the table structure, and information such as highlight data, header and the like in the table is expressed in a form such as "change the ground color, the color of characters, and the bold characters of the table" on the representation. For such data, the patent aims to solve the problem that information such as areas of financial tables, table headers, column names and data in the tables are judged through an algorithm model, and the tables are restored by performing reverse operation for the case of merging and splitting cells.

In the prior art, Excel table data is converted into structured table data, and currently, a plurality of processing technologies exist; but can be divided into two technical directions as a whole, one is a method of indexing by rules, and the second is a deep learning neural network technology developed in these years. The rule indexing mainly comprises the steps of reading data and formats (such as whether combination is performed or not, font size, whether thickening is performed or not, background color and the like) in the Excel table through a certain rule and a program, and then sorting out a set of rule logic in a manual mode to judge information such as an area, a header, a column name, a specific data value and the like of the table. The deep neural network technology developed in the years mainly solves the problems of table area identification and positioning in an image and the relation between each unit cell in a table, namely whether the unit cells are combined or split, and whether the unit cells are in the same row or column. The following disadvantages still exist:

in the prior art, firstly, for structure identification of a table, two schemes are generally provided, the first scheme is a method for judging through a rule index, the first scheme has the defect that a large amount of manual work is needed for combing various rule logics, all conditions cannot be contained due to numerous rules, and the problem that the rules conflict with each other is often easy to occur, the second scheme is a neural network-based method, and the second scheme has certain learning and generalization capability for the table in an image, but after Excel data is converted into the image, a part of definite rule information and numerical value information in the table are lost; this results in a temporary failure of accuracy to reach the commercial level.

Secondly, for the table structure identification and data mapping of the universal domain, only the unstructured table is converted into a standard table. The data in the table cannot be mapped, that is, how to judge the corresponding table header according to the specific value in the table. Taking time as an example "2021-12-31" in the field of financial bonds, this value may be "release date" or "cut date", "bid start date", "bid end date", and the like. If placed in a general field, it is more likely, and therefore it is difficult to implement a general field, which corresponds to a column in the database, directly judged from the values in the table.

Disclosure of Invention

Aiming at the defects and shortcomings in the prior art, the invention provides an optimization system for mapping an unstructured financial Excel form to a database, so that the standardization of unstructured information in the Excel form is realized, the unstructured information is mapped to a corresponding column of the database according to data in the form, the accuracy rate in financial services is up to 96% -98%, and the commercial requirements are basically met.

In order to achieve the purpose, the invention adopts the following technical scheme: an optimization system for mapping unstructured financial Excel forms to a database comprises the following specific processes: providing a standard excel file input service interface, and receiving excel table data;

judging that a plurality of sheet files exist in the excel, and performing the following operation on each sheet file in sequence;

firstly, judging the state information of the sheet and recording the state information in a state library; secondly, judging that a plurality of independent tables, the title of each table and the header information are arranged on the basis of rules and an algorithm model of machine learning; finally, recombining the tables, and splitting the merged cells to form a standard M x N table;

constructing a set of neural network model, integrating the column names and contents of each column, and predicting the mapping relation between the information of no column and the column names in the database through the neural network model;

the results are written to the structure database.

Further, the step of providing a standard excel file input service interface, and the step of receiving the service interface in the excel form data, comprises the following steps: providing a standard web service interface, receiving an excel file sent by a front end, judging the file format, judging the sheet format in the file, recording the number of sheets, and extracting data in the sheet table one by one; calling a 'table structure recognition' model and a 'table data mapping' model one by one aiming at each sheet table, acquiring standardized table data in each sheet, and sorting the standardized table data into a standard format; and the service interface returns the standardized data, writes the data into the structural database or returns the data to the front end for displaying, and completes the complete flow of excel table mapping.

Further, the step of judging that table state information in a plurality of sheet files exists in the excel comprises the following steps: a state recorder: recording a starting row/column and an ending row/column; which lines are whole line merging cells, which lines are whole line text bolding, and which lines are whole line ground color modification; adopting a rule and a clustering algorithm to disassemble the sub-tables; identification of table element title, column name, data; table normalization, sub-table reassembly and merge cell splitting.

Further, the state recorder includes the following specific processes: reading table information of a sheet through an open source plug-in, wherein the table information comprises row number and column number; the specific method is that whether the line or the column is started or not is judged by judging whether data exists in each line or each column; logically judging whether each line is provided with a whole line merging cell, a whole line has a ground color or not, and a whole line has a font bold, and respectively writing the information into the state recorder; calculating the length of each row, considering some rows, and then empty, or the problem that a plurality of tables exist in one sheet; and calculating the information of the ending column of each line, wherein the rule of the ending point is that the cells of the current line have no data subsequently.

Further, the sub-table decomposition by using the rule and the clustering algorithm includes the following operation procedures: adopting rules to disassemble a table in a sheet, wherein the rules comprise the following points: rule 1, merging cells in an entire row, defining the merging cells as a division standard among a plurality of sub-tables, rule 2, thickening the cell data in the entire row, defining the merging cells as a division standard among the plurality of sub-tables, rule 3, adding ground colors to the table data in the entire row, defining the ground colors of the previous row and the ground colors of the current row as a division standard among the plurality of sub-tables, and adopting a clustering algorithm to perform clustering according to the length value of each row and the element characteristics of the row according to the clustering result, defining the clustering result as the own division standard of the plurality of sub-tables when two adjacent rows are not in the same class, wherein the clustering method comprises the following specific implementation steps: based on a DBSCAN clustering algorithm, combining text feature input embedding information to enable a centroid to move up and down for multiple times so as to record a certain length of code distribution deviation (Euclidean distance + data type KL distance), and if the deviation is obvious and is close to the current list header deviation, judging that the top and bottom are table boundaries.

Further, the identification of the table element title, column name and data comprises the following identification methods: the title identification of the table adopts rule logic, and the rule is that the cells are merged in the whole row, are merged downwards by default and serve as the title or remark information of the current table; identification of the column names of the table, rule logic, rule 1: if the whole row of fonts are thickened, the default is the column name of the table; rule 2, if the whole row has the background color, and the previous row and the next row have no background color or the background color is different from the current row, the default is the column name of the table; the table data identifies data that does not satisfy the above title and column name logic rules, and the default is data information in the table.

Further, the table standardization, sub-table recombination and merging cell splitting include the following operation flows: the sub-tables are recombined, and for each sub-table, the rule 1 is that if the column name of the current table is the same as that of the next sub-table, or one of the column names of the sub-tables completely belongs to the other column name of the sub-table, the two sub-tables are merged; and for each sub-table, judging whether the merged cells exist according to the table state, splitting the merged cells, and giving the same value to the split cells, thereby forming one to a plurality of standard cells of M × N.

Further, the operation on a single sheet file forms one or more standard M × N sub-tables according to the judgment that several sheet files exist in the excel; each fragment contains: the table comprises four parts of information, namely a table title, a table header, table data and remarks, wherein the table title and the remarks are text information of the whole row of merging cells; the header and table data are standard M × N data, and in the scheme, the header and table data need to be mapped with the column names in the database one by one according to the column names and the table data, and the table data mapping includes the following processes: the data preparation, considering that a column of data of each table may be very long, the scheme adopts a random extraction method, 5-10 unit cell data and column names are extracted from a column of data, a group of column information is combined, each column of data adopts the same method to form N groups of corresponding prediction data, a deep learning neural network is constructed, the network structure is shown as the fourth step in the scheme, through the algorithm model, each column of data is predicted, the mapping relation between the column of data and the column information in the database is judged, and final complete table structure information comprising the title, the column name, the data, remark information and the corresponding relation of the information of each column in the database is output for each sub-table.

Further, the step of constructing a set of neural network model, integrating the column names and contents of each column, and predicting the tabular data classification model in the mapping relation between the information of no column and the column names in the database through the neural network model comprises the following coding layers and processes: the coding layer is used for optimizing the whole pre-trained coding layer by considering that data types such as long texts, character strings, floating point numbers, integers and the like may exist in each table, and floating point, integer and character string information is crucial in a financial system, and the preprocessing of the floating point numbers and the integers by using a native pre-trained model is inappropriate; numerical encoding layer: and (3) carrying out an independent coding mode on the position of the numerical value: according to the corpus characteristics, 5 data with upper limit are supported to enter a numerical value coding system, each digit of each numerical value comprises a sign, an integer part supports 8 digits, a decimal part supports 4 digits, each digit is subjected to lossless one-hot coding and is parallel to text/character string coding, and a text part adopts a sequence mean value, a highest value and a tail sequence to be combined; the dimension of the coding layer is large, the dimension reduction is carried out in a text-cnn-like mode, 1d, a low out-channel and a large stride are used, a linear dimension reduction self-coder is matched, residual splicing is carried out, and dimension reduction operation is achieved; the output layer uses a full connection layer to access an upper layer feature map, dimension change possibly caused by upper layer splicing operation is realized by adopting batch-regularization, 10% of parameters are dropped by drop to prevent overfitting, then the full connection layer is accessed for calculation, multi-classification cross entropy adopted by loss is calculated, and weight adjustment is carried out on an unbalanced part; operating using a neural network model structure; and mapping each column of data to a corresponding column of the database.

Further, the optimization content of the coding layer includes the following: optimizing the pre-training model, and solving the problem that floating point numbers in the native pre-training model can be forcedly separated by decimal points to form a single token; the method comprises the steps that a pre-training coding layer is optimized, floating point numbers and integers in a primary pre-training model are often non-financial background in Wikipedia or other Chinese and English pre-training corpora, have no precision requirement and can be replaced at will, for example, year and common numerical values, even if replaced, the pre-training model cannot be affected by sequence change, financial contexts need very accurate numerical values, numerical value ranges have meanings, and the numerical values cannot be replaced and adjusted easily, so that the problem is solved; the method solves the problems that a numerical part and a character string in a primary pre-training model are treated equally, the training amount is obviously insufficient, and the influence of forgetting characteristics of a neural network is easy to occur; the compression rate of the native pre-training model is high, the numbers are stored to 312 dimensions, the size sequence information is lost, and the channel is shared with a large number of other characters, so that the serious data loss is caused, and the problem is optimized by the scheme.

After the technical scheme is adopted, the invention has the beneficial effects that: the standardization of the non-structural information in the Excel table is realized, the data in the table is mapped to the corresponding columns of the database, the accuracy rate of 96% -98% in financial services is achieved, and the commercial requirements are basically met.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of the optimization system of the present invention.

Fig. 2 is a diagram of a complete neural network model architecture in the present invention.

Detailed Description

Referring to fig. 1 to 2, the technical solution adopted by the present embodiment is: the method comprises the following specific processes: providing a standard excel file input service interface, and receiving excel table data;

the results are written to the structure database.

The method for providing the standard excel file input service interface and receiving the service interface in the excel form data comprises the following steps: providing a standard web service interface, receiving an excel file sent by a front end, judging the file format, judging the sheet format in the file, recording the number of sheets, and extracting data in the sheet table one by one; calling a 'table structure recognition' model and a 'table data mapping' model one by one aiming at each sheet table, acquiring standardized table data in each sheet, and sorting the standardized table data into a standard format; and the service interface returns the standardized data, writes the data into the structural database or returns the data to the front end for displaying, and completes the complete flow of excel table mapping.

The extraction of table state information in a plurality of sheet files in the excel is judged to comprise the following steps: a state recorder: recording a starting row/column and an ending row/column; which lines are whole line merging cells, which lines are whole line text bolding, and which lines are whole line ground color modification; adopting a rule and a clustering algorithm to disassemble the sub-tables; identification of table element title, column name, data; the method comprises the following steps of table standardization, sub-table recombination and merging cell splitting, wherein the state recorder comprises the following specific processes: reading table information of a sheet through an open source plug-in, wherein the table information comprises row number and column number; the specific method is that whether the line or the column is started or not is judged by judging whether data exists in each line or each column; logically judging whether each line is provided with a whole line merging cell, a whole line has a ground color or not, and a whole line has a font bold, and respectively writing the information into the state recorder; calculating the length of each row, considering some rows, and then empty, or the problem that a plurality of tables exist in one sheet; calculating the information of the end column of each row, wherein the rule of the end point is that no data exists in the cells of the current row, and the sub-table decomposition by adopting the rule and the clustering algorithm comprises the following operation flows: adopting rules to disassemble a table in a sheet, wherein the rules comprise the following points: rule 1, merging cells in an entire row, defining the merging cells as a division standard among a plurality of sub-tables, rule 2, thickening the cell data in the entire row, defining the merging cells as a division standard among the plurality of sub-tables, rule 3, adding ground colors to the table data in the entire row, defining the ground colors of the previous row and the ground colors of the current row as a division standard among the plurality of sub-tables, and adopting a clustering algorithm to perform clustering according to the length value of each row and the element characteristics of the row according to the clustering result, defining the clustering result as the own division standard of the plurality of sub-tables when two adjacent rows are not in the same class, wherein the clustering method comprises the following specific implementation steps: based on a DBSCAN clustering algorithm, combining text feature input embedding information to enable a centroid to move up and down for multiple times so as to record a certain length code distribution offset (Euclidean distance + data type KL distance), if the offset is obvious and is close to the offset of a current list header, the up and down are judged as a list boundary, and the identification of the list element title, the list name and the data comprises the following identification methods: the title identification of the table adopts rule logic, and the rule is that the cells are merged in the whole row, are merged downwards by default and serve as the title or remark information of the current table; identification of the column names of the table, rule logic, rule 1: if the whole row of fonts are thickened, the default is the column name of the table; rule 2, if the whole row has the background color, and the previous row and the next row have no background color or the background color is different from the current row, the default is the column name of the table; the table data identification, data which does not meet the logic rules of the title and the column name, is data information in a table by default, the table is standardized, and the sub-table recombination and merging cell splitting comprise the following operation flows: the sub-tables are recombined, and for each sub-table, the rule 1 is that if the column name of the current table is the same as that of the next sub-table, or one of the column names of the sub-tables completely belongs to the other column name of the sub-table, the two sub-tables are merged; and for each sub-table, judging whether the merged cells exist according to the table state, splitting the merged cells, and giving the same value to the split cells, thereby forming one to a plurality of standard cells of M × N.

The operation on the single sheet file is carried out, and one or more standard M x N sub-tables are formed according to the judgment that a plurality of sheet files exist in the excel; each fragment contains: the table comprises four parts of information, namely a table title, a table header, table data and remarks, wherein the table title and the remarks are text information of the whole row of merging cells; the header and table data are standard M × N data, and in the scheme, the header and table data need to be mapped with the column names in the database one by one according to the column names and the table data, and the table data mapping includes the following processes: the data preparation, considering that a column of data of each table may be very long, the scheme adopts a random extraction method, 5-10 unit cell data and column names are extracted from a column of data, a group of column information is combined, each column of data adopts the same method to form N groups of corresponding prediction data, a deep learning neural network is constructed, the network structure is shown as the fourth step in the scheme, through the algorithm model, each column of data is predicted, the mapping relation between the column of data and the column information in the database is judged, and final complete table structure information comprising the title, the column name, the data, remark information and the corresponding relation of the information of each column in the database is output for each sub-table.

The method comprises the following steps of constructing a set of neural network model, integrating column names and contents of each column, and predicting a table data classification model in a mapping relation between column names in a column-free information and a database through the neural network model, wherein the table data classification model comprises the following coding layers and processes: the coding layer is used for optimizing the whole pre-trained coding layer by considering that data types such as long texts, character strings, floating point numbers, integers and the like may exist in each table, and floating point, integer and character string information is crucial in a financial system, and the preprocessing of the floating point numbers and the integers by using a native pre-trained model is inappropriate; numerical encoding layer: and (3) carrying out an independent coding mode on the position of the numerical value: according to the corpus characteristics, 5 data with upper limit are supported to enter a numerical value coding system, each digit of each numerical value comprises a sign, an integer part supports 8 digits, a decimal part supports 4 digits, each digit is subjected to lossless one-hot coding and is parallel to text/character string coding, and a text part adopts a sequence mean value, a highest value and a tail sequence to be combined; the dimension of the coding layer is large, the dimension reduction is carried out in a text-cnn-like mode, 1d, a low out-channel and a large stride are used, a linear dimension reduction self-coder is matched, residual splicing is carried out, and dimension reduction operation is achieved; the output layer uses a full connection layer to access an upper layer feature map, dimension change possibly caused by upper layer splicing operation is realized by adopting batch-regularization, 10% of parameters are dropped by drop to prevent overfitting, then the full connection layer is accessed for calculation, multi-classification cross entropy adopted by loss is calculated, and weight adjustment is carried out on an unbalanced part; operating using a neural network model structure; and mapping each column of data to a corresponding column of the database, wherein the optimized content of the coding layer comprises the following steps: optimizing the pre-training model, and solving the problem that floating point numbers in the native pre-training model can be forcedly separated by decimal points to form a single token; the method comprises the steps that a pre-training coding layer is optimized, floating point numbers and integers in a primary pre-training model are often non-financial background in Wikipedia or other Chinese and English pre-training corpora, have no precision requirement and can be replaced at will, for example, year and common numerical values, even if replaced, the pre-training model cannot be affected by sequence change, financial contexts need very accurate numerical values, numerical value ranges have meanings, and the numerical values cannot be replaced and adjusted easily, so that the problem is solved; the method solves the problems that a numerical part and a character string in a primary pre-training model are treated equally, the training amount is obviously insufficient, and the influence of forgetting characteristics of a neural network is easy to occur; the compression rate of the native pre-training model is high, the numbers are stored to 312 dimensions, the size sequence information is lost, and the channel is shared with a large number of other characters, so that the serious data loss is caused, and the problem is optimized by the scheme.

Note: the invention has the following technical expansion characteristics and application scenes:

1. the method adopts a DBSCAN clustering algorithm, integrates Excel form state information extracted by a rule, and distinguishes and identifies structure information of an Excel form, including 'table title, column name, data' and 'a plurality of sub-tables', and the rule index + machine learning method can be expanded or applied to other form identification fields.

2. The data coding mode aiming at the 'large number' in the financial table can be used as an independent module or a processing mode and applied to coding layers or pre-training layers of related deep learning in other fields including 'large number'.

3. The model coding mode of the fusion 'rule label' can also be expanded to other related tasks needing to be combined with 'external rules'.

The above description is only for the purpose of illustrating the technical solutions of the present invention and not for the purpose of limiting the same, and other modifications or equivalent substitutions made by those skilled in the art to the technical solutions of the present invention should be covered within the scope of the claims of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. An optimization system for mapping unstructured financial Excel forms to a database, characterized by: the method comprises the following specific processes:

1) providing a standard excel file input service interface, and receiving excel table data;

2) judging that a plurality of sheet files exist in the excel, and performing the following operation on each sheet file, namely performing the operation step 3 on each sheet in sequence;

3) the following operations are sequentially performed on a single sheet file:

a) firstly, judging the state information of the sheet, and recording the state information in a state library;

b) secondly, judging that a plurality of independent tables, the title of each table and the header information are arranged on the basis of rules and an algorithm model of machine learning;

c) finally, recombining the tables, and splitting the merged cells to form a standard M x N table;

4) constructing a set of neural network model, integrating the column names and contents of each column, and predicting the mapping relation between the information of no column and the column names in the database through the neural network model;

5) the results are written to the structure database.

2. The optimization system for mapping unstructured financial Excel sheets to databases of claim 1, characterized in that: the service interface in the step 1 comprises the following steps:

1) providing a standard web service interface, receiving an excel file sent by a front end, judging the file format, judging the sheet format in the file, recording the number of sheets, and extracting data in the sheet table one by one;

2) calling a 'table structure recognition' model and a 'table data mapping' model one by one aiming at each sheet table, acquiring standardized table data in each sheet, and sorting the standardized table data into a standard format;

3) and the service interface returns the standardized data, writes the data into the structural database or returns the data to the front end for displaying, and completes the complete flow of excel table mapping.

3. The optimization system for mapping unstructured financial Excel sheets to databases of claim 1, characterized in that: the extraction of the table state information in the step 2 comprises the following steps:

1) a state recorder: recording a starting row/column and an ending row/column; which lines are whole line merging cells, which lines are whole line text bolding, and which lines are whole line ground color modification;

2) adopting a rule and a clustering algorithm to disassemble the sub-tables;

3) identification of table element title, column name, data;

4) table normalization, sub-table reassembly and merge cell splitting.

4. The optimization system for mapping unstructured financial Excel sheets to databases of claim 3, characterized in that: the step 1 comprises the following specific processes:

a) reading table information of a sheet through an open source plug-in, wherein the table information comprises row number and column number; the specific method is that whether the line or the column is started or not is judged by judging whether data exists in each line or each column;

b) logically judging whether each line is provided with a whole line merging cell, a whole line has a ground color or not, and a whole line has a font bold, and respectively writing the information into the state recorder;

c) calculating the length of each row, considering some rows, and then empty, or the problem that a plurality of tables exist in one sheet; and calculating the information of the ending column of each line, wherein the rule of the ending point is that the cells of the current line have no data subsequently.

5. The optimization system for mapping unstructured financial Excel sheets to databases of claim 3, characterized in that: the step 2 comprises the following operation flows:

a) adopting rules to disassemble a table in a sheet, wherein the rules comprise the following points:

rule 1, the condition of merging cells in the whole row is defined as a division standard among a plurality of sub-tables;

rule 2, under the condition that the cell data of the whole row is thickened, defining the cell data as the division standard among a plurality of sub-tables;

rule 3, adding the ground color to the table data of the whole row, wherein the ground color of the previous row is inconsistent with the ground color of the current row and is defined as a division standard among a plurality of sub-tables;

a rule 4 is to adopt a clustering algorithm to perform clustering according to the length value of each line and the element characteristics in the line, and define the clustering result as the own segmentation standard of a plurality of sub-tables when two adjacent lines are not in the same class;

b) the specific implementation method of clustering comprises the following steps: based on a DBSCAN clustering algorithm, combining text feature input embedding information to enable a centroid to move up and down for multiple times so as to record a certain length of code distribution deviation (Euclidean distance + data type KL distance), and if the deviation is obvious and is close to the current list header deviation, judging that the top and bottom are table boundaries.

6. The optimization system for mapping unstructured financial Excel sheets to databases of claim 3, characterized in that: the step 3 comprises the following identification methods:

a) the title identification of the table adopts rule logic, and the rule is that the cells are merged in the whole row, are merged downwards by default and serve as the title or remark information of the current table;

b) identification of the column names of the table, rule logic, rule 1: if the whole row of fonts are thickened, the default is the column name of the table; rule 2, if the whole row has the background color, and the previous row and the next row have no background color or the background color is different from the current row, the default is the column name of the table;

c) the table data identifies data that does not satisfy the above title and column name logic rules, and the default is data information in the table.

7. The optimization system for mapping unstructured financial Excel sheets to databases of claim 3, characterized in that: the step 4 comprises the following operation flows:

a) the sub-tables are recombined, and for each sub-table, the rule 1 is that if the column name of the current table is the same as that of the next sub-table, or one of the column names of the sub-tables completely belongs to the other column name of the sub-table, the two sub-tables are merged;

b) and for each sub-table, judging whether the merged cells exist according to the table state, splitting the merged cells, and giving the same value to the split cells, thereby forming one to a plurality of standard cells of M × N.

8. The optimization system for mapping unstructured financial Excel sheets to databases of claim 1, characterized in that: said step 3 according to step 2, we form one or more standard M x N sub-tables; each fragment contains: the scheme needs to map the table title and the table data with the column names in the database one by one according to the column names and the table data, and the mapping of the table data comprises the following processes:

1) preparing data, wherein in consideration of the possibility that a column of data of each table is very long, a scheme adopts a random extraction method, 5-10 unit cell data and column names are extracted from the column of data and combined into a group of column information, and each column of data adopts the same method to form N groups of corresponding prediction data;

2) constructing a deep learning neural network, wherein the network structure of the deep learning neural network is shown as the fourth step in the scheme, predicting each line of data through the algorithm model, and judging the mapping relation between the line of data and the line information in the database;

3) and outputting final complete table structure information containing the title, column name, data and remark information of the table and the corresponding relation of the information of each column in the database aiming at each sub-table.

9. The optimization system for mapping unstructured financial Excel sheets to databases of claim 1, characterized in that: the table data classification model in step 4 includes the following coding layers and processes:

1) the coding layer is used for optimizing the whole pre-trained coding layer by considering that data types such as long texts, character strings, floating point numbers, integers and the like may exist in each table, and floating point, integer and character string information is crucial in a financial system, and the preprocessing of the floating point numbers and the integers by using a native pre-trained model is inappropriate;

2) numerical encoding layer: and (3) carrying out an independent coding mode on the position of the numerical value: according to the corpus characteristics, 5 data with upper limit are supported to enter a numerical value coding system, each digit of each numerical value comprises a sign, an integer part supports 8 digits, a decimal part supports 4 digits, each digit is subjected to lossless one-hot coding and is parallel to text/character string coding, and a text part adopts a sequence mean value, a highest value and a tail sequence to be combined;

3) the dimension of the coding layer is large, the dimension reduction is carried out in a text-cnn-like mode, 1d, a low out-channel and a large stride are used, a linear dimension reduction self-coder is matched, residual splicing is carried out, and dimension reduction operation is achieved;

4) the output layer uses a full connection layer to access an upper layer feature map, dimension change possibly caused by upper layer splicing operation is realized by adopting batch-regularization, 10% of parameters are dropped by drop to prevent overfitting, then the full connection layer is accessed for calculation, multi-classification cross entropy adopted by loss is calculated, and weight adjustment is carried out on an unbalanced part;

5) operating using a neural network model structure;

6) and mapping each column of data to a corresponding column of the database.

10. The optimization system for mapping unstructured financial Excel sheets to a database in accordance with claim 9, characterized by: the optimized content of the coding layer comprises the following steps:

a) optimizing the pre-training model, and solving the problem that floating point numbers in the native pre-training model can be forcedly separated by decimal points to form a single token;

b) the method comprises the steps that a pre-training coding layer is optimized, floating point numbers and integers in a primary pre-training model are often non-financial background in Wikipedia or other Chinese and English pre-training corpora, have no precision requirement and can be replaced at will, for example, year and common numerical values, even if replaced, the pre-training model cannot be affected by sequence change, financial contexts need very accurate numerical values, numerical value ranges have meanings, and the numerical values cannot be replaced and adjusted easily, so that the problem is solved;

c) the method solves the problems that a numerical part and a character string in a primary pre-training model are treated equally, the training amount is obviously insufficient, and the influence of forgetting characteristics of a neural network is easy to occur;

d) the compression rate of the native pre-training model is high, the numbers are stored to 312 dimensions, the size sequence information is lost, and the channel is shared with a large number of other characters, so that the serious data loss is caused, and the problem is optimized by the scheme.