CN113761202A - Optimization system for mapping unstructured financial Excel table to database - Google Patents

Optimization system for mapping unstructured financial Excel table to database Download PDF

Info

Publication number
CN113761202A
CN113761202A CN202111000985.4A CN202111000985A CN113761202A CN 113761202 A CN113761202 A CN 113761202A CN 202111000985 A CN202111000985 A CN 202111000985A CN 113761202 A CN113761202 A CN 113761202A
Authority
CN
China
Prior art keywords
data
column
information
excel
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111000985.4A
Other languages
Chinese (zh)
Inventor
景泳霖
邹鸿岳
冉小瑜
周靖宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kuaique Information Technology Co ltd
Original Assignee
Shanghai Kuaique Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kuaique Information Technology Co ltd filed Critical Shanghai Kuaique Information Technology Co ltd
Priority to CN202111000985.4A priority Critical patent/CN113761202A/en
Publication of CN113761202A publication Critical patent/CN113761202A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/177Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

An optimization system for mapping unstructured financial Excel tables to a database relates to the technical field of data processing. The method comprises the following specific processes: providing a standard excel file input service interface, and receiving excel table data; judging that a plurality of sheet files exist in the excel, and performing the following operation on each sheet file in sequence; sequentially operating the single sheet file; constructing a set of neural network model, integrating the column names and contents of each column, and predicting the mapping relation between the information of no column and the column names in the database through the neural network model; the results are written to the structure database. The invention has the beneficial effects that: the standardization of the non-structural information in the Excel table is realized, the data in the table is mapped to the corresponding columns of the database, the accuracy rate of 96% -98% in financial services is achieved, and the commercial requirements are basically met.

Description

Optimization system for mapping unstructured financial Excel table to database
Technical Field
The invention relates to the technical field of data processing, in particular to an optimization system for mapping an unstructured financial Excel form to a database.
Background
In data analysis processing, complex tabular data is generally required to be arranged into a standard data format. The complex table usually includes various forms such as "merge cell, split cell" and the like on the table structure, and information such as highlight data, header and the like in the table is expressed in a form such as "change the ground color, the color of characters, and the bold characters of the table" on the representation. For such data, the patent aims to solve the problem that information such as areas of financial tables, table headers, column names and data in the tables are judged through an algorithm model, and the tables are restored by performing reverse operation for the case of merging and splitting cells.
In the prior art, Excel table data is converted into structured table data, and currently, a plurality of processing technologies exist; but can be divided into two technical directions as a whole, one is a method of indexing by rules, and the second is a deep learning neural network technology developed in these years. The rule indexing mainly comprises the steps of reading data and formats (such as whether combination is performed or not, font size, whether thickening is performed or not, background color and the like) in the Excel table through a certain rule and a program, and then sorting out a set of rule logic in a manual mode to judge information such as an area, a header, a column name, a specific data value and the like of the table. The deep neural network technology developed in the years mainly solves the problems of table area identification and positioning in an image and the relation between each unit cell in a table, namely whether the unit cells are combined or split, and whether the unit cells are in the same row or column. The following disadvantages still exist:
in the prior art, firstly, for structure identification of a table, two schemes are generally provided, the first scheme is a method for judging through a rule index, the first scheme has the defect that a large amount of manual work is needed for combing various rule logics, all conditions cannot be contained due to numerous rules, and the problem that the rules conflict with each other is often easy to occur, the second scheme is a neural network-based method, and the second scheme has certain learning and generalization capability for the table in an image, but after Excel data is converted into the image, a part of definite rule information and numerical value information in the table are lost; this results in a temporary failure of accuracy to reach the commercial level.
Secondly, for the table structure identification and data mapping of the universal domain, only the unstructured table is converted into a standard table. The data in the table cannot be mapped, that is, how to judge the corresponding table header according to the specific value in the table. Taking time as an example "2021-12-31" in the field of financial bonds, this value may be "release date" or "cut date", "bid start date", "bid end date", and the like. If placed in a general field, it is more likely, and therefore it is difficult to implement a general field, which corresponds to a column in the database, directly judged from the values in the table.
Disclosure of Invention
Aiming at the defects and shortcomings in the prior art, the invention provides an optimization system for mapping an unstructured financial Excel form to a database, so that the standardization of unstructured information in the Excel form is realized, the unstructured information is mapped to a corresponding column of the database according to data in the form, the accuracy rate in financial services is up to 96% -98%, and the commercial requirements are basically met.
In order to achieve the purpose, the invention adopts the following technical scheme: an optimization system for mapping unstructured financial Excel forms to a database comprises the following specific processes: providing a standard excel file input service interface, and receiving excel table data;
judging that a plurality of sheet files exist in the excel, and performing the following operation on each sheet file in sequence;
firstly, judging the state information of the sheet and recording the state information in a state library; secondly, judging that a plurality of independent tables, the title of each table and the header information are arranged on the basis of rules and an algorithm model of machine learning; finally, recombining the tables, and splitting the merged cells to form a standard M x N table;
constructing a set of neural network model, integrating the column names and contents of each column, and predicting the mapping relation between the information of no column and the column names in the database through the neural network model;
the results are written to the structure database.
Further, the step of providing a standard excel file input service interface, and the step of receiving the service interface in the excel form data, comprises the following steps: providing a standard web service interface, receiving an excel file sent by a front end, judging the file format, judging the sheet format in the file, recording the number of sheets, and extracting data in the sheet table one by one; calling a 'table structure recognition' model and a 'table data mapping' model one by one aiming at each sheet table, acquiring standardized table data in each sheet, and sorting the standardized table data into a standard format; and the service interface returns the standardized data, writes the data into the structural database or returns the data to the front end for displaying, and completes the complete flow of excel table mapping.
Further, the step of judging that table state information in a plurality of sheet files exists in the excel comprises the following steps: a state recorder: recording a starting row/column and an ending row/column; which lines are whole line merging cells, which lines are whole line text bolding, and which lines are whole line ground color modification; adopting a rule and a clustering algorithm to disassemble the sub-tables; identification of table element title, column name, data; table normalization, sub-table reassembly and merge cell splitting.
Further, the state recorder includes the following specific processes: reading table information of a sheet through an open source plug-in, wherein the table information comprises row number and column number; the specific method is that whether the line or the column is started or not is judged by judging whether data exists in each line or each column; logically judging whether each line is provided with a whole line merging cell, a whole line has a ground color or not, and a whole line has a font bold, and respectively writing the information into the state recorder; calculating the length of each row, considering some rows, and then empty, or the problem that a plurality of tables exist in one sheet; and calculating the information of the ending column of each line, wherein the rule of the ending point is that the cells of the current line have no data subsequently.
Further, the sub-table decomposition by using the rule and the clustering algorithm includes the following operation procedures: adopting rules to disassemble a table in a sheet, wherein the rules comprise the following points: rule 1, merging cells in an entire row, defining the merging cells as a division standard among a plurality of sub-tables, rule 2, thickening the cell data in the entire row, defining the merging cells as a division standard among the plurality of sub-tables, rule 3, adding ground colors to the table data in the entire row, defining the ground colors of the previous row and the ground colors of the current row as a division standard among the plurality of sub-tables, and adopting a clustering algorithm to perform clustering according to the length value of each row and the element characteristics of the row according to the clustering result, defining the clustering result as the own division standard of the plurality of sub-tables when two adjacent rows are not in the same class, wherein the clustering method comprises the following specific implementation steps: based on a DBSCAN clustering algorithm, combining text feature input embedding information to enable a centroid to move up and down for multiple times so as to record a certain length of code distribution deviation (Euclidean distance + data type KL distance), and if the deviation is obvious and is close to the current list header deviation, judging that the top and bottom are table boundaries.
Further, the identification of the table element title, column name and data comprises the following identification methods: the title identification of the table adopts rule logic, and the rule is that the cells are merged in the whole row, are merged downwards by default and serve as the title or remark information of the current table; identification of the column names of the table, rule logic, rule 1: if the whole row of fonts are thickened, the default is the column name of the table; rule 2, if the whole row has the background color, and the previous row and the next row have no background color or the background color is different from the current row, the default is the column name of the table; the table data identifies data that does not satisfy the above title and column name logic rules, and the default is data information in the table.
Further, the table standardization, sub-table recombination and merging cell splitting include the following operation flows: the sub-tables are recombined, and for each sub-table, the rule 1 is that if the column name of the current table is the same as that of the next sub-table, or one of the column names of the sub-tables completely belongs to the other column name of the sub-table, the two sub-tables are merged; and for each sub-table, judging whether the merged cells exist according to the table state, splitting the merged cells, and giving the same value to the split cells, thereby forming one to a plurality of standard cells of M × N.
Further, the operation on a single sheet file forms one or more standard M × N sub-tables according to the judgment that several sheet files exist in the excel; each fragment contains: the table comprises four parts of information, namely a table title, a table header, table data and remarks, wherein the table title and the remarks are text information of the whole row of merging cells; the header and table data are standard M × N data, and in the scheme, the header and table data need to be mapped with the column names in the database one by one according to the column names and the table data, and the table data mapping includes the following processes: the data preparation, considering that a column of data of each table may be very long, the scheme adopts a random extraction method, 5-10 unit cell data and column names are extracted from a column of data, a group of column information is combined, each column of data adopts the same method to form N groups of corresponding prediction data, a deep learning neural network is constructed, the network structure is shown as the fourth step in the scheme, through the algorithm model, each column of data is predicted, the mapping relation between the column of data and the column information in the database is judged, and final complete table structure information comprising the title, the column name, the data, remark information and the corresponding relation of the information of each column in the database is output for each sub-table.
Further, the step of constructing a set of neural network model, integrating the column names and contents of each column, and predicting the tabular data classification model in the mapping relation between the information of no column and the column names in the database through the neural network model comprises the following coding layers and processes: the coding layer is used for optimizing the whole pre-trained coding layer by considering that data types such as long texts, character strings, floating point numbers, integers and the like may exist in each table, and floating point, integer and character string information is crucial in a financial system, and the preprocessing of the floating point numbers and the integers by using a native pre-trained model is inappropriate; numerical encoding layer: and (3) carrying out an independent coding mode on the position of the numerical value: according to the corpus characteristics, 5 data with upper limit are supported to enter a numerical value coding system, each digit of each numerical value comprises a sign, an integer part supports 8 digits, a decimal part supports 4 digits, each digit is subjected to lossless one-hot coding and is parallel to text/character string coding, and a text part adopts a sequence mean value, a highest value and a tail sequence to be combined; the dimension of the coding layer is large, the dimension reduction is carried out in a text-cnn-like mode, 1d, a low out-channel and a large stride are used, a linear dimension reduction self-coder is matched, residual splicing is carried out, and dimension reduction operation is achieved; the output layer uses a full connection layer to access an upper layer feature map, dimension change possibly caused by upper layer splicing operation is realized by adopting batch-regularization, 10% of parameters are dropped by drop to prevent overfitting, then the full connection layer is accessed for calculation, multi-classification cross entropy adopted by loss is calculated, and weight adjustment is carried out on an unbalanced part; operating using a neural network model structure; and mapping each column of data to a corresponding column of the database.
Further, the optimization content of the coding layer includes the following: optimizing the pre-training model, and solving the problem that floating point numbers in the native pre-training model can be forcedly separated by decimal points to form a single token; the method comprises the steps that a pre-training coding layer is optimized, floating point numbers and integers in a primary pre-training model are often non-financial background in Wikipedia or other Chinese and English pre-training corpora, have no precision requirement and can be replaced at will, for example, year and common numerical values, even if replaced, the pre-training model cannot be affected by sequence change, financial contexts need very accurate numerical values, numerical value ranges have meanings, and the numerical values cannot be replaced and adjusted easily, so that the problem is solved; the method solves the problems that a numerical part and a character string in a primary pre-training model are treated equally, the training amount is obviously insufficient, and the influence of forgetting characteristics of a neural network is easy to occur; the compression rate of the native pre-training model is high, the numbers are stored to 312 dimensions, the size sequence information is lost, and the channel is shared with a large number of other characters, so that the serious data loss is caused, and the problem is optimized by the scheme.
After the technical scheme is adopted, the invention has the beneficial effects that: the standardization of the non-structural information in the Excel table is realized, the data in the table is mapped to the corresponding columns of the database, the accuracy rate of 96% -98% in financial services is achieved, and the commercial requirements are basically met.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of the optimization system of the present invention.
Fig. 2 is a diagram of a complete neural network model architecture in the present invention.
Detailed Description
Referring to fig. 1 to 2, the technical solution adopted by the present embodiment is: the method comprises the following specific processes: providing a standard excel file input service interface, and receiving excel table data;
judging that a plurality of sheet files exist in the excel, and performing the following operation on each sheet file in sequence;
firstly, judging the state information of the sheet and recording the state information in a state library; secondly, judging that a plurality of independent tables, the title of each table and the header information are arranged on the basis of rules and an algorithm model of machine learning; finally, recombining the tables, and splitting the merged cells to form a standard M x N table;
constructing a set of neural network model, integrating the column names and contents of each column, and predicting the mapping relation between the information of no column and the column names in the database through the neural network model;
the results are written to the structure database.
The method for providing the standard excel file input service interface and receiving the service interface in the excel form data comprises the following steps: providing a standard web service interface, receiving an excel file sent by a front end, judging the file format, judging the sheet format in the file, recording the number of sheets, and extracting data in the sheet table one by one; calling a 'table structure recognition' model and a 'table data mapping' model one by one aiming at each sheet table, acquiring standardized table data in each sheet, and sorting the standardized table data into a standard format; and the service interface returns the standardized data, writes the data into the structural database or returns the data to the front end for displaying, and completes the complete flow of excel table mapping.
The extraction of table state information in a plurality of sheet files in the excel is judged to comprise the following steps: a state recorder: recording a starting row/column and an ending row/column; which lines are whole line merging cells, which lines are whole line text bolding, and which lines are whole line ground color modification; adopting a rule and a clustering algorithm to disassemble the sub-tables; identification of table element title, column name, data; the method comprises the following steps of table standardization, sub-table recombination and merging cell splitting, wherein the state recorder comprises the following specific processes: reading table information of a sheet through an open source plug-in, wherein the table information comprises row number and column number; the specific method is that whether the line or the column is started or not is judged by judging whether data exists in each line or each column; logically judging whether each line is provided with a whole line merging cell, a whole line has a ground color or not, and a whole line has a font bold, and respectively writing the information into the state recorder; calculating the length of each row, considering some rows, and then empty, or the problem that a plurality of tables exist in one sheet; calculating the information of the end column of each row, wherein the rule of the end point is that no data exists in the cells of the current row, and the sub-table decomposition by adopting the rule and the clustering algorithm comprises the following operation flows: adopting rules to disassemble a table in a sheet, wherein the rules comprise the following points: rule 1, merging cells in an entire row, defining the merging cells as a division standard among a plurality of sub-tables, rule 2, thickening the cell data in the entire row, defining the merging cells as a division standard among the plurality of sub-tables, rule 3, adding ground colors to the table data in the entire row, defining the ground colors of the previous row and the ground colors of the current row as a division standard among the plurality of sub-tables, and adopting a clustering algorithm to perform clustering according to the length value of each row and the element characteristics of the row according to the clustering result, defining the clustering result as the own division standard of the plurality of sub-tables when two adjacent rows are not in the same class, wherein the clustering method comprises the following specific implementation steps: based on a DBSCAN clustering algorithm, combining text feature input embedding information to enable a centroid to move up and down for multiple times so as to record a certain length code distribution offset (Euclidean distance + data type KL distance), if the offset is obvious and is close to the offset of a current list header, the up and down are judged as a list boundary, and the identification of the list element title, the list name and the data comprises the following identification methods: the title identification of the table adopts rule logic, and the rule is that the cells are merged in the whole row, are merged downwards by default and serve as the title or remark information of the current table; identification of the column names of the table, rule logic, rule 1: if the whole row of fonts are thickened, the default is the column name of the table; rule 2, if the whole row has the background color, and the previous row and the next row have no background color or the background color is different from the current row, the default is the column name of the table; the table data identification, data which does not meet the logic rules of the title and the column name, is data information in a table by default, the table is standardized, and the sub-table recombination and merging cell splitting comprise the following operation flows: the sub-tables are recombined, and for each sub-table, the rule 1 is that if the column name of the current table is the same as that of the next sub-table, or one of the column names of the sub-tables completely belongs to the other column name of the sub-table, the two sub-tables are merged; and for each sub-table, judging whether the merged cells exist according to the table state, splitting the merged cells, and giving the same value to the split cells, thereby forming one to a plurality of standard cells of M × N.
The operation on the single sheet file is carried out, and one or more standard M x N sub-tables are formed according to the judgment that a plurality of sheet files exist in the excel; each fragment contains: the table comprises four parts of information, namely a table title, a table header, table data and remarks, wherein the table title and the remarks are text information of the whole row of merging cells; the header and table data are standard M × N data, and in the scheme, the header and table data need to be mapped with the column names in the database one by one according to the column names and the table data, and the table data mapping includes the following processes: the data preparation, considering that a column of data of each table may be very long, the scheme adopts a random extraction method, 5-10 unit cell data and column names are extracted from a column of data, a group of column information is combined, each column of data adopts the same method to form N groups of corresponding prediction data, a deep learning neural network is constructed, the network structure is shown as the fourth step in the scheme, through the algorithm model, each column of data is predicted, the mapping relation between the column of data and the column information in the database is judged, and final complete table structure information comprising the title, the column name, the data, remark information and the corresponding relation of the information of each column in the database is output for each sub-table.
The method comprises the following steps of constructing a set of neural network model, integrating column names and contents of each column, and predicting a table data classification model in a mapping relation between column names in a column-free information and a database through the neural network model, wherein the table data classification model comprises the following coding layers and processes: the coding layer is used for optimizing the whole pre-trained coding layer by considering that data types such as long texts, character strings, floating point numbers, integers and the like may exist in each table, and floating point, integer and character string information is crucial in a financial system, and the preprocessing of the floating point numbers and the integers by using a native pre-trained model is inappropriate; numerical encoding layer: and (3) carrying out an independent coding mode on the position of the numerical value: according to the corpus characteristics, 5 data with upper limit are supported to enter a numerical value coding system, each digit of each numerical value comprises a sign, an integer part supports 8 digits, a decimal part supports 4 digits, each digit is subjected to lossless one-hot coding and is parallel to text/character string coding, and a text part adopts a sequence mean value, a highest value and a tail sequence to be combined; the dimension of the coding layer is large, the dimension reduction is carried out in a text-cnn-like mode, 1d, a low out-channel and a large stride are used, a linear dimension reduction self-coder is matched, residual splicing is carried out, and dimension reduction operation is achieved; the output layer uses a full connection layer to access an upper layer feature map, dimension change possibly caused by upper layer splicing operation is realized by adopting batch-regularization, 10% of parameters are dropped by drop to prevent overfitting, then the full connection layer is accessed for calculation, multi-classification cross entropy adopted by loss is calculated, and weight adjustment is carried out on an unbalanced part; operating using a neural network model structure; and mapping each column of data to a corresponding column of the database, wherein the optimized content of the coding layer comprises the following steps: optimizing the pre-training model, and solving the problem that floating point numbers in the native pre-training model can be forcedly separated by decimal points to form a single token; the method comprises the steps that a pre-training coding layer is optimized, floating point numbers and integers in a primary pre-training model are often non-financial background in Wikipedia or other Chinese and English pre-training corpora, have no precision requirement and can be replaced at will, for example, year and common numerical values, even if replaced, the pre-training model cannot be affected by sequence change, financial contexts need very accurate numerical values, numerical value ranges have meanings, and the numerical values cannot be replaced and adjusted easily, so that the problem is solved; the method solves the problems that a numerical part and a character string in a primary pre-training model are treated equally, the training amount is obviously insufficient, and the influence of forgetting characteristics of a neural network is easy to occur; the compression rate of the native pre-training model is high, the numbers are stored to 312 dimensions, the size sequence information is lost, and the channel is shared with a large number of other characters, so that the serious data loss is caused, and the problem is optimized by the scheme.
Note: the invention has the following technical expansion characteristics and application scenes:
1. the method adopts a DBSCAN clustering algorithm, integrates Excel form state information extracted by a rule, and distinguishes and identifies structure information of an Excel form, including 'table title, column name, data' and 'a plurality of sub-tables', and the rule index + machine learning method can be expanded or applied to other form identification fields.
2. The data coding mode aiming at the 'large number' in the financial table can be used as an independent module or a processing mode and applied to coding layers or pre-training layers of related deep learning in other fields including 'large number'.
3. The model coding mode of the fusion 'rule label' can also be expanded to other related tasks needing to be combined with 'external rules'.
The above description is only for the purpose of illustrating the technical solutions of the present invention and not for the purpose of limiting the same, and other modifications or equivalent substitutions made by those skilled in the art to the technical solutions of the present invention should be covered within the scope of the claims of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. An optimization system for mapping unstructured financial Excel forms to a database, characterized by: the method comprises the following specific processes:
1) providing a standard excel file input service interface, and receiving excel table data;
2) judging that a plurality of sheet files exist in the excel, and performing the following operation on each sheet file, namely performing the operation step 3 on each sheet in sequence;
3) the following operations are sequentially performed on a single sheet file:
a) firstly, judging the state information of the sheet, and recording the state information in a state library;
b) secondly, judging that a plurality of independent tables, the title of each table and the header information are arranged on the basis of rules and an algorithm model of machine learning;
c) finally, recombining the tables, and splitting the merged cells to form a standard M x N table;
4) constructing a set of neural network model, integrating the column names and contents of each column, and predicting the mapping relation between the information of no column and the column names in the database through the neural network model;
5) the results are written to the structure database.
2. The optimization system for mapping unstructured financial Excel sheets to databases of claim 1, characterized in that: the service interface in the step 1 comprises the following steps:
1) providing a standard web service interface, receiving an excel file sent by a front end, judging the file format, judging the sheet format in the file, recording the number of sheets, and extracting data in the sheet table one by one;
2) calling a 'table structure recognition' model and a 'table data mapping' model one by one aiming at each sheet table, acquiring standardized table data in each sheet, and sorting the standardized table data into a standard format;
3) and the service interface returns the standardized data, writes the data into the structural database or returns the data to the front end for displaying, and completes the complete flow of excel table mapping.
3. The optimization system for mapping unstructured financial Excel sheets to databases of claim 1, characterized in that: the extraction of the table state information in the step 2 comprises the following steps:
1) a state recorder: recording a starting row/column and an ending row/column; which lines are whole line merging cells, which lines are whole line text bolding, and which lines are whole line ground color modification;
2) adopting a rule and a clustering algorithm to disassemble the sub-tables;
3) identification of table element title, column name, data;
4) table normalization, sub-table reassembly and merge cell splitting.
4. The optimization system for mapping unstructured financial Excel sheets to databases of claim 3, characterized in that: the step 1 comprises the following specific processes:
a) reading table information of a sheet through an open source plug-in, wherein the table information comprises row number and column number; the specific method is that whether the line or the column is started or not is judged by judging whether data exists in each line or each column;
b) logically judging whether each line is provided with a whole line merging cell, a whole line has a ground color or not, and a whole line has a font bold, and respectively writing the information into the state recorder;
c) calculating the length of each row, considering some rows, and then empty, or the problem that a plurality of tables exist in one sheet; and calculating the information of the ending column of each line, wherein the rule of the ending point is that the cells of the current line have no data subsequently.
5. The optimization system for mapping unstructured financial Excel sheets to databases of claim 3, characterized in that: the step 2 comprises the following operation flows:
a) adopting rules to disassemble a table in a sheet, wherein the rules comprise the following points:
rule 1, the condition of merging cells in the whole row is defined as a division standard among a plurality of sub-tables;
rule 2, under the condition that the cell data of the whole row is thickened, defining the cell data as the division standard among a plurality of sub-tables;
rule 3, adding the ground color to the table data of the whole row, wherein the ground color of the previous row is inconsistent with the ground color of the current row and is defined as a division standard among a plurality of sub-tables;
a rule 4 is to adopt a clustering algorithm to perform clustering according to the length value of each line and the element characteristics in the line, and define the clustering result as the own segmentation standard of a plurality of sub-tables when two adjacent lines are not in the same class;
b) the specific implementation method of clustering comprises the following steps: based on a DBSCAN clustering algorithm, combining text feature input embedding information to enable a centroid to move up and down for multiple times so as to record a certain length of code distribution deviation (Euclidean distance + data type KL distance), and if the deviation is obvious and is close to the current list header deviation, judging that the top and bottom are table boundaries.
6. The optimization system for mapping unstructured financial Excel sheets to databases of claim 3, characterized in that: the step 3 comprises the following identification methods:
a) the title identification of the table adopts rule logic, and the rule is that the cells are merged in the whole row, are merged downwards by default and serve as the title or remark information of the current table;
b) identification of the column names of the table, rule logic, rule 1: if the whole row of fonts are thickened, the default is the column name of the table; rule 2, if the whole row has the background color, and the previous row and the next row have no background color or the background color is different from the current row, the default is the column name of the table;
c) the table data identifies data that does not satisfy the above title and column name logic rules, and the default is data information in the table.
7. The optimization system for mapping unstructured financial Excel sheets to databases of claim 3, characterized in that: the step 4 comprises the following operation flows:
a) the sub-tables are recombined, and for each sub-table, the rule 1 is that if the column name of the current table is the same as that of the next sub-table, or one of the column names of the sub-tables completely belongs to the other column name of the sub-table, the two sub-tables are merged;
b) and for each sub-table, judging whether the merged cells exist according to the table state, splitting the merged cells, and giving the same value to the split cells, thereby forming one to a plurality of standard cells of M × N.
8. The optimization system for mapping unstructured financial Excel sheets to databases of claim 1, characterized in that: said step 3 according to step 2, we form one or more standard M x N sub-tables; each fragment contains: the scheme needs to map the table title and the table data with the column names in the database one by one according to the column names and the table data, and the mapping of the table data comprises the following processes:
1) preparing data, wherein in consideration of the possibility that a column of data of each table is very long, a scheme adopts a random extraction method, 5-10 unit cell data and column names are extracted from the column of data and combined into a group of column information, and each column of data adopts the same method to form N groups of corresponding prediction data;
2) constructing a deep learning neural network, wherein the network structure of the deep learning neural network is shown as the fourth step in the scheme, predicting each line of data through the algorithm model, and judging the mapping relation between the line of data and the line information in the database;
3) and outputting final complete table structure information containing the title, column name, data and remark information of the table and the corresponding relation of the information of each column in the database aiming at each sub-table.
9. The optimization system for mapping unstructured financial Excel sheets to databases of claim 1, characterized in that: the table data classification model in step 4 includes the following coding layers and processes:
1) the coding layer is used for optimizing the whole pre-trained coding layer by considering that data types such as long texts, character strings, floating point numbers, integers and the like may exist in each table, and floating point, integer and character string information is crucial in a financial system, and the preprocessing of the floating point numbers and the integers by using a native pre-trained model is inappropriate;
2) numerical encoding layer: and (3) carrying out an independent coding mode on the position of the numerical value: according to the corpus characteristics, 5 data with upper limit are supported to enter a numerical value coding system, each digit of each numerical value comprises a sign, an integer part supports 8 digits, a decimal part supports 4 digits, each digit is subjected to lossless one-hot coding and is parallel to text/character string coding, and a text part adopts a sequence mean value, a highest value and a tail sequence to be combined;
3) the dimension of the coding layer is large, the dimension reduction is carried out in a text-cnn-like mode, 1d, a low out-channel and a large stride are used, a linear dimension reduction self-coder is matched, residual splicing is carried out, and dimension reduction operation is achieved;
4) the output layer uses a full connection layer to access an upper layer feature map, dimension change possibly caused by upper layer splicing operation is realized by adopting batch-regularization, 10% of parameters are dropped by drop to prevent overfitting, then the full connection layer is accessed for calculation, multi-classification cross entropy adopted by loss is calculated, and weight adjustment is carried out on an unbalanced part;
5) operating using a neural network model structure;
6) and mapping each column of data to a corresponding column of the database.
10. The optimization system for mapping unstructured financial Excel sheets to a database in accordance with claim 9, characterized by: the optimized content of the coding layer comprises the following steps:
a) optimizing the pre-training model, and solving the problem that floating point numbers in the native pre-training model can be forcedly separated by decimal points to form a single token;
b) the method comprises the steps that a pre-training coding layer is optimized, floating point numbers and integers in a primary pre-training model are often non-financial background in Wikipedia or other Chinese and English pre-training corpora, have no precision requirement and can be replaced at will, for example, year and common numerical values, even if replaced, the pre-training model cannot be affected by sequence change, financial contexts need very accurate numerical values, numerical value ranges have meanings, and the numerical values cannot be replaced and adjusted easily, so that the problem is solved;
c) the method solves the problems that a numerical part and a character string in a primary pre-training model are treated equally, the training amount is obviously insufficient, and the influence of forgetting characteristics of a neural network is easy to occur;
d) the compression rate of the native pre-training model is high, the numbers are stored to 312 dimensions, the size sequence information is lost, and the channel is shared with a large number of other characters, so that the serious data loss is caused, and the problem is optimized by the scheme.
CN202111000985.4A 2021-08-30 2021-08-30 Optimization system for mapping unstructured financial Excel table to database Pending CN113761202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000985.4A CN113761202A (en) 2021-08-30 2021-08-30 Optimization system for mapping unstructured financial Excel table to database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111000985.4A CN113761202A (en) 2021-08-30 2021-08-30 Optimization system for mapping unstructured financial Excel table to database

Publications (1)

Publication Number Publication Date
CN113761202A true CN113761202A (en) 2021-12-07

Family

ID=78791794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000985.4A Pending CN113761202A (en) 2021-08-30 2021-08-30 Optimization system for mapping unstructured financial Excel table to database

Country Status (1)

Country Link
CN (1) CN113761202A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114595669A (en) * 2022-03-11 2022-06-07 北京梦诚科技有限公司 Calculation table identification method and system, electronic equipment and storage medium
CN115374204A (en) * 2022-08-25 2022-11-22 中国共产党张店区纪律检查委员会机关 Technical scheme for efficiently tracking fund source and destination
CN115658778A (en) * 2022-07-27 2023-01-31 重庆忽米网络科技有限公司 Excel data source-based data processing method for visual application creation
CN117077640A (en) * 2023-10-16 2023-11-17 鲁班(北京)电子商务科技有限公司 Unified processing method and device for multi-type dynamic table data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0934863A (en) * 1995-07-14 1997-02-07 Hitachi Ltd Information integral processing method by neural network
US20140369602A1 (en) * 2013-06-14 2014-12-18 Lexmark International Technology S.A. Methods for Automatic Structured Extraction of Data in OCR Documents Having Tabular Data
US20160350652A1 (en) * 2015-05-29 2016-12-01 North Carolina State University Determining edit operations for normalizing electronic communications using a neural network
CN108427721A (en) * 2018-02-08 2018-08-21 湖南慧集网络科技有限责任公司 A kind of standardized method of the information on bidding based on database and system
CN110889310A (en) * 2018-09-07 2020-03-17 上海怀若智能科技有限公司 Financial document information intelligent extraction system and method
CN112883795A (en) * 2021-01-19 2021-06-01 贵州电网有限责任公司 Rapid and automatic table extraction method based on deep neural network
CN113239659A (en) * 2021-04-21 2021-08-10 上海快确信息科技有限公司 Text number extraction device integrating rules

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0934863A (en) * 1995-07-14 1997-02-07 Hitachi Ltd Information integral processing method by neural network
US20140369602A1 (en) * 2013-06-14 2014-12-18 Lexmark International Technology S.A. Methods for Automatic Structured Extraction of Data in OCR Documents Having Tabular Data
US20160350652A1 (en) * 2015-05-29 2016-12-01 North Carolina State University Determining edit operations for normalizing electronic communications using a neural network
CN108427721A (en) * 2018-02-08 2018-08-21 湖南慧集网络科技有限责任公司 A kind of standardized method of the information on bidding based on database and system
CN110889310A (en) * 2018-09-07 2020-03-17 上海怀若智能科技有限公司 Financial document information intelligent extraction system and method
CN112883795A (en) * 2021-01-19 2021-06-01 贵州电网有限责任公司 Rapid and automatic table extraction method based on deep neural network
CN113239659A (en) * 2021-04-21 2021-08-10 上海快确信息科技有限公司 Text number extraction device integrating rules

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵洪;王芳;: "大规模异构的政府统计报表信息抽取与集成融合研究", 情报学报, no. 09, 24 September 2020 (2020-09-24) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114595669A (en) * 2022-03-11 2022-06-07 北京梦诚科技有限公司 Calculation table identification method and system, electronic equipment and storage medium
CN114595669B (en) * 2022-03-11 2022-09-27 北京梦诚科技有限公司 Calculation table identification method and system, electronic equipment and storage medium
CN115658778A (en) * 2022-07-27 2023-01-31 重庆忽米网络科技有限公司 Excel data source-based data processing method for visual application creation
CN115658778B (en) * 2022-07-27 2023-09-12 重庆忽米网络科技有限公司 Excel data source-based data processing method for visual application creation
CN115374204A (en) * 2022-08-25 2022-11-22 中国共产党张店区纪律检查委员会机关 Technical scheme for efficiently tracking fund source and destination
CN117077640A (en) * 2023-10-16 2023-11-17 鲁班(北京)电子商务科技有限公司 Unified processing method and device for multi-type dynamic table data
CN117077640B (en) * 2023-10-16 2024-02-02 鲁班(北京)电子商务科技有限公司 Unified processing method and device for multi-type dynamic table data

Similar Documents

Publication Publication Date Title
CN113761202A (en) Optimization system for mapping unstructured financial Excel table to database
CN111737969B (en) Resume parsing method and system based on deep learning
CN107766371B (en) Text information classification method and device
US6721451B1 (en) Apparatus and method for reading a document image
CN111753099A (en) Method and system for enhancing file entity association degree based on knowledge graph
CN102375807B (en) Method and device for proofing characters
CN101079025B (en) File correlation computing system and method
CN113722490B (en) Visual rich document information extraction method based on key value matching relation
US9558295B2 (en) System for data extraction and processing
CN111209728B (en) Automatic labeling and inputting method for test questions
CN112016481B (en) OCR-based financial statement information detection and recognition method
CN111274239A (en) Test paper structuralization processing method, device and equipment
CN112232195B (en) Handwritten Chinese character recognition method, device and storage medium
CN114153978A (en) Model training method, information extraction method, device, equipment and storage medium
Colter et al. Tablext: A combined neural network and heuristic based table extractor
Tarride et al. Large-scale genealogical information extraction from handwritten Quebec parish records
CN111768820A (en) Paper medical record digitization and target detection model training method, device and storage medium
CN115936624A (en) Basic level data management method and device
CN118170907A (en) Document intelligent label system based on deep neural network and implementation method thereof
CN112347121B (en) Configurable natural language sql conversion method and system
CN113642291B (en) Method, system, storage medium and terminal for constructing logical structure tree reported by listed companies
CN113779218B (en) Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium
CN114419645A (en) Contract intelligent analysis method based on AI
KR20010051459A (en) A place name expressing dictionary generating method and its apparatus
CN115410185A (en) Method for extracting specific name and unit name attributes in multi-modal data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 100000 411d, floor 4, No. 31, Fuchengmenwai street, Xicheng District, Beijing

Applicant after: Beijing Kuaique Information Technology Co.,Ltd.

Address before: 201700 2nd floor, building 1, 485 pucang Road, Qingpu District, Shanghai

Applicant before: Shanghai Kuaique Information Technology Co.,Ltd.

Country or region before: China