CN113326676A - Deep learning model device for structuring financial text into form - Google Patents

Deep learning model device for structuring financial text into form

Info

Publication number
CN113326676A
Authority
CN
China
Prior art keywords
information
word
text
character
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110415793.3A
Other languages
Chinese (zh)
Inventor
周靖宇
景泳霖
袁阳平
邹鸿岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kuaique Information Technology Co ltd
Original Assignee
Shanghai Kuaique Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kuaique Information Technology Co ltd
Priority to CN202110415793.3A
Publication of CN113326676A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/151 - Transformation
    • G06F 40/157 - Transformation using dictionaries or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/166 - Editing, e.g. inserting or deleting
    • G06F 40/183 - Tabulation, i.e. one-dimensional positioning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A deep learning model device for structuring financial text into a table comprises the following technical scheme: step one, preprocessing: cleaning the data, segmenting the text into words, forming characters and words, and labeling the table rows; step two, vectorizing the words; step three, a character encoding layer; step four, a connection layer joining the character encoding and the word encoding; step five, predicting the column information; step six, preprocessing for the prediction of row information; step seven, predicting the row information; step eight, setting the total loss function. With this model, unstructured text is converted directly into table data, reaching the commercialization standard in the financial data field; compared with the Pipeline form, accuracy is improved by 3-5 percentage points, and the Pipeline's error-propagation problem is reduced.

Description

Deep learning model device for structuring financial text into form
Technical Field
The invention relates to the technical field of information extraction and conversion, in particular to a deep learning model device for structuring financial texts into tables.
Background
In natural language processing, common tasks are to classify text or to extract information from it. A further problem is to identify and extract structured information, such as tables, from documents.
In fields such as finance there are deeper technical requirements, for example the problem of converting unstructured text directly into a table. As shown in fig. 1, the bidding data is the bid-changing message of a primary (first-level) bid, which means: "cancel the 2.64% position on the bond [02XX city investment MTN002]; at the same time, change the 100 million of funds at the 2.78 benchmark into 4 million at the 2.83 benchmark and 6 million at the 2.96 benchmark." The problem is abstracted as structuring unstructured text information into table data. In terms of semantic understanding, this is not merely simple text classification or intent recognition: all the elements need to be put into one-to-one correspondence with several intents to form standard table data. This is a difficult problem in the current text-processing field and raises a series of technical issues.
There is currently no ready-made, unified technology for the problem of organizing text into a table. The main approach is to split it into multiple sub-tasks and handle the problem in a pipeline fashion. First, a text classification model classifies the overall intent of the message (in the example above, judging whether the user's intent is to bid, to change a bid, or to withdraw a bid). Next, information extraction is performed: named entity recognition (NER) is used to extract the elements in the text (in the example above, elements such as the bond name, the benchmark and the amount). Finally, the elements are arranged by a series of rules (for example, according to their positions before and after the word "change") and combined into the form of a table.
First, the Pipeline style of processing has the relatively large defect of error propagation. Structuring text into a table requires three models: one for intent classification, a second for element extraction, and a third for structuring the elements into a table, where the intermediate step must also decide the (uncertain) number of table rows. With good existing model algorithms the accuracy of each step is about 95%, so after the three models are chained in a Pipeline the final accuracy is only about 80%-85%. To reach commercial quality or to improve the accuracy, a series of rules and fault-tolerant designs are needed for correction. Second, because the Pipeline splits the work into several sub-tasks, the text is encoded separately at the bottom layer of each task, which wastes technical resources and lowers the efficiency of structuring; the related parameters also cannot be shared, which would otherwise improve prediction accuracy. The intent-judgment sub-task is itself a multi-level classification problem: the first level judges whether the message is a bid, a bid change or a bid withdrawal, and the second level, for the "change" intent, must further judge the elements before and after the change. Existing classification models cannot solve this classification goal well, and for structuring into a table there is no related deep learning algorithm model: the structuring logic rules are largely combed out manually and the elements are reordered by those rules. A rule-engine-based scheme requires a large amount of labor, the completeness of the rules cannot be guaranteed, and many cases cannot be covered. Moreover, because people express themselves in diverse ways, the rules cannot cover everything and may interfere and conflict with each other, so situations where fixing one case breaks another easily occur. Finally, the development and maintenance costs are high: whenever a new rule is added to the rules combed out earlier, one must consider whether the rule is effective and how it affects the previous rules; development and maintenance costs are extremely high.
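As a rough check of this figure: if each of the three chained models is about 95% accurate and the stages fail independently, the end-to-end accuracy is approximately 0.95 × 0.95 × 0.95 ≈ 0.857, i.e. about 86%, which is of the same order as the 80%-85% quoted above.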
Disclosure of Invention
The invention aims to provide a deep learning model device for structuring financial text into a table so as to solve the problems in the background art. The invention adopts the idea of a joint model: several tasks, such as text information extraction, table-relation judgment and deciding whether pieces of information form a column, are fused into one multi-task model, several sub-tasks are solved by a single model, and standard table-structure information is finally formed. First, the original text is cleaned and the table information is arranged into the data form used for model training. Second, a multi-level model structure is adopted to encode the text so as to extract the elements in the text. The extracted elements are then combined according to the table columns to form several candidate rows of information, and the elements of each row are classified after a second encoding to judge whether they constitute valid row information, so that the table structuring of the text is finally achieved. The invention constructs a set of multi-level, complex models and achieves the goal of structuring text into a table within one model; the learning and training process of the algorithm model is shown in fig. 2. The invention provides a multi-task neural network that converts unstructured text directly into table data through one model, reaching the commercialization standard in the financial data field; compared with the Pipeline form, accuracy is improved by 3-5 percentage points, and the Pipeline's error-propagation problem is reduced.
In order to achieve this purpose, the invention comprises the following technical scheme: step one, preprocessing: cleaning the data, segmenting the text into words, forming characters and words, and labeling the table rows; step two, vectorizing the words; step three, a character encoding layer; step four, a connection layer joining the character encoding and the word encoding; step five, predicting the column information; step six, preprocessing for the prediction of row information; step seven, predicting the row information; step eight, setting the total loss function.
Step one is preprocessing: data cleaning, in which irregular data are cleaned and replaced, for example full-width/half-width conversion and removal of special symbols such as emoticons. A multi-dimensional word-segmentation method is then established to segment the text information: the first dimension uses obvious delimiters such as spaces, commas, semicolons and tab characters to divide the text into short clauses; the second dimension uses regular expressions to extract elements such as characters and numbers in the text and divide the short clauses into medium-granularity tokens of characters and numbers; the third dimension uses jieba word segmentation to cut the characters and numbers at a finer granularity. Tokens of three granularities are thus formed, denoted word_c, word_m and word_s. For the token information of the three granularities, since table information is two-dimensional N x M information, the scheme splits the two-dimensional information into sub-tasks of two dimensions: the information in any cell is divided into a prediction of its column position and a prediction of its row position. The column position is associated with the column-name information, i.e. a named entity recognition task, and each element is labeled with its "column name" information. The labeling of row information is treated as a "0/1" classification problem: information that completely matches a table row is labeled "1", and information that does not match a table row is labeled "0".
Step two fuses the position information of the tokens on the basis of the three granularities word_c, word_m and word_s. First, word2vec (one possible choice, not a limitation) is adopted to vectorize the tokens of the different scales and obtain the vector feature of each token, and the position structure information of the tokens is then merged in. The scheme performs structural encoding of each token's position: whether the text has only one line or several lines, the position information of each token in the rows and columns of the text is constructed and represented by a connection matrix, defined as A[i,j] = 1 when two tokens are in the same vertical position or adjacent left and right, and A[i,j] = 0 otherwise. Since there are tokens of three different granularities, there are three different connection matrices A_c[i,j], A_m[i,j] and A_s[i,j]. GCN is adopted to train the vectorization of the token information; because each text segment has tokens of three different granularities, the following GCN formula is adopted:
H^(t+1) = σ( D̃^(-1/2) Ã D̃^(-1/2) H^(t) W^(t) )

wherein Ã = A + I, A is the connection (adjacency) matrix and I is the identity matrix; D̃ is the degree matrix of Ã, used to normalize Ã; H^(t) and H^(t+1) respectively represent the encoding of each node in the graph at layer t and layer t+1; W^(t) is the parameter to be learned; and H^(0) = X is the initial input, i.e. the three sets of token vectors. Encoding the three sets of token vectors with this GCN feature-extraction formula yields the vector encodings of the tokens at the three granularities, denoted H_c, H_m and H_s respectively.
Step three encodes the character layer: a pre-trained ALBERT model is adopted, and a BiLSTM layer is spliced on top of the character layer; the result serves as the embedding matrix TE.
Step four: the character encoding forms an encoding matrix TE for each character, and the vectorized tokens of the three granularities form the word encodings. The GAT algorithm is adopted to fuse the word encodings with the character encodings: the tokens are spliced directly after the characters. Assuming the character length is N and the number of tokens is M, an (N+M) x (N+M) adjacency matrix K is constructed, with K[i,j] = 1 when a token contains the information of a character and K[i,j] = 0 otherwise. Based on the three different segmentations, three neighbourhood matrices K_c, K_m and K_s are constructed, and the GAT algorithm is used to splice the word and character encodings. The GAT operation is as follows: the input of the t-th layer is a point set F_t = {f_1, f_2, ..., f_N} together with an adjacency matrix G; a multi-head GAT is used, and the main calculation formulas are:
α_ij^k = softmax_j( LeakyReLU( a^T [ W^k f_i || W^k f_j ] ) )

f'_i = ||_{k=1..K} σ( Σ_{j ∈ v_i} α_ij^k W^k f_j )

wherein f_i ∈ R^F represents the input feature of node i; f'_j ∈ R^(F') represents the output feature of node j; || represents the splicing (concatenation) operation; σ represents a nonlinear activation function; v_i represents the vertices adjacent to i; α_ij^k represents the weight of the edge connecting node i and node j under attention head k; W^k ∈ R^(F'×F) represents the linear transformation matrix used to linearly transform the features; and the two halves of the attention vector a are the weight parameters of the feed-forward neural network. The adjacency matrix G is used to mask the positions of α^k that correspond to unconnected node pairs.
The outputs of layers t = 1, 2, ..., N are obtained in turn, and at the last GAT layer the attention heads are averaged rather than concatenated to compute the final result:

f'_i = σ( (1/K) Σ_{k=1..K} Σ_{j ∈ v_i} α_ij^k W^k f_j )

From this formula, three different word-character fusion vector matrices Q_c, Q_m and Q_s are obtained.
Combining the vector matrices obtained above, the three fusion matrices are fused a second time with the character vectors, and the aggregation formula is: Z = W_1·H + W_2·Q_c + W_3·Q_m + W_4·Q_s, wherein W_1, W_2, W_3 and W_4 are the parameter matrices to be trained, H is the character encoding matrix, and Z is the final vector matrix formed for the characters.
Step five labels the text as a sequence: similar to a named entity recognition task, the characters of the text are labeled in BIO form, and the column information is trained with a cross-entropy loss function, which is defined as NER_loss.
Step six extracts the character vectors based on the result of the column-information prediction. Considering the requirements of the downstream task, the character information determined to be an entity is extracted; since words in Chinese have different lengths, the character vectors contained in each word are aggregated by their mean to form the basic vector information for predicting row information:

v_word = (1/n) Σ_{i=1..n} c_i

where c_1, ..., c_n are the character vectors contained in the word. A word vector is thus obtained for each column. The information of the columns is then combined in an editable way to form row information; this is an editable process. For a general field, the information of each column can be freely combined to form all possible rows: assuming there are n columns and M_i entities are extracted from a piece of text for the i-th column, SUM = M_1 * M_2 * ... * M_n combinations of row information are formed. Supplementary information: for a specific private domain, rules of that domain may be added when forming the row-information combinations, so that the rows formed comply with the rule requirements of the field; this is a freely editable module.
Step seven first encodes each word vector of the randomly combined rows; the vector formed for each word is used as a node vector of the Graph network. The scheme then applies the GAT operation again to encode and learn the column information inside each freely combined row; the operation is the same as in step four and differs only in the adjacency matrix G. Vector information R is thus formed for each row. In the training process, because the rows are randomly combined, a randomly combined row is labeled 1 when it appears in the annotated row information and 0 otherwise, which keeps it consistent with the preprocessed row information. By comparing the prediction for each random combination with its 0/1 label, the row information is trained with a cross-entropy loss function, which is defined as structure_loss.
Step eight weights the loss functions of the column task and the row task to obtain the total loss function Loss = NER_loss + α · structure_loss, which serves as the total loss function of the model, where α is an adjustable hyper-parameter; the model is trained based on this loss function, and the result of the model is finally obtained.
The working principle of the invention is as follows: the idea of a joint model is adopted; several tasks, such as text information extraction, table-relation judgment and deciding whether pieces of information form a column, are fused into one multi-task model; several sub-tasks are solved by a single model; and standard table-structure information is finally formed.
After the above technical scheme is adopted, the invention has the following beneficial effects: with this model, unstructured text is converted directly into table data, reaching the commercialization standard in the financial data field; compared with the Pipeline form, accuracy is improved by 3-5 percentage points, and the Pipeline's error-propagation problem is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of directly converting unstructured text into a table in the traditional financial field;
FIG. 2 is a schematic diagram of the learning and training process of the algorithm model of the present invention;
FIG. 3 is a diagram of the deep learning model architecture for the text structured into tables based on graph attention in accordance with the present invention.
Detailed Description
Referring to figs. 1 to 3, the present invention comprises the following steps: step one, preprocessing: cleaning the data, segmenting the text into words, forming characters and words, and labeling the table rows; step two, vectorizing the words; step three, a character encoding layer; step four, a connection layer joining the character encoding and the word encoding; step five, predicting the column information; step six, preprocessing for the prediction of row information; step seven, predicting the row information; step eight, setting the total loss function.
Further, step one is preprocessing: data cleaning, in which irregular data are cleaned and replaced, for example full-width/half-width conversion and removal of special symbols such as emoticons. A multi-dimensional word-segmentation method is then established to segment the text information: the first dimension uses obvious delimiters such as spaces, commas, semicolons and tab characters to divide the text into short clauses; the second dimension uses regular expressions to extract elements such as characters and numbers in the text and divide the short clauses into medium-granularity tokens of characters and numbers; the third dimension uses jieba word segmentation to cut the characters and numbers at a finer granularity. Tokens of three granularities are thus formed, denoted word_c, word_m and word_s. For the token information of the three granularities, since table information is two-dimensional N x M information, the scheme splits the two-dimensional information into sub-tasks of two dimensions: the information in any cell is divided into a prediction of its column position and a prediction of its row position. The column position is associated with the column-name information, i.e. a named entity recognition task, and each element is labeled with its "column name" information. The labeling of row information is treated as a "0/1" classification problem: information that completely matches a table row is labeled "1", and information that does not match a table row is labeled "0".
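As a rough illustration of this multi-granularity segmentation, the sketch below assumes the jieba library is available; the delimiters, the regular expression and the function name are illustrative examples rather than the exact rules of the patent.

```python
# Illustrative sketch of the three-granularity segmentation (assumed details).
import re
import jieba  # third-party Chinese word-segmentation library

def segment_three_granularities(text: str):
    # word_c: coarse clauses split on obvious delimiters (space, comma, semicolon, tab).
    word_c = [s for s in re.split(r"[ \t,，;；]+", text) if s]
    # word_m: medium-granularity tokens (runs of Chinese characters, numbers, percentages).
    word_m = re.findall(r"[\u4e00-\u9fa5]+|\d+(?:\.\d+)?%?", text)
    # word_s: fine-granularity tokens from jieba applied to the medium-granularity pieces.
    word_s = [t for piece in word_m for t in jieba.lcut(piece) if t.strip()]
    return word_c, word_m, word_s
```

Each cell of the labeled table is then aligned with these tokens: entities receive a "column name" tag and every candidate row receives a 0/1 label, as described above.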
Further, step two fuses the position information of the tokens on the basis of the three granularities word_c, word_m and word_s. First, word2vec (one possible choice, not a limitation) is adopted to vectorize the tokens of the different scales and obtain the vector feature of each token, and the position structure information of the tokens is then merged in. The scheme performs structural encoding of each token's position: whether the text has only one line or several lines, the position information of each token in the rows and columns of the text is constructed and represented by a connection matrix, defined as A[i,j] = 1 when two tokens are in the same vertical position or adjacent left and right, and A[i,j] = 0 otherwise. Since there are tokens of three different granularities, there are three different connection matrices A_c[i,j], A_m[i,j] and A_s[i,j]. GCN is adopted to train the vectorization of the token information; because each text segment has tokens of three different granularities, the following GCN formula is adopted:
H^(t+1) = σ( D̃^(-1/2) Ã D̃^(-1/2) H^(t) W^(t) )

wherein Ã = A + I, A is the connection (adjacency) matrix and I is the identity matrix; D̃ is the degree matrix of Ã, used to normalize Ã; H^(t) and H^(t+1) respectively represent the encoding of each node in the graph at layer t and layer t+1; W^(t) is the parameter to be learned; and H^(0) = X is the initial input, i.e. the three sets of token vectors. Encoding the three sets of token vectors with this GCN feature-extraction formula yields the vector encodings of the tokens at the three granularities, denoted H_c, H_m and H_s respectively.
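A minimal PyTorch sketch of this GCN step is given below; it assumes the word2vec vectors X and the connection matrices A_c, A_m and A_s have already been built, and all class and variable names are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H_next = relu(D̃^-1/2 (A + I) D̃^-1/2 H W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # parameter to be learned

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        A_tilde = A + torch.eye(A.size(0), device=A.device)        # Ã = A + I
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)                  # D̃^-1/2 (diagonal)
        A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
        return torch.relu(A_hat @ self.W(H))

# One stack per granularity, e.g. H_c = GCNLayer(d, d)(X_c, A_c),
# and similarly with A_m and A_s to obtain H_m and H_s.
```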
Further, step three encodes the character layer: a pre-trained ALBERT model is adopted, and a BiLSTM layer is spliced on top of the character layer; the result serves as the embedding matrix TE.
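A minimal sketch of this character encoder follows, assuming a pre-trained Chinese ALBERT checkpoint loaded through the transformers library; the checkpoint name and hidden size are assumptions made for illustration only.

```python
import torch.nn as nn
from transformers import AutoModel

class CharEncoder(nn.Module):
    def __init__(self, name: str = "voidful/albert_chinese_base", hidden: int = 256):
        super().__init__()
        self.albert = AutoModel.from_pretrained(name)        # pre-trained ALBERT
        self.bilstm = nn.LSTM(self.albert.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask):
        chars = self.albert(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state
        TE, _ = self.bilstm(chars)   # embedding matrix TE of each character
        return TE
```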
Further, in step four the character encoding forms an encoding matrix TE for each character, and the vectorized tokens of the three granularities form the word encodings. The GAT algorithm is adopted to fuse the word encodings with the character encodings: the tokens are spliced directly after the characters. Assuming the character length is N and the number of tokens is M, an (N+M) x (N+M) adjacency matrix K is constructed, with K[i,j] = 1 when a token contains the information of a character and K[i,j] = 0 otherwise. Based on the three different segmentations, three neighbourhood matrices K_c, K_m and K_s are constructed, and the GAT algorithm is used to splice the word and character encodings. The GAT operation is as follows: the input of the t-th layer is a point set F_t = {f_1, f_2, ..., f_N} together with an adjacency matrix G; a multi-head GAT is used, and the main calculation formulas are:
α_ij^k = softmax_j( LeakyReLU( a^T [ W^k f_i || W^k f_j ] ) )

f'_i = ||_{k=1..K} σ( Σ_{j ∈ v_i} α_ij^k W^k f_j )

wherein f_i ∈ R^F represents the input feature of node i; f'_j ∈ R^(F') represents the output feature of node j; || represents the splicing (concatenation) operation; σ represents a nonlinear activation function; v_i represents the vertices adjacent to i; α_ij^k represents the weight of the edge connecting node i and node j under attention head k; W^k ∈ R^(F'×F) represents the linear transformation matrix used to linearly transform the features; and the two halves of the attention vector a are the weight parameters of the feed-forward neural network. The adjacency matrix G is used to mask the positions of α^k that correspond to unconnected node pairs.
The outputs of layers t = 1, 2, ..., N are obtained in turn, and at the last GAT layer the attention heads are averaged rather than concatenated to compute the final result:

f'_i = σ( (1/K) Σ_{k=1..K} Σ_{j ∈ v_i} α_ij^k W^k f_j )

From this formula, three different word-character fusion vector matrices Q_c, Q_m and Q_s are obtained.
Combining the vector matrices obtained above, the three fusion matrices are fused a second time with the character vectors, and the aggregation formula is: Z = W_1·H + W_2·Q_c + W_3·Q_m + W_4·Q_s, wherein W_1, W_2, W_3 and W_4 are the parameter matrices to be trained, H is the character encoding matrix, and Z is the final vector matrix formed for the characters.
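To make the fusion concrete, the sketch below uses a single-head GAT layer that masks attention with the adjacency matrix and then applies the aggregation Z = W1·H + W2·Q_c + W3·Q_m + W4·Q_s; the patent describes a multi-head variant, and slicing the character rows out of the fused graph is an assumption made here only for shape compatibility.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # linear transformation W^k
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # feed-forward attention weights

    def forward(self, F_in: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
        h = self.W(F_in)                                   # (N+M, out_dim) node features
        n = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pair).squeeze(-1))         # attention logits
        e = e.masked_fill(G == 0, -1e9)                    # mask positions not connected in G
        alpha = torch.softmax(e, dim=-1)
        return F.elu(alpha @ h)

class Fusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gat = GATLayer(dim, dim)
        self.W = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(4)])

    def forward(self, H, nodes_c, K_c, nodes_m, K_m, nodes_s, K_s):
        # nodes_* concatenate the N character vectors with the M word vectors;
        # Q_* keep only the character rows of the fused graphs (assumed choice).
        N = H.size(0)
        Q_c = self.gat(nodes_c, K_c)[:N]
        Q_m = self.gat(nodes_m, K_m)[:N]
        Q_s = self.gat(nodes_s, K_s)[:N]
        return self.W[0](H) + self.W[1](Q_c) + self.W[2](Q_m) + self.W[3](Q_s)  # Z
```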
Furthermore, step five labels the text as a sequence: similar to a named entity recognition task, the characters of the text are labeled in BIO form, and the column information is trained with a cross-entropy loss function, which is defined as NER_loss.
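As an illustration of this column-prediction head, the sketch below classifies every fused character vector into BIO tags over an assumed set of column names and computes NER_loss with cross-entropy; the tag scheme and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_COLUMNS = 3     # e.g. bond name, benchmark, amount (assumed column set)
CHAR_DIM = 512      # dimension of the fused character vectors Z (assumed)

bio_head = nn.Linear(CHAR_DIM, 2 * NUM_COLUMNS + 1)   # B-/I- per column plus O

def ner_loss(Z: torch.Tensor, bio_labels: torch.Tensor) -> torch.Tensor:
    # Z: (seq_len, CHAR_DIM) fused character matrix; bio_labels: (seq_len,) tag indices
    logits = bio_head(Z)
    return F.cross_entropy(logits, bio_labels)
```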
Further, step six extracts the character vectors based on the result of the column-information prediction. Considering the requirements of the downstream task, the character information determined to be an entity is extracted; since words in Chinese have different lengths, the character vectors contained in each word are aggregated by their mean to form the basic vector information for predicting row information:

v_word = (1/n) Σ_{i=1..n} c_i

where c_1, ..., c_n are the character vectors contained in the word. A word vector is thus obtained for each column. The information of the columns is then combined in an editable way to form row information; this is an editable process. For a general field, the information of each column can be freely combined to form all possible rows: assuming there are n columns and M_i entities are extracted from a piece of text for the i-th column, SUM = M_1 * M_2 * ... * M_n combinations of row information are formed. Supplementary information: for a specific private domain, rules of that domain may be added when forming the row-information combinations, so that the rows formed comply with the rule requirements of the field; this is a freely editable module.
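A sketch of this preprocessing step: each extracted entity's character vectors are mean-pooled into one word vector, and the candidate rows are enumerated as the Cartesian product of the per-column entity lists (M_1 * M_2 * ... * M_n combinations); the column names used are hypothetical.

```python
from itertools import product
import torch

def entity_vector(char_vectors: torch.Tensor) -> torch.Tensor:
    # char_vectors: (n_chars, dim) vectors of one extracted entity -> mean-pooled word vector
    return char_vectors.mean(dim=0)

def candidate_rows(columns: dict):
    # columns maps a column name to the list of entity vectors found for it,
    # e.g. {"bond": [...], "benchmark": [...], "amount": [...]} (names assumed).
    names = list(columns)
    for combo in product(*(columns[n] for n in names)):   # M_1 * ... * M_n candidates
        yield dict(zip(names, combo))
```

For a private domain, the free product above can be filtered or replaced by the domain rules mentioned in the text.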
Further, step seven first encodes each word vector of the randomly combined rows; the vector formed for each word is used as a node vector of the Graph network. The scheme then applies the GAT operation again to encode and learn the column information inside each freely combined row; the operation is the same as in step four and differs only in the adjacency matrix G. Vector information R is thus formed for each row. In the training process, because the rows are randomly combined, a randomly combined row is labeled 1 when it appears in the annotated row information and 0 otherwise, which keeps it consistent with the preprocessed row information. By comparing the prediction for each random combination with its 0/1 label, the row information is trained with a cross-entropy loss function, which is defined as structure_loss.
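For the row prediction, a simplified scorer is sketched below: the GAT-encoded column vectors of one candidate row are pooled into the row vector R and classified as a valid (1) or invalid (0) table row; the pooling choice and layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RowScorer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, row_nodes: torch.Tensor) -> torch.Tensor:
        # row_nodes: (n_columns, dim) column vectors of one candidate row after GAT encoding
        R = row_nodes.mean(dim=0)        # row vector R
        return self.ff(R)                # logits for the 0/1 "valid row" label
```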
Further, step eight weights the loss functions of the column task and the row task to obtain the total loss function Loss = NER_loss + α · structure_loss, which serves as the total loss function of the model, where α is an adjustable hyper-parameter; the model is trained based on this loss function, and the result of the model is finally obtained.
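Finally, the joint objective can be written in a few lines; the value of α below is only an example, since the patent leaves it adjustable.

```python
import torch.nn.functional as F

ALPHA = 0.5   # adjustable hyper-parameter (example value, not from the patent)

def total_loss(ner_logits, ner_labels, row_logits, row_labels):
    ner_loss = F.cross_entropy(ner_logits, ner_labels)         # column task (step five)
    structure_loss = F.cross_entropy(row_logits, row_labels)   # row task (step seven)
    return ner_loss + ALPHA * structure_loss                   # Loss = NER_loss + α·structure_loss
```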
Furthermore, the invention adopts jieba word segmentation with an added feature dictionary for multi-granularity segmentation and word2vec for vectorization, but other word-vectorization and word-segmentation methods, as well as new techniques that appear in the future, may also be used. The present application tabulates financial data, but the technique is not limited to financial data and can be applied to any other task that needs to structure a piece of text into a table.
The working principle of the invention is as follows: the idea of a joint model is adopted; several tasks, such as text information extraction, table-relation judgment and deciding whether pieces of information form a column, are fused into one multi-task model; several sub-tasks are solved by a single model; and standard table-structure information is finally formed.
After the above technical scheme is adopted, the invention has the following beneficial effects: with this model, unstructured text is converted directly into table data, reaching the commercialization standard in the financial data field; compared with the Pipeline form, accuracy is improved by 3-5 percentage points, and the Pipeline's error-propagation problem is reduced. The above description only illustrates the technical solutions of the present invention and is not intended to limit them; other modifications or equivalent substitutions made by those skilled in the art to the technical solutions of the present invention, without departing from the spirit and scope of those solutions, should all be covered by the scope of the claims of the present invention.

Claims (8)

1. A deep learning model device for structuring financial text into a table, characterized by comprising the following steps: step one, preprocessing: cleaning the data, segmenting the text into words, forming characters and words, and labeling the table rows; step two, vectorizing the words; step three, a character encoding layer; step four, a connection layer joining the character encoding and the word encoding; step five, predicting the column information; step six, preprocessing for the prediction of row information; step seven, predicting the row information; step eight, setting the total loss function.
2. The deep learning model device for structuring financial text into a table according to claim 1, characterized in that: step one is preprocessing: data cleaning, in which irregular data are cleaned and replaced, for example full-width/half-width conversion and removal of special symbols such as emoticons; a multi-dimensional word-segmentation method is established to segment the text information, in which the first dimension uses obvious delimiters such as spaces, commas, semicolons and tab characters to divide the text into short clauses, the second dimension uses regular expressions to extract elements such as characters and numbers in the text and divide the short clauses into medium-granularity tokens of characters and numbers, and the third dimension uses jieba word segmentation to cut the characters and numbers at a finer granularity, thereby forming tokens of three granularities, denoted word_c, word_m and word_s; for the token information of the three granularities, since table information is two-dimensional N x M information, the scheme splits the two-dimensional information into sub-tasks of two dimensions: the information in any cell is divided into a prediction of its column position and a prediction of its row position, the column position is associated with the column-name information, i.e. a named entity recognition task, and each element is labeled with its "column name" information; the labeling of row information is treated as a "0/1" classification problem, where information that completely matches a table row is labeled "1" and information that does not match a table row is labeled "0".
3. The deep learning model device for structuring financial text into a table according to claim 1, characterized in that: step two fuses the position information of the tokens on the basis of the three granularities word_c, word_m and word_s; first, word2vec (one possible choice, not a limitation) is adopted to vectorize the tokens of the different scales and obtain the vector feature of each token, and the position structure information of the tokens is then merged in; the scheme performs structural encoding of each token's position: whether the text has only one line or several lines, the position information of each token in the rows and columns of the text is constructed and represented by a connection matrix, defined as A[i,j] = 1 when two tokens are in the same vertical position or adjacent left and right, and A[i,j] = 0 otherwise; since there are tokens of three different granularities, there are three different connection matrices A_c[i,j], A_m[i,j] and A_s[i,j]; GCN is adopted to train the vectorization of the token information, and because each text segment has tokens of three different granularities, the following GCN formula is adopted:
H^(t+1) = σ( D̃^(-1/2) Ã D̃^(-1/2) H^(t) W^(t) )

wherein Ã = A + I, A is the connection (adjacency) matrix and I is the identity matrix; D̃ is the degree matrix of Ã, used to normalize Ã; H^(t) and H^(t+1) respectively represent the encoding of each node in the graph at layer t and layer t+1; W^(t) is the parameter to be learned; and H^(0) = X is the initial input, i.e. the three sets of token vectors; encoding the three sets of token vectors with this GCN feature-extraction formula yields the vector encodings of the tokens at the three granularities, denoted H_c, H_m and H_s respectively.
4. The deep learning model device for structuring financial text into a table according to claim 1, characterized in that: step three encodes the character layer: a pre-trained ALBERT model is adopted, and a BiLSTM layer is spliced on top of the character layer, the result serving as the embedding matrix TE.
5. The deep learning model device for structuring financial text into a table according to claim 1, characterized in that: in step four the character encoding forms an encoding matrix TE for each character, and the vectorized tokens of the three granularities form the word encodings; the GAT algorithm is adopted to fuse the word encodings with the character encodings, the tokens being spliced directly after the characters; assuming the character length is N and the number of tokens is M, an (N+M) x (N+M) adjacency matrix K is constructed, with K[i,j] = 1 when a token contains the information of a character and K[i,j] = 0 otherwise; based on the three different segmentations, three neighbourhood matrices K_c, K_m and K_s are constructed, and the GAT algorithm is used to splice the word and character encodings; the GAT operation is as follows: the input of the t-th layer is a point set F_t = {f_1, f_2, ..., f_N} together with an adjacency matrix G; a multi-head GAT is used, and the main calculation formulas are:
α_ij^k = softmax_j( LeakyReLU( a^T [ W^k f_i || W^k f_j ] ) )

f'_i = ||_{k=1..K} σ( Σ_{j ∈ v_i} α_ij^k W^k f_j )

wherein f_i ∈ R^F represents the input feature of node i; f'_j ∈ R^(F') represents the output feature of node j; || represents the splicing (concatenation) operation; σ represents a nonlinear activation function; v_i represents the vertices adjacent to i; α_ij^k represents the weight of the edge connecting node i and node j under attention head k; W^k ∈ R^(F'×F) represents the linear transformation matrix used to linearly transform the features; and the two halves of the attention vector a are the weight parameters of the feed-forward neural network; the adjacency matrix G is used to mask the positions of α^k that correspond to unconnected node pairs.
The outputs of layers t = 1, 2, ..., N are obtained in turn, and at the last GAT layer the attention heads are averaged rather than concatenated to compute the final result:

f'_i = σ( (1/K) Σ_{k=1..K} Σ_{j ∈ v_i} α_ij^k W^k f_j )

From this formula, three different word-character fusion vector matrices Q_c, Q_m and Q_s are obtained.
Combining the vector matrices obtained above, the three fusion matrices are fused a second time with the character vectors, and the aggregation formula is:

Z = W_1·H + W_2·Q_c + W_3·Q_m + W_4·Q_s

wherein W_1, W_2, W_3 and W_4 are the parameter matrices to be trained, H is the character encoding matrix, and Z is the final vector matrix formed for the characters.
6. The deep learning model device for structuring financial text into a table according to claim 1, characterized in that: step five labels the text as a sequence: similar to a named entity recognition task, the characters of the text are labeled in BIO form, and the column information is trained with a cross-entropy loss function, which is defined as NER_loss.
7. The deep learning model device for structuring financial text into a table according to claim 1, characterized in that: step six extracts the character vectors based on the result of the column-information prediction; considering the requirements of the downstream task, the character information determined to be an entity is extracted, and since words in Chinese have different lengths, the character vectors contained in each word are aggregated by their mean to form the basic vector information for predicting row information:

v_word = (1/n) Σ_{i=1..n} c_i

where c_1, ..., c_n are the character vectors contained in the word; a word vector is thus obtained for each column; the information of the columns is then combined in an editable way to form row information; for a general field, the information of each column can be freely combined to form all possible rows: assuming there are n columns and M_i entities are extracted from a piece of text for the i-th column, SUM = M_1 * M_2 * ... * M_n combinations of row information are formed; supplementary information: for a specific private domain, rules of that domain may be added when forming the row-information combinations, so that the rows formed comply with the rule requirements of the field; this is a freely editable module.
8. The deep learning model device for structuring financial text into a table according to claim 1, characterized in that: step seven first encodes each word vector of the randomly combined rows, the vector formed for each word being used as a node vector of the Graph network; the scheme then applies the GAT operation again to encode and learn the column information inside each freely combined row, the operation being the same as in step four and differing only in the adjacency matrix G; vector information R is thus formed for each row; in the training process, because the rows are randomly combined, a randomly combined row is labeled 1 when it appears in the annotated row information and 0 otherwise, which keeps it consistent with the preprocessed row information; by comparing the prediction for each random combination with its 0/1 label, the row information is trained with a cross-entropy loss function, which is defined as structure_loss; step eight weights the loss functions of the column task and the row task to obtain the total loss function Loss = NER_loss + α · structure_loss as the total loss function of the model, where α is an adjustable hyper-parameter; the model is trained based on this loss function, and the result of the model is finally obtained.
CN202110415793.3A 2021-04-19 2021-04-19 Deep learning model device for structuring financial text into form Pending CN113326676A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110415793.3A CN113326676A (en) 2021-04-19 2021-04-19 Deep learning model device for structuring financial text into form

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110415793.3A CN113326676A (en) 2021-04-19 2021-04-19 Deep learning model device for structuring financial text into form

Publications (1)

Publication Number Publication Date
CN113326676A true CN113326676A (en) 2021-08-31

Family

ID=77414835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110415793.3A Pending CN113326676A (en) 2021-04-19 2021-04-19 Deep learning model device for structuring financial text into form

Country Status (1)

Country Link
CN (1) CN113326676A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761131A (en) * 2021-09-07 2021-12-07 上海快确信息科技有限公司 Deep learning model device for structuring text into form

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN110334354A (en) * 2019-07-11 2019-10-15 清华大学深圳研究生院 A kind of Chinese Relation abstracting method
CN110472235A (en) * 2019-07-22 2019-11-19 北京航天云路有限公司 A kind of end-to-end entity relationship joint abstracting method towards Chinese text
US20200065374A1 (en) * 2018-08-23 2020-02-27 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
US20200090034A1 (en) * 2018-09-18 2020-03-19 Salesforce.Com, Inc. Determining Intent from Unstructured Input to Update Heterogeneous Data Stores
CN111309915A (en) * 2020-03-03 2020-06-19 爱驰汽车有限公司 Method, system, device and storage medium for training natural language of joint learning
US20210011974A1 (en) * 2019-07-12 2021-01-14 Adp, Llc Named-entity recognition through sequence of classification using a deep learning neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200065374A1 (en) * 2018-08-23 2020-02-27 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
US20200090034A1 (en) * 2018-09-18 2020-03-19 Salesforce.Com, Inc. Determining Intent from Unstructured Input to Update Heterogeneous Data Stores
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN110334354A (en) * 2019-07-11 2019-10-15 清华大学深圳研究生院 A kind of Chinese Relation abstracting method
US20210011974A1 (en) * 2019-07-12 2021-01-14 Adp, Llc Named-entity recognition through sequence of classification using a deep learning neural network
CN110472235A (en) * 2019-07-22 2019-11-19 北京航天云路有限公司 A kind of end-to-end entity relationship joint abstracting method towards Chinese text
CN111309915A (en) * 2020-03-03 2020-06-19 爱驰汽车有限公司 Method, system, device and storage medium for training natural language of joint learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DHRUV GUPTA ET AL.: "Weaving Text into Tables", PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT (CIKM ’20), 23 October 2020 (2020-10-23), pages 3401 - 3404, XP059105897, DOI: 10.1145/3340531.3417442 *
PANKAJ GUPTA ET AL.: "Table Filling Multi-Task Recurrent Neural Network for Joint Entity and Relation Extraction", PROCEEDINGS OF COLING 2016, THE 26TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS: TECHNICAL PAPERS, 17 December 2016 (2016-12-17), pages 2537 - 2547, XP055737003 *
任鹏程 et al.: "Joint Entity-Relation Extraction with a Dependency-Constrained Graph Network" (依存约束的图网络实体关系联合抽取), Computer Systems & Applications (计算机系统应用), vol. 30, no. 3, 15 March 2021 (2021-03-15), pages 23 - 32 *
王晓霞 et al.: "Relation Extraction Model Based on Attention and Graph Convolutional Networks" (基于注意力与图卷积网络的关系抽取模型), Journal of Computer Applications (计算机应用), vol. 41, no. 2, 10 February 2021 (2021-02-10), pages 350 - 356 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761131A (en) * 2021-09-07 2021-12-07 上海快确信息科技有限公司 Deep learning model device for structuring text into form

Similar Documents

Publication Publication Date Title
Aguilar et al. A multi-task approach for named entity recognition in social media data
Saad et al. Twitter sentiment analysis based on ordinal regression
CN110020438B (en) Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN110222188A (en) A kind of the company's bulletin processing method and server-side of multi-task learning
CN112434535B (en) Element extraction method, device, equipment and storage medium based on multiple models
CN115688776B (en) Relation extraction method for Chinese financial text
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN113255321A (en) Financial field chapter-level event extraction method based on article entity word dependency relationship
CN111859983A (en) Natural language labeling method based on artificial intelligence and related equipment
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN116383399A (en) Event public opinion risk prediction method and system
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN113869055A (en) Power grid project characteristic attribute identification method based on deep learning
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN115292490A (en) Analysis algorithm for policy interpretation semantics
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure
CN112948588B (en) Chinese text classification method for quick information editing
CN113869054A (en) Deep learning-based electric power field project feature identification method
CN112231476B (en) Improved graphic neural network scientific literature big data classification method
CN113326676A (en) Deep learning model device for structuring financial text into form
CN113761131A (en) Deep learning model device for structuring text into form
CN111666375A (en) Matching method of text similarity, electronic equipment and computer readable medium
CN111259106A (en) Relation extraction method combining neural network and feature calculation
CN116108127A (en) Document level event extraction method based on heterogeneous graph interaction and mask multi-head attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination