CN112329767A - Contract text image key information extraction system and method based on joint pre-training - Google Patents

Contract text image key information extraction system and method based on joint pre-training

Info

Publication number
CN112329767A
CN112329767A (application CN202011106010.5A)
Authority
CN
China
Prior art keywords
training
image
text
layer
model
Prior art date
Legal status
Pending
Application number
CN202011106010.5A
Other languages
Chinese (zh)
Inventor
杨威
Current Assignee
Fangzheng Zhushi Wuhan Technology Development Co ltd
Original Assignee
Fangzheng Zhushi Wuhan Technology Development Co ltd
Priority date
Filing date
Publication date
Application filed by Fangzheng Zhushi Wuhan Technology Development Co ltd filed Critical Fangzheng Zhushi Wuhan Technology Development Co ltd
Priority to CN202011106010.5A
Publication of CN112329767A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Character Input (AREA)

Abstract

The invention relates to a system and method for extracting key information from contract text images based on joint pre-training. The system comprises a pre-training model and a training model. The pre-training model is obtained by inputting a plurality of contract text images and training on a pre-training task that includes text prediction based on image positions. The training model is obtained by inputting a plurality of contract text images annotated with the positions of the information to be extracted; its training task includes information extraction using the pre-training model. The contract text image to be analyzed is input into the trained model, which outputs the positions and characters of the predefined information to be extracted. The added pre-training task fuses image features with a character prediction task, so the model learns more prior knowledge; because no data needs to be annotated in the pre-training stage, a large amount of manual labor is saved, and the accuracy of information extraction is higher.

Description

Contract text image key information extraction system and method based on joint pre-training
Technical Field
The invention relates to the field of text image information extraction, and in particular to a contract text image key information extraction system and method based on joint pre-training.
Background
Extraction of key information from contract text images refers to using certain methods to extract, from a scanned contract or contract image, the key information that the user is interested in, such as the contracting entities "Party A" and "Party B", the signing date of the contract, and the contract amount.
Nowadays, many companies still rely on the traditional approach of having staff manually search a commercial contract, entity by entity, for everything to be extracted, including "Party A, Party B, contract date, contract amount" and the like. This is not only time-consuming but also carries a heavy labor cost.
On the other hand, there are also many companies that try to extract key information from the contract text using an automated extraction method.
OCR (optical character recognition) is currently widely used in many fields, such as handwritten character recognition, key information recognition from photographs (e.g., bank cards and identification cards), and character recognition in contract text images. Meanwhile, with the rapid development of deep learning, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and attention mechanisms have been successfully applied to OCR.
For example, Chinese patent application publication No. CN110458162A discloses an end-to-end method for automatically extracting text information based on deep learning (convolutional neural networks): a model is trained directly with a deep learning algorithm on a large amount of labeled training data, after which it can be used in actual recognition work.
However, whether the deep learning algorithm is based on CNN, Faster R-CNN, Mask R-CNN, or GCN, automatic extraction of contract key information with these methods has two obvious shortcomings:
1. They require a large amount of manually labeled training data, i.e., the exact positions of entities such as Party A, Party B, contract date, and contract amount must be annotated in every contract, which is expensive and results in a long project cycle.
2. The model spends a large amount of its capacity learning the structure, layout, and positional relationships of contract text, so the labeled data does not deliver its full supervisory effect, ultimately leading to a less than ideal recognition result.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides a contract text image key information extraction system and method based on joint pre-training to solve them.
The technical scheme for solving the above technical problems is as follows: a contract text image key information extraction system based on joint pre-training, comprising: a pre-training model and a training model;
The pre-training model is obtained by inputting a plurality of contract text images and performing pre-training task training, wherein the pre-training task comprises text prediction based on image positions;
The training model is obtained by inputting a plurality of contract text images annotated with the positions of the information to be extracted and performing training task training, wherein the training task comprises information extraction using the pre-training model;
The contract text image to be detected is input into the trained model to obtain the positions and characters of the predefined information to be extracted.
A contract text image key information extraction method based on joint pre-training comprises the following steps:
Step 1, defining a pre-training model and a pre-training task of text prediction based on image positions, inputting a plurality of contract text images into the pre-training model, calculating an objective function according to the pre-training task, and then updating the parameters of the pre-training model;
Step 2, defining a training model and a training task of information extraction using the pre-training model, inputting a plurality of contract text images annotated with the positions of the information to be extracted into the training model, calculating an objective function according to the training task, and then updating the parameters of the training model;
Step 3, inputting the contract text image to be detected into the trained model to obtain the positions and characters of the predefined information to be extracted.
The invention has the beneficial effects that: a pre-training task is added to the prior-art information extraction methods based on deep learning, and this pre-training task fuses image features with a character prediction task, so the model learns more prior knowledge; because no data needs to be annotated in the pre-training stage, a large amount of manual labor is saved. Compared with a single pre-training mode based on text alone or images alone, the joint image-and-text pre-training mode achieves a better effect and higher information extraction accuracy.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the pre-training model comprises: an image preprocessing module I, a text and image embedding layer, an attention layer, and a loss function layer I;
The image preprocessing module I performs character recognition on an input contract text image through an OCR tool and acquires the position of each character;
The text and image embedding layer embeds each character into a text and image embedding vector according to the position of the character;
The attention layer comprises a plurality of layers; the text and image embedding layer serves as the input of the first layer, and each layer, after its attention mechanism operation, is output to the next layer;
The loss function layer I computes a loss function and updates the parameters of the pre-training model accordingly.
Further, the training model comprises: an image preprocessing module II, a pre-training layer, and a loss function layer II;
The image preprocessing module II performs character recognition, through an OCR tool, on the input contract text images annotated with the positions of the information to be extracted, and acquires the position of each character;
The pre-training layer comprises the text and image embedding layer and the attention layer of the trained pre-training model, and performs pre-training on the contract text images annotated with the positions of the information to be extracted;
The inputs of the loss function layer II are the labels predicted by the training model for the extracted information and the ground-truth labels from the training set; the parameters of the training model are updated according to the comparison between the predicted and ground-truth labels.
Further, the process by which the image preprocessing modules I and II acquire the position of each character comprises:
acquiring the horizontal and vertical coordinates of the upper-left and lower-right corners of each character's minimum image block, and arranging the characters into rows according to the order of these horizontal and vertical coordinates.
Further, the text and image embedding vector comprises a text embedding layer and a 2-D coordinate embedding layer, the 2-D coordinate embedding layer comprising: an embedding layer for the upper-left horizontal coordinate, an embedding layer for the upper-left vertical coordinate, an embedding layer for the lower-right horizontal coordinate, and an embedding layer for the lower-right vertical coordinate of each character's minimum image block.
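A minimal sketch of such an embedding layer, assuming PyTorch; the vocabulary size, coordinate range, and 768-dimensional vectors are illustrative choices (768 matches the dimension mentioned later in the description). Characters that share a coordinate value share the corresponding embedding vector.

```python
import torch.nn as nn

class TextAndImageEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, max_coord=1000, dim=768):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)  # text embedding layer
        self.x1_emb = nn.Embedding(max_coord, dim)     # upper-left horizontal
        self.y1_emb = nn.Embedding(max_coord, dim)     # upper-left vertical
        self.x2_emb = nn.Embedding(max_coord, dim)     # lower-right horizontal
        self.y2_emb = nn.Embedding(max_coord, dim)     # lower-right vertical

    def forward(self, token_ids, boxes):
        # boxes: (batch, seq_len, 4) integer coordinates x1, y1, x2, y2
        x1, y1, x2, y2 = boxes.unbind(-1)
        return (self.text_emb(token_ids)
                + self.x1_emb(x1) + self.y1_emb(y1)
                + self.x2_emb(x2) + self.y2_emb(y2))
```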
Further, the attention mechanism adopts a multi-head self-attention mechanism or an ordinary self-attention mechanism;
When the ordinary self-attention mechanism is adopted, it is computed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where Q, K and V are all tensors, d_k denotes the size of the last dimension of the tensor K, and T denotes the transpose operation.
Further, the loss function is calculated as follows: given the 2-D coordinates of each character's minimum image block, the loss is computed with a BERT-based MLM task.
Further, training the pre-training model in step 1 comprises:
performing character recognition on an input contract text image through an OCR tool, and acquiring the position of each character;
embedding each character into a text and image embedding vector according to the position of the character;
designing an attention layer with a multi-layer structure, taking the text and image embedding layer as the input of the first layer of the attention layer, each layer being output to the next layer after its attention mechanism operation;
and computing a loss function and updating the parameters of the pre-training model accordingly.
Further, training the training model in step 2 comprises:
performing character recognition, through an OCR tool, on the input contract text images annotated with the positions of the information to be extracted, and acquiring the position of each character;
pre-training the annotated contract text images with the pre-training layer, where the pre-training layer consists of the text and image embedding layer and the attention layer of the trained pre-training model;
and updating the parameters of the training model according to the comparison between the predicted and ground-truth labels.
The beneficial effect of adopting the further schemes is that: the pre-training task lets the pre-training model and the training model learn language features and word meanings while also learning features specific to contracts, such as typesetting rules, layout, and visual information, so the pre-trained model performs markedly better on the key information extraction task of the next stage.
Drawings
FIG. 1 is a block diagram of a contract text image key information extraction system based on joint pre-training according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of a pre-training model of a contract text image key information extraction system provided in the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a training model of a contract text image key information extraction system provided by the present invention;
FIG. 4 is a flowchart of an embodiment of a contract text image key information extraction method based on joint pre-training according to the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
The invention provides a contract text image key information extraction system based on joint pre-training. Fig. 1 is a structural block diagram of this system; as can be seen from Fig. 1, the system comprises: a pre-training model and a training model.
The pre-training model is obtained by inputting a plurality of contract text images and performing pre-training task training, wherein the pre-training task comprises text prediction based on image positions.
The training model is obtained by inputting a plurality of contract text images marked with positions for extracting information to train a training task, wherein the training task comprises information extraction by utilizing a pre-training model.
The contract text image to be detected is input into the trained training model to obtain the positions and characters of the predefined information to be extracted.
Specifically, the predefined extraction information may be words or characters set according to requirements.
Unlike deep-learning-based OCR technology, which recognizes key information of contract text images with only a single training stage, the invention not only uses a deep learning algorithm but, most importantly, employs joint pre-training based on images and text.
Many current solutions use a pre-trained model, such as the BERT model pre-trained on natural language and the Mask R-CNN model pre-trained on images.
Taking a natural language pre-training model as an example: in the pre-training stage, training can be completed on plain text without a large amount of manually labeled data. In this stage the model learns the rules of the language and the semantics of words, and after this learning it performs well on the downstream extraction task.
Similarly, in the field of image pre-training, pre-training an image model lets it master the regularities and visual characteristics of common images, which in turn improves its performance on the downstream extraction task.
The invention uses a pre-training technique that combines images and language text. The aim is for the model to learn language features and word meanings while also learning features specific to contracts, such as typesetting rules, layout, and visual information; the pre-trained model is then clearly superior on the key information extraction task of the next stage.
Taking contract information extraction as an example: given the keyword "contract amount" in a contract, the corresponding amount value is very likely to appear to the right of or below the characters "contract amount" and is unlikely to appear to their left or above. The model can learn this important regularity in the pre-training stage.
In addition, in special texts such as contracts, the visual features of the text also carry information, such as whether the text is bold, underlined, or italicized, which is likewise important for the information extraction task.
The invention not only adds a pre-training task to the prior-art information extraction methods based on deep learning, but this task also fuses image features with a character prediction task, so the model learns more prior knowledge; because no data needs to be annotated in the pre-training stage, a large amount of manual labor is saved. Compared with a single pre-training mode based on text alone or images alone, the joint image-and-text pre-training mode achieves a better effect and higher information extraction accuracy.
Example 1
Embodiment 1 provided by the present invention is an embodiment of a contract text image key information extraction system based on joint pre-training, which comprises: a pre-training model and a training model.
The pre-training model is obtained by inputting a plurality of contract text images and performing pre-training task training, wherein the pre-training task comprises text prediction based on image positions.
Preferably, Fig. 2 is a schematic structural diagram of an embodiment of the pre-training model of the contract text image key information extraction system provided by the present invention; as can be seen from Fig. 2, the pre-training model comprises: an image preprocessing module I, a text and image embedding layer, an attention layer, and a loss function layer I.
And the first image preprocessing module performs character recognition on the input contract text image through an OCR tool and acquires the position of each character.
The text and image embedding layer embeds each character into a text and image embedding vector according to the position of the character; specifically, the text and image embedding vectors live in a high-dimensional mathematical vector space.
The attention layer comprises a plurality of layers, the text and image embedding layer is used as input of the first layer, and each layer is output to the next layer after the attention mechanism operation.
Preferably, the attention mechanism may take many forms, such as the common multi-head self-attention mechanism or an ordinary self-attention mechanism.
When the ordinary self-attention mechanism is adopted, it is computed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where Q, K and V are all tensors, d_k denotes the size of the last dimension of the tensor K, and T denotes the transpose operation.
Specifically, the number of attention layers N may be chosen as, for example, 6 or 12; once determined, it remains unchanged throughout the whole process.
The loss function layer I computes a loss function LOSS and updates the parameters of the pre-training model.
The purpose of computing the loss function LOSS is to drive updates of the parameters in the pre-training model (e.g., the text and image embedding vectors).
Preferably, the loss function LOSS can be computed in various ways. Specifically: given the 2-D coordinates of each character's minimum image block, the loss is calculated with the MLM (masked language model) task of BERT (Bidirectional Encoder Representations from Transformers). BERT's pre-training comprises two objectives; the MLM objective masks a certain proportion of the tokens in a sequence and trains the model to successfully predict the tokens at the masked positions.
For example, given a text sequence, suppose the MLM task masks the 3rd character. The model must predict that character from the feature vectors of the text image together with the text feature vectors, finally outputting a prediction distribution P(w3 | w1, w2, w4, ..., wn, image_vecs), where w1, w2, ..., wn denote the characters and image_vecs denotes the image encoding vectors of the minimum image blocks of the n characters; the image encoding model adopts Faster R-CNN.
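The following sketch shows how such a masked-prediction loss could be computed from the fused features; the tensor shapes and the `vocab_proj` projection layer are assumptions for illustration:

```python
import torch.nn.functional as F

def mlm_loss(hidden, token_ids, mask_positions, vocab_proj):
    # hidden:         (batch, seq_len, dim) attention-layer outputs, already
    #                 fused with text and 2-D coordinate (and image) features
    # token_ids:      (batch, seq_len) true character ids
    # mask_positions: (batch, seq_len) bool, True where a character was masked
    # vocab_proj:     nn.Linear(dim, vocab_size) projecting to the vocabulary,
    #                 giving P(w_i | the unmasked context, image_vecs)
    logits = vocab_proj(hidden[mask_positions])  # (n_masked, vocab_size)
    targets = token_ids[mask_positions]          # (n_masked,)
    return F.cross_entropy(logits, targets)
```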
The training model is obtained by inputting a plurality of contract text images marked with positions for extracting information to train a training task, wherein the training task comprises information extraction by utilizing a pre-training model.
Preferably, Fig. 3 is a schematic structural diagram of an embodiment of the training model of the contract text image key information extraction system provided by the present invention; as can be seen from Fig. 3, the training model comprises: an image preprocessing module II, a pre-training layer, and a loss function layer II.
And the second image preprocessing module performs character recognition on the input contract text image marked with the position of the extracted information through an OCR tool and acquires the position of each character.
The pre-training layer comprises a text and image embedding layer and an attention layer in a trained pre-training model, and pre-training is carried out on contract text images marked with positions of extracted information.
The inputs of the loss function layer II are the labels predicted by the training model for the information extraction task and the ground-truth labels from the training set; the parameters of the training model are updated according to the comparison between the predicted and ground-truth labels.
The training model treats the information extraction task as a sequence labeling task, and the pre-training layer uses the network model and parameters of the trained pre-training model. Specifically, the training set can be obtained by BIO-labeling all the characters, where BIO labeling tags each element as "B-X", "I-X", or "O": "B-X" means the element begins a segment of type X, "I-X" means the element lies inside a segment of type X, and "O" means the element belongs to no labeled type.
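A minimal sketch of BIO labeling; the entity-type names and span format are illustrative:

```python
def bio_tags(chars, spans):
    # spans maps an entity type X to the (start, end) character indices
    # of its annotated occurrence; all other characters get 'O'.
    tags = ["O"] * len(chars)
    for ent_type, (start, end) in spans.items():
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return tags

# e.g. bio_tags(list("甲方:某公司"), {"PARTY_A": (3, 6)})
# -> ['O', 'O', 'O', 'B-PARTY_A', 'I-PARTY_A', 'I-PARTY_A']
```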
Specifically, the process by which the image preprocessing module I and the image preprocessing module II acquire the position of each character comprises:
acquiring the horizontal and vertical coordinates (relative to a fixed origin) of the upper-left and lower-right corners of each character's minimum image block, and arranging the characters into rows according to the order of these horizontal and vertical coordinates.
Further, as can be seen from Fig. 2, the text and image embedding vector comprises a text embedding layer and a 2-D coordinate embedding layer, and the 2-D coordinate embedding layer is divided into four layers: embedding layers for the upper-left horizontal, upper-left vertical, lower-right horizontal, and lower-right vertical coordinates of each character's minimum image block.
Taking the upper-left horizontal coordinate embedding layer as an example: when different characters share the same upper-left horizontal coordinate, their upper-left horizontal coordinate embedding vectors are identical.
In addition, the initial text and image embedding vectors may be randomly initialized (e.g., sampled from N(0, 1)) with a uniform vector dimension (e.g., 768); the vectors of the embedding layer are then updated continuously as the pre-training process proceeds.
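For illustration, such an initialization might look like the following sketch (PyTorch assumed; the vocabulary size is a hypothetical value):

```python
import torch

dim = 768           # uniform vector dimension from the description
vocab_size = 21128  # illustrative vocabulary size
# Embedding table sampled from N(0, 1); it is updated continuously
# as the pre-training process proceeds.
text_embedding = torch.randn(vocab_size, dim, requires_grad=True)
```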
The contract text image to be detected is input into the trained training model to obtain the positions and characters of the predefined information to be extracted.
Example 2
Embodiment 2 provided by the present invention is an embodiment of a contract text image key information extraction method based on joint pre-training. Fig. 4 is a flowchart of this embodiment; as can be seen from Fig. 4, the method comprises:
Step 1, defining a pre-training model and a pre-training task of text prediction based on image positions, inputting a plurality of contract text images into the pre-training model, calculating an objective function according to the pre-training task, and then updating the parameters of the pre-training model.
Specifically, N contract text images are prepared; the text information in them does not need to be annotated. The network of the pre-training model can use, for example, a BERT backbone network; the contract text images are then traversed one by one for OCR character recognition, and finally pre-training is performed on the defined pre-training task.
Preferably, the process of training the pre-training model comprises:
and performing character recognition on the input contract text image through an OCR tool, and acquiring the position of each character.
Each word is embedded into the text and image embedding vectors according to its position.
And designing an attention layer of a multi-layer structure, taking a text and image embedding layer as an input of a first layer of the attention layer, and outputting each layer to the next layer after the attention mechanism operation.
And calculating and updating the parameters of the pre-training model through a loss function.
Specifically, during pre-training the implementer processes the N prepared contract text images in turn: first perform OCR character recognition to obtain each character and the 2-D coordinates of its minimum image block; then pass these through the text and image embedding layer to obtain the text embedding vectors and the image 2-D coordinate embedding vectors; then apply the attention computation to these vectors; and finally calculate the objective function according to the pre-training task and update the parameters of the whole pre-training network model.
One full traversal of all N text images counts as one epoch, and the number of epochs can be tuned according to the training effect.
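Putting the steps above together, the pre-training loop might be sketched as follows; `ocr`, `mask_tokens`, and the model interface are assumed helpers, not defined by the patent:

```python
import torch

def pretrain(images, model, ocr, mask_tokens, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):  # one epoch = one pass over all N images
        for image in images:
            token_ids, boxes = ocr(image)                  # characters + 2-D coordinates
            masked_ids, positions = mask_tokens(token_ids)
            # model = embedding layer + attention layers + MLM head;
            # it is assumed here to return the pre-training loss directly
            loss = model(masked_ids, boxes, positions, targets=token_ids)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```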
Step 2, defining a training model and a training task of information extraction using the pre-training model, inputting a plurality of contract text images annotated with the positions of the information to be extracted into the training model, calculating an objective function according to the training task, and updating the parameters of the training model.
Specifically, M contract text images annotated with the positions of the information to be extracted are prepared; a network layer suited to the downstream information extraction task is added on top of the pre-trained network from step 1; and the M contract text images are then traversed one by one with OCR character recognition for formal training.
Preferably, the training process of the training model includes:
and performing character recognition on the input contract text image marked with the position of the extracted information through an OCR tool, and acquiring the position of each character.
Pre-training the contract text images annotated with the positions of the extracted information via the pre-training layer; the pre-training layer consists of the text and image embedding layer and the attention layer of the trained pre-training model.
And updating the parameters of the training model according to the comparison result of the predicted label and the real label.
Specifically, during training the implementer adds, on top of the pre-trained model (text and image embedding layer + attention layer), an FC layer suited to the downstream extraction task (e.g., a fully connected layer with output dimension N_CLASS), then traverses the M contract text images one by one, calculates the objective function according to the training task, and updates the model parameters of the neural network.
One full traversal of all M text images counts as one epoch, and the number of epochs can be tuned according to the training effect.
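The training-stage model described above might be sketched as follows; N_CLASS and the module names are illustrative assumptions:

```python
import torch.nn as nn

N_CLASS = 9  # e.g. B-/I- tags for four entity types plus 'O' (illustrative)

class ExtractionModel(nn.Module):
    def __init__(self, pretrained, hidden_dim=768):
        super().__init__()
        # trained text/image embedding + attention layers from step 1
        self.pretrained = pretrained
        # newly added FC layer for the downstream extraction task
        self.fc = nn.Linear(hidden_dim, N_CLASS)

    def forward(self, token_ids, boxes):
        hidden = self.pretrained(token_ids, boxes)  # (batch, seq_len, hidden_dim)
        return self.fc(hidden)                      # per-character tag logits
```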
Step 3, inputting the contract text image to be detected into the trained training model to obtain the positions and characters of the predefined information to be extracted.
After the pre-training process and the training process are finished, the user can formally enter the use stage:
a text image is input, the OCR algorithm recognizes the characters, and the model automatically extracts the positions and characters of the predefined information in the text.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A contract text image key information extraction system based on joint pre-training is characterized by comprising: pre-training a model and training a model;
the pre-training model is obtained by inputting a plurality of contract text images to perform pre-training task training, wherein the pre-training task comprises text prediction based on image positions;
the training model is obtained by inputting a plurality of contract text images marked with positions for extracting information to train a training task, wherein the training task comprises information extraction by using the pre-training model;
and inputting the contract text image to be detected into the trained model to obtain the positions and characters of the predefined extracted information.
2. The system of claim 1, wherein the pre-training model comprises: an image preprocessing module I, a text and image embedding layer, an attention layer and a loss function layer I;
the first image preprocessing module performs character recognition on an input contract text image through an OCR tool and acquires the position of each character;
the text and image embedding layer embeds each character into a text and image embedding vector according to the position of the character;
the attention layer comprises a plurality of layers, the text and image embedding layer is used as input of a first layer, and each layer is output to a next layer after being operated by an attention mechanism;
and the loss function layer calculates and updates the parameters of the pre-training model through a loss function.
3. The system of claim 2, wherein the training model comprises: an image preprocessing module II, a pre-training layer and a loss function layer II;
the second image preprocessing module performs character recognition on the input contract text image marked with the position of the extracted information through an OCR tool and acquires the position of each character;
the pre-training layer comprises a text and image embedding layer and an attention layer in the pre-training model after training is completed, and pre-training is carried out on the contract text image marked with the position of the extracted information;
and the input of the second loss function layer is a prediction label for extracting information from the training model and a real label in a training set, and the parameters of the training model are updated according to the comparison result of the prediction label and the real label.
4. The system of claim 3, wherein the process by which the first image preprocessing module and the second image preprocessing module obtain the position of each character comprises:
and acquiring the horizontal coordinate and the vertical coordinate of the upper left corner and the lower right corner of each character minimum image block, and arranging the characters into a row according to the magnitude sequence of the horizontal coordinate and the vertical coordinate.
5. The system of claim 3, wherein the text and image embedding vectors comprise a text embedding layer and a 2-D coordinate embedding layer, the 2-D coordinate embedding layer comprising: the device comprises a text minimum image block left upper corner horizontal coordinate embedding layer, a text minimum image block left upper corner vertical coordinate embedding layer, a text minimum image block right lower corner horizontal coordinate embedding layer and a text minimum image block right lower corner vertical coordinate embedding layer.
6. The system of claim 2, wherein the attention mechanism is a multi-headed self-attention mechanism or a common self-attention mechanism;
the attention mechanism, when an ordinary self-attention mechanism is adopted, is computed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q, K and V are all tensors, d_k denotes the size of the last dimension of the tensor K, and T denotes the transpose operation.
7. The system of claim 2, wherein the loss function is calculated as follows: given the 2-D coordinates of each character's minimum image block, the loss is computed with a BERT-based MLM task.
8. A contract text image key information extraction method based on joint pre-training is characterized by comprising the following steps:
step 1, defining a pre-training model and a pre-training task of text prediction based on image positions, inputting a plurality of contract text images into the pre-training model, calculating an objective function according to the pre-training task, and then updating parameters of the pre-training model;
step 2, defining a training model and a training task of information extraction using the pre-training model, inputting a plurality of contract text images marked with positions of the information to be extracted into the training model, calculating an objective function according to the training task, and then updating parameters of the training model;
and step 3, inputting the contract text image to be detected into the trained model to obtain the positions and characters of the predefined extracted information.
9. The method of claim 8, wherein the step 1 training the pre-trained model comprises:
performing character recognition on an input contract text image through an OCR tool, and acquiring the position of each character;
embedding each character into a text and image embedding vector according to the position of the character;
designing an attention layer with a multi-layer structure, taking the text and image embedding layer as the input of a first layer of the attention layer, and outputting each layer to the next layer after the attention mechanism operation;
and calculating and updating the parameters of the pre-training model through a loss function.
10. The method of claim 9, wherein the step 2 training the training model comprises:
performing character recognition on the input contract text image marked with the position of the extracted information through an OCR tool, and acquiring the position of each character;
pre-training the contract text image marked with the position of the extracted information according to a pre-training layer; the pre-training layer is a text and image embedding layer and an attention layer in the pre-training model after training is finished;
and updating the parameters of the training model according to the comparison result of the predicted label and the real label.
CN202011106010.5A 2020-10-15 2020-10-15 Contract text image key information extraction system and method based on joint pre-training Pending CN112329767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011106010.5A CN112329767A (en) 2020-10-15 2020-10-15 Contract text image key information extraction system and method based on joint pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011106010.5A CN112329767A (en) 2020-10-15 2020-10-15 Contract text image key information extraction system and method based on joint pre-training

Publications (1)

Publication Number Publication Date
CN112329767A true CN112329767A (en) 2021-02-05

Family

ID=74313813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011106010.5A Pending CN112329767A (en) 2020-10-15 2020-10-15 Contract text image key information extraction system and method based on joint pre-training

Country Status (1)

Country Link
CN (1) CN112329767A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111680490A (en) * 2020-06-10 2020-09-18 东南大学 Cross-modal document processing method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yiheng Xu et al., "LayoutLM: Pre-training of Text and Layout for Document Image Understanding", arXiv.org, pages 2-5 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801085A (en) * 2021-02-09 2021-05-14 沈阳麟龙科技股份有限公司 Method, device, medium and electronic equipment for recognizing characters in image
CN112926313A (en) * 2021-03-10 2021-06-08 新华智云科技有限公司 Method and system for extracting slot position information
CN112926313B (en) * 2021-03-10 2023-08-15 新华智云科技有限公司 Method and system for extracting slot position information
CN113033660A (en) * 2021-03-24 2021-06-25 支付宝(杭州)信息技术有限公司 Universal language detection method, device and equipment
CN114022883A (en) * 2021-11-05 2022-02-08 深圳前海环融联易信息科技服务有限公司 Financial field transaction file form date extraction method based on model
CN114170482A (en) * 2022-02-11 2022-03-11 阿里巴巴达摩院(杭州)科技有限公司 Model training method, device, equipment and medium
CN114170482B (en) * 2022-02-11 2022-05-17 阿里巴巴达摩院(杭州)科技有限公司 Document pre-training model training method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN108804530B (en) Subtitling areas of an image
CN112329767A (en) Contract text image key information extraction system and method based on joint pre-training
CN110134946B (en) Machine reading understanding method for complex data
CN112560478B (en) Chinese address Roberta-BiLSTM-CRF coupling analysis method using semantic annotation
CN110414009B (en) Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN114926150B (en) Digital intelligent auditing method and device for transformer technology compliance assessment
CN111553350A (en) Attention mechanism text recognition method based on deep learning
CN114170411A (en) Picture emotion recognition method integrating multi-scale information
Selvam et al. A transformer-based framework for scene text recognition
CN110503090A (en) Character machining network training method, character detection method and character machining device based on limited attention model
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
CN114238649A (en) Common sense concept enhanced language model pre-training method
CN110175330A (en) A kind of name entity recognition method based on attention mechanism
CN117609536A (en) Language-guided reference expression understanding reasoning network system and reasoning method
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN112749566B (en) Semantic matching method and device for English writing assistance
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN116362247A (en) Entity extraction method based on MRC framework
CN112613316B (en) Method and system for generating ancient Chinese labeling model
CN115205874A (en) Off-line handwritten mathematical formula recognition method based on deep learning
CN115359486A (en) Method and system for determining custom information in document image
CN114580397A Method and system for detecting abusive and cursory comments
CN114357166A (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination