CN112329767A - Contract text image key information extraction system and method based on joint pre-training - Google Patents

Contract text image key information extraction system and method based on joint pre-training

Info

Publication number
CN112329767A
CN112329767A (application CN202011106010.5A)
Authority
CN
China
Prior art keywords
training
image
text
layer
model
Prior art date
Legal status
Pending
Application number
CN202011106010.5A
Other languages
Chinese (zh)
Inventor
杨威
Current Assignee
Fangzheng Zhushi Wuhan Technology Development Co ltd
Original Assignee
Fangzheng Zhushi Wuhan Technology Development Co ltd
Priority date
Filing date
Publication date
Application filed by Fangzheng Zhushi Wuhan Technology Development Co ltd filed Critical Fangzheng Zhushi Wuhan Technology Development Co ltd
Priority to CN202011106010.5A
Publication of CN112329767A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Character Input (AREA)

Abstract

The invention relates to a system and method for extracting key information from contract text images based on joint pre-training. The system comprises a pre-training model and a training model. The pre-training model is obtained by inputting a plurality of contract text images and training on a pre-training task that includes text prediction based on image positions. The training model is obtained by inputting a plurality of contract text images annotated with the positions of the information to be extracted; its training task includes information extraction using the pre-training model. The contract text image to be analyzed is input into the trained model, which outputs the positions and characters of the predefined information to be extracted. The added pre-training task fuses image features with a character prediction task, so the model learns more prior knowledge; because no data needs to be annotated in the pre-training stage, a large amount of manual labor is saved, and the accuracy of information extraction is higher.

Description

Contract text image key information extraction system and method based on joint pre-training
Technical Field
The invention relates to the field of text image information extraction, and in particular to a contract text image key information extraction system and method based on joint pre-training.
Background
Extraction of key information from contract text images refers to using certain methods to extract, from a scanned contract or contract image, the key information that the user is interested in, such as the contracting entities "Party A" and "Party B", the signing date of the contract, and the contract amount.
Nowadays, many companies still rely on the traditional approach of having staff manually search a commercial contract, entity by entity, for everything to be extracted, including "Party A, Party B, contract date, contract amount" and the like. This is not only time-consuming but also carries a heavy labor cost.
On the other hand, there are also many companies that try to extract key information from the contract text using an automated extraction method.
OCR (optical character recognition) is currently widely used in many fields, such as handwritten character recognition, key information recognition from photographs (e.g., bank cards and identification cards), and character recognition in contract text images. Meanwhile, with the rapid development of deep learning, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and attention mechanisms have been successfully applied to OCR.
For example, Chinese patent application publication No. CN110458162A discloses an end-to-end method for automatically extracting text information based on deep learning (convolutional neural networks): a model is trained directly with a deep learning algorithm on a large amount of labeled training data, after which it can be used in actual recognition work.
However, whether the deep learning algorithm is based on CNN, Faster R-CNN, Mask R-CNN, or GCN, automatic extraction of contract key information with these methods has two obvious shortcomings:
1. They require a large amount of manually labeled training data, i.e., the exact positions of entities such as Party A, Party B, contract date, and contract amount must be annotated in every contract, which is expensive and results in a long project cycle.
2. The model spends a large amount of its capacity learning the structure, layout, and positional relationships of contract text, so the labeled data does not deliver its full supervisory effect, ultimately leading to a less than ideal recognition result.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides a contract text image key information extraction system and method based on joint pre-training to solve them.
The technical scheme for solving the above technical problems is as follows: a contract text image key information extraction system based on joint pre-training, comprising: a pre-training model and a training model;
The pre-training model is obtained by inputting a plurality of contract text images and performing pre-training task training, wherein the pre-training task comprises text prediction based on image positions;
The training model is obtained by inputting a plurality of contract text images annotated with the positions of the information to be extracted and performing training task training, wherein the training task comprises information extraction using the pre-training model;
The contract text image to be detected is input into the trained model to obtain the positions and characters of the predefined information to be extracted.
A contract text image key information extraction method based on joint pre-training comprises the following steps:
Step 1, defining a pre-training model and a pre-training task of text prediction based on image positions, inputting a plurality of contract text images into the pre-training model, calculating an objective function according to the pre-training task, and then updating the parameters of the pre-training model;
Step 2, defining a training model and a training task of information extraction using the pre-training model, inputting a plurality of contract text images annotated with the positions of the information to be extracted into the training model, calculating an objective function according to the training task, and then updating the parameters of the training model;
Step 3, inputting the contract text image to be detected into the trained model to obtain the positions and characters of the predefined information to be extracted.
The invention has the beneficial effects that: a pre-training task is added to the prior-art information extraction methods based on deep learning, and this pre-training task fuses image features with a character prediction task, so the model learns more prior knowledge; because no data needs to be annotated in the pre-training stage, a large amount of manual labor is saved. Compared with a single pre-training mode based on text alone or images alone, the joint image-and-text pre-training mode achieves a better effect and higher information extraction accuracy.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the pre-training model comprises: an image preprocessing module I, a text and image embedding layer, an attention layer, and a loss function layer I;
The image preprocessing module I performs character recognition on an input contract text image through an OCR tool and acquires the position of each character;
The text and image embedding layer embeds each character into a text and image embedding vector according to the position of the character;
The attention layer comprises a plurality of layers; the text and image embedding layer serves as the input of the first layer, and each layer, after its attention mechanism operation, is output to the next layer;
The loss function layer I computes a loss function and updates the parameters of the pre-training model accordingly.
Further, the training model comprises: an image preprocessing module II, a pre-training layer, and a loss function layer II;
The image preprocessing module II performs character recognition, through an OCR tool, on the input contract text images annotated with the positions of the information to be extracted, and acquires the position of each character;
The pre-training layer comprises the text and image embedding layer and the attention layer of the trained pre-training model, and performs pre-training on the contract text images annotated with the positions of the information to be extracted;
The inputs of the loss function layer II are the labels predicted by the training model for the extracted information and the ground-truth labels from the training set; the parameters of the training model are updated according to the comparison between the predicted and ground-truth labels.
Further, the process by which the image preprocessing modules I and II acquire the position of each character comprises:
acquiring the horizontal and vertical coordinates of the upper-left and lower-right corners of each character's minimum image block, and arranging the characters into rows according to the order of these horizontal and vertical coordinates.
Further, the text and image embedding vector comprises a text embedding layer and a 2-D coordinate embedding layer, the 2-D coordinate embedding layer comprising: an embedding layer for the upper-left horizontal coordinate, an embedding layer for the upper-left vertical coordinate, an embedding layer for the lower-right horizontal coordinate, and an embedding layer for the lower-right vertical coordinate of each character's minimum image block.
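A minimal sketch of such an embedding layer, assuming PyTorch; the vocabulary size, coordinate range, and 768-dimensional vectors are illustrative choices (768 matches the dimension mentioned later in the description). Characters that share a coordinate value share the corresponding embedding vector.

```python
import torch.nn as nn

class TextAndImageEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, max_coord=1000, dim=768):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)  # text embedding layer
        self.x1_emb = nn.Embedding(max_coord, dim)     # upper-left horizontal
        self.y1_emb = nn.Embedding(max_coord, dim)     # upper-left vertical
        self.x2_emb = nn.Embedding(max_coord, dim)     # lower-right horizontal
        self.y2_emb = nn.Embedding(max_coord, dim)     # lower-right vertical

    def forward(self, token_ids, boxes):
        # boxes: (batch, seq_len, 4) integer coordinates x1, y1, x2, y2
        x1, y1, x2, y2 = boxes.unbind(-1)
        return (self.text_emb(token_ids)
                + self.x1_emb(x1) + self.y1_emb(y1)
                + self.x2_emb(x2) + self.y2_emb(y2))
```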
Further, the attention mechanism adopts a multi-head self-attention mechanism or an ordinary self-attention mechanism;
When the ordinary self-attention mechanism is adopted, it is computed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where Q, K and V are all tensors, d_k denotes the size of the last dimension of the tensor K, and T denotes the transpose operation.
Further, the loss function is calculated as follows: given the 2-D coordinates of each character's minimum image block, the loss is computed with a BERT-based MLM task.
Further, training the pre-training model in step 1 comprises:
performing character recognition on an input contract text image through an OCR tool, and acquiring the position of each character;
embedding each character into a text and image embedding vector according to the position of the character;
designing an attention layer with a multi-layer structure, taking the text and image embedding layer as the input of the first layer of the attention layer, each layer being output to the next layer after its attention mechanism operation;
and computing a loss function and updating the parameters of the pre-training model accordingly.
Further, training the training model in step 2 comprises:
performing character recognition, through an OCR tool, on the input contract text images annotated with the positions of the information to be extracted, and acquiring the position of each character;
pre-training the annotated contract text images with the pre-training layer, where the pre-training layer consists of the text and image embedding layer and the attention layer of the trained pre-training model;
and updating the parameters of the training model according to the comparison between the predicted and ground-truth labels.
The beneficial effect of adopting the further schemes is that: the pre-training task lets the pre-training model and the training model learn language features and word meanings while also learning features specific to contracts, such as typesetting rules, layout, and visual information, so the pre-trained model performs markedly better on the key information extraction task of the next stage.
Drawings
FIG. 1 is a block diagram of a contract text image key information extraction system based on joint pre-training according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of a pre-training model of a contract text image key information extraction system provided in the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a training model of a contract text image key information extraction system provided by the present invention;
FIG. 4 is a flowchart of an embodiment of a contract text image key information extraction method based on joint pre-training according to the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
The invention provides a contract text image key information extraction system based on joint pre-training. Fig. 1 is a structural block diagram of this system; as can be seen from Fig. 1, the system comprises: a pre-training model and a training model.
The pre-training model is obtained by inputting a plurality of contract text images and performing pre-training task training, wherein the pre-training task comprises text prediction based on image positions.
The training model is obtained by inputting a plurality of contract text images marked with positions for extracting information to train a training task, wherein the training task comprises information extraction by utilizing a pre-training model.
The contract text image to be detected is input into the trained training model to obtain the positions and characters of the predefined information to be extracted.
Specifically, the predefined extraction information may be words or characters set according to requirements.
Unlike deep-learning-based OCR technology, which recognizes key information of contract text images with only a single training stage, the invention not only uses a deep learning algorithm but, most importantly, employs joint pre-training based on images and text.
Many current solutions use a pre-trained model, such as the BERT model pre-trained on natural language and the Mask R-CNN model pre-trained on images.
Taking a natural language pre-training model as an example: in the pre-training stage, training can be completed on plain text without a large amount of manually labeled data. In this stage the model learns the rules of the language and the semantics of words, and after this learning it performs well on the downstream extraction task.
Similarly, in the field of image pre-training, pre-training an image model lets it master the regularities and visual characteristics of common images, which in turn improves its performance on the downstream extraction task.
The invention uses a pre-training technique that combines images and language text. The aim is for the model to learn language features and word meanings while also learning features specific to contracts, such as typesetting rules, layout, and visual information; the pre-trained model is then clearly superior on the key information extraction task of the next stage.
Taking contract information extraction as an example: given the keyword "contract amount" in a contract, the corresponding amount value is very likely to appear to the right of or below the characters "contract amount" and is unlikely to appear to their left or above. The model can learn this important regularity in the pre-training stage.
In addition, in special texts such as contracts, the visual features of the text also carry information, such as whether the text is bold, underlined, or italicized, which is likewise important for the information extraction task.
The invention not only adds a pre-training task to the prior-art information extraction methods based on deep learning, but this task also fuses image features with a character prediction task, so the model learns more prior knowledge; because no data needs to be annotated in the pre-training stage, a large amount of manual labor is saved. Compared with a single pre-training mode based on text alone or images alone, the joint image-and-text pre-training mode achieves a better effect and higher information extraction accuracy.
Example 1
Embodiment 1 provided by the present invention is an embodiment of a contract text image key information extraction system based on joint pre-training, which comprises: a pre-training model and a training model.
The pre-training model is obtained by inputting a plurality of contract text images and performing pre-training task training, wherein the pre-training task comprises text prediction based on image positions.
Preferably, Fig. 2 is a schematic structural diagram of an embodiment of the pre-training model of the contract text image key information extraction system provided by the present invention; as can be seen from Fig. 2, the pre-training model comprises: an image preprocessing module I, a text and image embedding layer, an attention layer, and a loss function layer I.
And the first image preprocessing module performs character recognition on the input contract text image through an OCR tool and acquires the position of each character.
The text and image embedding layer embeds each character into a text and image embedding vector according to the position of the character; specifically, the text and image embedding vectors live in a high-dimensional mathematical vector space.
The attention layer comprises a plurality of layers, the text and image embedding layer is used as input of the first layer, and each layer is output to the next layer after the attention mechanism operation.
Preferably, the attention mechanism may take many forms, such as the common multi-head self-attention mechanism or an ordinary self-attention mechanism.
When the ordinary self-attention mechanism is adopted, it is computed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where Q, K and V are all tensors, d_k denotes the size of the last dimension of the tensor K, and T denotes the transpose operation.
Specifically, the number of attention layers N may be chosen as, for example, 6 or 12; once determined, it remains unchanged throughout the whole process.
The loss function layer I computes a loss function LOSS and updates the parameters of the pre-training model.
The purpose of computing the loss function LOSS is to drive updates of the parameters in the pre-training model (e.g., the text and image embedding vectors).
Preferably, the loss function LOSS can be computed in various ways. Specifically: given the 2-D coordinates of each character's minimum image block, the loss is calculated with the MLM (masked language model) task of BERT (Bidirectional Encoder Representations from Transformers). BERT's pre-training comprises two objectives; the MLM objective masks a certain proportion of the tokens in a sequence and trains the model to successfully predict the tokens at the masked positions.
For example, given a text sequence, suppose the MLM task masks the 3rd character. The model must predict that character from the feature vectors of the text image together with the text feature vectors, finally outputting a prediction distribution P(w3 | w1, w2, w4, ..., wn, image_vecs), where w1, w2, ..., wn denote the characters and image_vecs denotes the image encoding vectors of the minimum image blocks of the n characters; the image encoding model adopts Faster R-CNN.
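The following sketch shows how such a masked-prediction loss could be computed from the fused features; the tensor shapes and the `vocab_proj` projection layer are assumptions for illustration:

```python
import torch.nn.functional as F

def mlm_loss(hidden, token_ids, mask_positions, vocab_proj):
    # hidden:         (batch, seq_len, dim) attention-layer outputs, already
    #                 fused with text and 2-D coordinate (and image) features
    # token_ids:      (batch, seq_len) true character ids
    # mask_positions: (batch, seq_len) bool, True where a character was masked
    # vocab_proj:     nn.Linear(dim, vocab_size) projecting to the vocabulary,
    #                 giving P(w_i | the unmasked context, image_vecs)
    logits = vocab_proj(hidden[mask_positions])  # (n_masked, vocab_size)
    targets = token_ids[mask_positions]          # (n_masked,)
    return F.cross_entropy(logits, targets)
```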
The training model is obtained by inputting a plurality of contract text images marked with positions for extracting information to train a training task, wherein the training task comprises information extraction by utilizing a pre-training model.
Preferably, Fig. 3 is a schematic structural diagram of an embodiment of the training model of the contract text image key information extraction system provided by the present invention; as can be seen from Fig. 3, the training model comprises: an image preprocessing module II, a pre-training layer, and a loss function layer II.
And the second image preprocessing module performs character recognition on the input contract text image marked with the position of the extracted information through an OCR tool and acquires the position of each character.
The pre-training layer comprises a text and image embedding layer and an attention layer in a trained pre-training model, and pre-training is carried out on contract text images marked with positions of extracted information.
The inputs of the loss function layer II are the labels predicted by the training model for the information extraction task and the ground-truth labels from the training set; the parameters of the training model are updated according to the comparison between the predicted and ground-truth labels.
The training model treats the information extraction task as a sequence labeling task, and the pre-training layer uses the network model and parameters of the trained pre-training model. Specifically, the training set can be obtained by BIO-labeling all the characters, where BIO labeling tags each element as "B-X", "I-X", or "O": "B-X" means the element begins a segment of type X, "I-X" means the element lies inside a segment of type X, and "O" means the element belongs to no labeled type.
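A minimal sketch of BIO labeling; the entity-type names and span format are illustrative:

```python
def bio_tags(chars, spans):
    # spans maps an entity type X to the (start, end) character indices
    # of its annotated occurrence; all other characters get 'O'.
    tags = ["O"] * len(chars)
    for ent_type, (start, end) in spans.items():
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return tags

# e.g. bio_tags(list("甲方:某公司"), {"PARTY_A": (3, 6)})
# -> ['O', 'O', 'O', 'B-PARTY_A', 'I-PARTY_A', 'I-PARTY_A']
```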
Specifically, the process by which the image preprocessing module I and the image preprocessing module II acquire the position of each character comprises:
acquiring the horizontal and vertical coordinates (relative to a fixed origin) of the upper-left and lower-right corners of each character's minimum image block, and arranging the characters into rows according to the order of these horizontal and vertical coordinates.
Further, as can be seen from Fig. 2, the text and image embedding vector comprises a text embedding layer and a 2-D coordinate embedding layer, and the 2-D coordinate embedding layer is divided into four layers: embedding layers for the upper-left horizontal, upper-left vertical, lower-right horizontal, and lower-right vertical coordinates of each character's minimum image block.
Taking the upper-left horizontal coordinate embedding layer as an example: when different characters share the same upper-left horizontal coordinate, their upper-left horizontal coordinate embedding vectors are identical.
In addition, the initial text and image embedding vectors may be randomly initialized (e.g., sampled from N(0, 1)) with a uniform vector dimension (e.g., 768); the vectors of the embedding layer are then updated continuously as the pre-training process proceeds.
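For illustration, such an initialization might look like the following sketch (PyTorch assumed; the vocabulary size is a hypothetical value):

```python
import torch

dim = 768           # uniform vector dimension from the description
vocab_size = 21128  # illustrative vocabulary size
# Embedding table sampled from N(0, 1); it is updated continuously
# as the pre-training process proceeds.
text_embedding = torch.randn(vocab_size, dim, requires_grad=True)
```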
The contract text image to be detected is input into the trained training model to obtain the positions and characters of the predefined information to be extracted.
Example 2
Embodiment 2 provided by the present invention is an embodiment of a contract text image key information extraction method based on joint pre-training. Fig. 4 is a flowchart of this embodiment; as can be seen from Fig. 4, the method comprises:
Step 1, defining a pre-training model and a pre-training task of text prediction based on image positions, inputting a plurality of contract text images into the pre-training model, calculating an objective function according to the pre-training task, and then updating the parameters of the pre-training model.
Specifically, N contract text images are prepared; the text information in them does not need to be annotated. The network of the pre-training model can use, for example, a BERT backbone network; the contract text images are then traversed one by one for OCR character recognition, and finally pre-training is performed on the defined pre-training task.
Preferably, the process of training the pre-training model comprises:
and performing character recognition on the input contract text image through an OCR tool, and acquiring the position of each character.
Each word is embedded into the text and image embedding vectors according to its position.
And designing an attention layer of a multi-layer structure, taking a text and image embedding layer as an input of a first layer of the attention layer, and outputting each layer to the next layer after the attention mechanism operation.
And calculating and updating the parameters of the pre-training model through a loss function.
Specifically, during pre-training the implementer processes the N prepared contract text images in turn: first perform OCR character recognition to obtain each character and the 2-D coordinates of its minimum image block; then pass these through the text and image embedding layer to obtain the text embedding vectors and the image 2-D coordinate embedding vectors; then apply the attention computation to these vectors; and finally calculate the objective function according to the pre-training task and update the parameters of the whole pre-training network model.
One full traversal of all N text images counts as one epoch, and the number of epochs can be tuned according to the training effect.
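Putting the steps above together, the pre-training loop might be sketched as follows; `ocr`, `mask_tokens`, and the model interface are assumed helpers, not defined by the patent:

```python
import torch

def pretrain(images, model, ocr, mask_tokens, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):  # one epoch = one pass over all N images
        for image in images:
            token_ids, boxes = ocr(image)                  # characters + 2-D coordinates
            masked_ids, positions = mask_tokens(token_ids)
            # model = embedding layer + attention layers + MLM head;
            # it is assumed here to return the pre-training loss directly
            loss = model(masked_ids, boxes, positions, targets=token_ids)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```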
Step 2, defining a training model and a training task of information extraction using the pre-training model, inputting a plurality of contract text images annotated with the positions of the information to be extracted into the training model, calculating an objective function according to the training task, and updating the parameters of the training model.
Specifically, M contract text images annotated with the positions of the information to be extracted are prepared; a network layer suited to the downstream information extraction task is added on top of the pre-trained network from step 1; and the M contract text images are then traversed one by one with OCR character recognition for formal training.
Preferably, the training process of the training model includes:
and performing character recognition on the input contract text image marked with the position of the extracted information through an OCR tool, and acquiring the position of each character.
Pre-training the contract text images annotated with the positions of the extracted information via the pre-training layer; the pre-training layer consists of the text and image embedding layer and the attention layer of the trained pre-training model.
And updating the parameters of the training model according to the comparison result of the predicted label and the real label.
Specifically, during training the implementer adds, on top of the pre-trained model (text and image embedding layer + attention layer), an FC layer suited to the downstream extraction task (e.g., a fully connected layer with output dimension N_CLASS), then traverses the M contract text images one by one, calculates the objective function according to the training task, and updates the model parameters of the neural network.
One full traversal of all M text images counts as one epoch, and the number of epochs can be tuned according to the training effect.
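The training-stage model described above might be sketched as follows; N_CLASS and the module names are illustrative assumptions:

```python
import torch.nn as nn

N_CLASS = 9  # e.g. B-/I- tags for four entity types plus 'O' (illustrative)

class ExtractionModel(nn.Module):
    def __init__(self, pretrained, hidden_dim=768):
        super().__init__()
        # trained text/image embedding + attention layers from step 1
        self.pretrained = pretrained
        # newly added FC layer for the downstream extraction task
        self.fc = nn.Linear(hidden_dim, N_CLASS)

    def forward(self, token_ids, boxes):
        hidden = self.pretrained(token_ids, boxes)  # (batch, seq_len, hidden_dim)
        return self.fc(hidden)                      # per-character tag logits
```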
Step 3, inputting the contract text image to be detected into the trained training model to obtain the positions and characters of the predefined information to be extracted.
After the pre-training process and the training process are finished, the user can formally enter the use stage:
a text image is input, the OCR algorithm recognizes the characters, and the model automatically extracts the positions and characters of the predefined information in the text.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A contract text image key information extraction system based on joint pre-training is characterized by comprising: pre-training a model and training a model;
the pre-training model is obtained by inputting a plurality of contract text images to perform pre-training task training, wherein the pre-training task comprises text prediction based on image positions;
the training model is obtained by inputting a plurality of contract text images marked with positions for extracting information to train a training task, wherein the training task comprises information extraction by using the pre-training model;
and inputting the contract text image to be detected into the trained model to obtain the positions and characters of the predefined extracted information.
2. The system of claim 1, wherein the pre-training model comprises: an image preprocessing module I, a text and image embedding layer, an attention layer and a loss function layer I;
the first image preprocessing module performs character recognition on an input contract text image through an OCR tool and acquires the position of each character;
the text and image embedding layer embeds each character into a text and image embedding vector according to the position of the character;
the attention layer comprises a plurality of layers, the text and image embedding layer is used as input of a first layer, and each layer is output to a next layer after being operated by an attention mechanism;
and the loss function layer calculates and updates the parameters of the pre-training model through a loss function.
3. The system of claim 2, wherein the training model comprises: an image preprocessing module II, a pre-training layer and a loss function layer II;
the second image preprocessing module performs character recognition on the input contract text image marked with the position of the extracted information through an OCR tool and acquires the position of each character;
the pre-training layer comprises a text and image embedding layer and an attention layer in the pre-training model after training is completed, and pre-training is carried out on the contract text image marked with the position of the extracted information;
and the input of the second loss function layer is a prediction label for extracting information from the training model and a real label in a training set, and the parameters of the training model are updated according to the comparison result of the prediction label and the real label.
4. The system of claim 3, wherein the process by which the first image preprocessing module and the second image preprocessing module obtain the position of each character comprises:
and acquiring the horizontal coordinate and the vertical coordinate of the upper left corner and the lower right corner of each character minimum image block, and arranging the characters into a row according to the magnitude sequence of the horizontal coordinate and the vertical coordinate.
5. The system of claim 3, wherein the text and image embedding vectors comprise a text embedding layer and a 2-D coordinate embedding layer, the 2-D coordinate embedding layer comprising: the device comprises a text minimum image block left upper corner horizontal coordinate embedding layer, a text minimum image block left upper corner vertical coordinate embedding layer, a text minimum image block right lower corner horizontal coordinate embedding layer and a text minimum image block right lower corner vertical coordinate embedding layer.
6. The system of claim 2, wherein the attention mechanism is a multi-headed self-attention mechanism or a common self-attention mechanism;
the attention mechanism, when an ordinary self-attention mechanism is adopted, is computed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q, K and V are all tensors, d_k denotes the size of the last dimension of the tensor K, and T denotes the transpose operation.
7. The system of claim 2, wherein the loss function is calculated as follows: given the 2-D coordinates of each character's minimum image block, the loss is computed with a BERT-based MLM task.
8. A contract text image key information extraction method based on joint pre-training is characterized by comprising the following steps:
step 1, defining a pre-training model and a pre-training task of text prediction based on image positions, inputting a plurality of contract text images into the pre-training model, calculating an objective function according to the pre-training task, and then updating parameters of the pre-training model;
step 2, defining a training model and a training task of information extraction using the pre-training model, inputting a plurality of contract text images marked with positions of the information to be extracted into the training model, calculating an objective function according to the training task, and then updating parameters of the training model;
and step 3, inputting the contract text image to be detected into the trained model to obtain the positions and characters of the predefined extracted information.
9. The method of claim 8, wherein the step 1 training the pre-trained model comprises:
performing character recognition on an input contract text image through an OCR tool, and acquiring the position of each character;
embedding each character into a text and image embedding vector according to the position of the character;
designing an attention layer with a multi-layer structure, taking the text and image embedding layer as the input of a first layer of the attention layer, and outputting each layer to the next layer after the attention mechanism operation;
and calculating and updating the parameters of the pre-training model through a loss function.
10. The method of claim 9, wherein the step 2 training the training model comprises:
performing character recognition on the input contract text image marked with the position of the extracted information through an OCR tool, and acquiring the position of each character;
pre-training the contract text image marked with the position of the extracted information according to a pre-training layer; the pre-training layer is a text and image embedding layer and an attention layer in the pre-training model after training is finished;
and updating the parameters of the training model according to the comparison result of the predicted label and the real label.
CN202011106010.5A 2020-10-15 2020-10-15 Contract text image key information extraction system and method based on joint pre-training Pending CN112329767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011106010.5A CN112329767A (en) 2020-10-15 2020-10-15 Contract text image key information extraction system and method based on joint pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011106010.5A CN112329767A (en) 2020-10-15 2020-10-15 Contract text image key information extraction system and method based on joint pre-training

Publications (1)

Publication Number Publication Date
CN112329767A true CN112329767A (en) 2021-02-05

Family

ID=74313813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011106010.5A Pending CN112329767A (en) 2020-10-15 2020-10-15 Contract text image key information extraction system and method based on joint pre-training

Country Status (1)

Country Link
CN (1) CN112329767A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111680490A (en) * 2020-06-10 2020-09-18 东南大学 Cross-modal document processing method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yiheng Xu et al., "LayoutLM: Pre-training of Text and Layout for Document Image Understanding", arXiv.org, pages 2-5 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801085A (en) * 2021-02-09 2021-05-14 沈阳麟龙科技股份有限公司 Method, device, medium and electronic equipment for recognizing characters in image
CN112926313A (en) * 2021-03-10 2021-06-08 新华智云科技有限公司 Method and system for extracting slot position information
CN112926313B (en) * 2021-03-10 2023-08-15 新华智云科技有限公司 Method and system for extracting slot position information
CN113033660A (en) * 2021-03-24 2021-06-25 支付宝(杭州)信息技术有限公司 Universal language detection method, device and equipment
CN114022883A (en) * 2021-11-05 2022-02-08 深圳前海环融联易信息科技服务有限公司 Financial field transaction file form date extraction method based on model
CN114170482A (en) * 2022-02-11 2022-03-11 阿里巴巴达摩院(杭州)科技有限公司 Model training method, device, equipment and medium
CN114170482B (en) * 2022-02-11 2022-05-17 阿里巴巴达摩院(杭州)科技有限公司 Document pre-training model training method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN108804530B (en) Subtitling areas of an image
CN112329767A (en) Contract text image key information extraction system and method based on joint pre-training
CN110134946B (en) Machine reading understanding method for complex data
CN112560478B (en) Chinese address Roberta-BiLSTM-CRF coupling analysis method using semantic annotation
CN110414009B (en) Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN114926150B (en) Digital intelligent auditing method and device for transformer technology compliance assessment
CN111553350A (en) Attention mechanism text recognition method based on deep learning
CN114170411A (en) Picture emotion recognition method integrating multi-scale information
Selvam et al. A transformer-based framework for scene text recognition
CN110503090A (en) Character machining network training method, character detection method and character machining device based on limited attention model
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
CN114238649A (en) Common sense concept enhanced language model pre-training method
CN110175330A (en) A kind of name entity recognition method based on attention mechanism
CN117609536A (en) Language-guided reference expression understanding reasoning network system and reasoning method
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN112749566B (en) Semantic matching method and device for English writing assistance
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN116362247A (en) Entity extraction method based on MRC framework
CN112613316B (en) Method and system for generating ancient Chinese labeling model
CN115205874A (en) Off-line handwritten mathematical formula recognition method based on deep learning
CN115359486A (en) Method and system for determining custom information in document image
CN114580397A Method and system for detecting abusive and cursory comments
CN114357166A (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination