CN115359486A - Method and system for determining custom information in document image - Google Patents

Method and system for determining custom information in document image

Info

Publication number
CN115359486A
CN115359486A (application number CN202210853880.1A)
Authority
CN
China
Prior art keywords
character
image
information
model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210853880.1A
Other languages
Chinese (zh)
Inventor
宋佳奇
王勇
朱军民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yidao Boshi Technology Co ltd
Original Assignee
Beijing Yidao Boshi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yidao Boshi Technology Co ltd
Priority to CN202210853880.1A
Publication of CN115359486A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content

Abstract

The invention discloses a method and a system for determining custom information in a document image, and relates to the field of computer vision. The method comprises the following steps: performing first question-and-answer task training on a pre-training model by using a machine reading comprehension (MRC) data set combined with the text features, image features, absolute position features and relative position features of each character, to obtain a retraining model; performing second question-and-answer task training on the retraining model by using a specified sample data set to obtain a final model; and inputting a document image for which custom information is to be determined, and outputting the text content of the custom information. According to the technical scheme, zero-sample learning is achieved by combining a pre-training model with a question-answering task: a large-corpus pre-training model and a question-answering system are combined, image and relative position features are added, end-to-end training and prediction are achieved, and the custom information in the document is finally output.

Description

Method and system for determining custom information in document image
Technical Field
The invention relates to the field of computer vision, in particular to a method and a system for determining custom information in a document image.
Background
In practice, key information frequently needs to be extracted from documents. With the continuous development and popularization of deep learning, more and more people have abandoned the traditional, inefficient approach of building one template per class of document and instead extract information by training neural networks. Training a neural network requires, first, data and, second, a definition of the entity types to be extracted, i.e. predefined entities, so that the data can be labeled accordingly. The neural network is trained extensively on these annotated data so that it can learn useful information for extracting the predefined entities.
Real-world situations vary widely, and several problems arise when information is extracted in the above way. First, what if data is scarce? For documents involving the private information of individuals or companies, such as checks and contracts, it is difficult to obtain large amounts of data from the network or other channels, and the small amount of data available is not sufficient to train today's increasingly complex neural networks. Especially when the content or document layout varies greatly from sample to sample, the model can hardly learn enough knowledge from extremely small data to extract information. Second, suppose there is sufficient data and a neural network model is trained well enough to extract a predefined set of key information entities. If other entities (not among the predefined ones) later need to be extracted from the data, it is difficult to extract these untrained entities with the trained model. Of course, the new entities can be labeled on the data set and a model retrained, but this is obviously time-consuming and laborious if the entity types are updated frequently.
After entering the big data era, the available data has grown exponentially, but most of it is unlabeled and may not be related to the specific task to be solved. How can useful knowledge be learned from these massive amounts of data and applied to a specific task? This requires pre-trained models, which are usually trained with unsupervised tasks designed to learn general information in the data, such as image classification or the grammar and syntax of language. Pre-trained models first achieved breakthrough results on ImageNet in the field of computer vision. With the appearance of BERT and its excellent performance, pre-trained models developed rapidly in the NLP field and obtained good results. Once a pre-trained model is obtained, it can be applied to different downstream tasks, such as question answering, text classification, object detection and named entity recognition, by changing its output layer. Compared with a model trained from scratch, a pre-trained model provides good preliminary knowledge, which greatly helps downstream tasks, so that the model converges faster and more accurately.
The question-answering system, a classic task in natural language processing, is an advanced form of information retrieval system that aims to answer questions posed by users in natural language with accurate and concise natural language. Research on question-answering systems dates back to the 1960s; the methods at that time were based on templates and rules, and the robustness and accuracy of the models were poor. Existing question-answering systems use many methods and techniques, which can be divided into two kinds according to the processing method: knowledge-graph-based question-answering systems and reading-comprehension-based question-answering systems. The first kind builds a factual question-answering system on a knowledge graph and finds answers from the knowledge graph, so the accuracy is high. The disadvantage is over-dependence on the knowledge graph: answers beyond the knowledge graph cannot be given, and sufficient resources are required to build a relatively large-scale knowledge graph. The second kind reads unstructured articles to find the answers. The data form is an article around which some questions are posed; the task is to extract answers directly from the article. Common models include FastQAExt, BERT and RoBERTa.
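By way of illustration of the reading-comprehension style of question answering described above, the following minimal sketch uses an off-the-shelf extractive QA pipeline, where the answer is a span copied directly from the article; the checkpoint name and example texts are assumptions and are not part of the present application.

```python
from transformers import pipeline

# Minimal sketch of extractive reading comprehension (assumed checkpoint name):
# the model returns a span of the context, with character offsets, as the answer.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
    question="Who proposed the Transformer?",
    context="The Transformer is a seq2seq model proposed by Google Brain in 2017.",
)
print(result["answer"], result["start"], result["end"])  # answer text and its span in the context
```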
Disclosure of Invention
The application relates to a method and a system for determining custom information in a document image. According to the technical scheme, zero-sample learning is achieved through a pre-training model and a question-answering task: a large-corpus pre-training model and a question-answering system are combined, image and relative position features are added, end-to-end training and prediction are achieved, and the custom information in the document is finally output.
According to a first aspect of the technical solution of the present application, a method for determining custom information in a document image is provided, the method includes the following steps:
step 1: performing first question and answer task training on a pre-training model by using a Machine Reading Comprehension (MRC) data set and combining text features, image features, absolute position features and relative position features of each character to obtain a retraining model;
step 2: performing second question-answering task training on the retraining model by using the specified sample data set to obtain a final model;
and step 3: inputting a document image for which custom information is to be determined, and outputting the text content of the custom information.
Here, "custom information" means that the information that is desired to be determined is not in a predefined entity class that the model has previously been trained on.
Further, in step 1, the MRC data set includes fields of medicine, education, entertainment, encyclopedia, military, law, and the like.
Further, the step 1 specifically includes:
step 11: selecting an initial sample from the machine reading comprehension data set, and inputting the initial sample image, the text in the text boxes identified from the initial sample image, the position coordinates of the text boxes, and a question formed from the entity information;
step 12: performing feature extraction on the text information and the image information based on a pre-training model, dividing the position information into an absolute position and a relative position, and performing feature extraction respectively to obtain a text feature, an image feature, an absolute position feature and a relative position feature;
step 13: based on a multi-head self-attention mechanism, performing feature coding on the text features, the image features, the absolute position features and the relative position features through a Transformer to obtain a coding feature vector of each character;
step 14: and performing two-stage task reasoning on the coding feature vector of each character to obtain a retraining model.
Further, in the step 12, the pre-trained models include a pre-trained Chinese BERT model and a pre-trained ResNet-50 network.
Further, the step 12 specifically includes:
character coding layer: inputting text information including questions and articles, and performing text feature extraction through a pre-trained Chinese BERT model to obtain text features;
image coding layer: inputting image information, and extracting image characteristics by combining a pre-trained ResNet-50 network with ROIAlign to obtain image characteristics;
absolute position encoding layer: numbering each character in the input character string sequentially from 0 to obtain the absolute position feature;
relative position encoding layer: converting the coordinates of the upper left vertex and the lower right vertex of each character box into feature vectors to obtain the relative position features, as sketched in the example following this list.
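A non-limiting sketch of how these four per-character features might be assembled before the feature encoding step is given below; the layer names, dimension sizes, coordinate bin count and the element-wise summation of the embeddings are assumptions in the style of LayoutLM-like models, not a definitive implementation of the present method. The x1/x2 layers and the y1/y2 layers share weights, consistent with the description below.

```python
import torch
import torch.nn as nn

class MultiModalEmbedding(nn.Module):
    """Illustrative sketch only: combines character, image, absolute position and
    relative position features per sequence position. Sizes and the summation are
    assumptions; the patent only enumerates the four coding layers."""
    def __init__(self, vocab_size=21128, hidden=768, max_len=512, coord_bins=1000):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, hidden)    # character coding layer (BERT-initialized per the text)
        self.abs_pos_emb = nn.Embedding(max_len, hidden)    # absolute position coding layer: indices 0..max_len-1
        self.x_emb = nn.Embedding(coord_bins, hidden)       # shared by the x1 and x2 coding layers
        self.y_emb = nn.Embedding(coord_bins, hidden)       # shared by the y1 and y2 coding layers
        self.img_proj = nn.Linear(2048, hidden)             # projects per-character ResNet-50 ROI features

    def forward(self, char_ids, roi_feats, boxes):
        # char_ids: (B, M) token ids; roi_feats: (B, M, 2048) image features shared within a text box
        # boxes: (B, M, 4) integer (x1, y1, x2, y2) indices, zeros for question/special characters
        abs_pos = torch.arange(char_ids.size(1), device=char_ids.device).unsqueeze(0)
        return (self.char_emb(char_ids)
                + self.abs_pos_emb(abs_pos)
                + self.img_proj(roi_feats)
                + self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
                + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
```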
Further, in the character encoding layer, the question is a question composed of the entity information: q1……qi, and the article is the text within the text boxes identified from the initial sample image: t1……tj, where i and j are positive integers.
Further, in the character encoding layer, the text features are a text feature matrix of M × D, where M is the number of input characters and M = i + j +3; d is the input dimension size of the pre-trained Chinese BERT model.
Further, the coding layer sequence in the character coding layer is as follows:
Length:            1      i        1      j        1
Sequence content:  [CLS]  q1……qi   [SEP]  t1……tj   [SEP]
wherein [CLS] is a special character whose word embedding is used for performing classification tasks, and [SEP] is a separator used to separate the question from the article.
Further, the dimension of the character encoding layer weight is N x D, where D is the input dimension of the pre-trained Chinese BERT model, N is the number of characters in the dictionary of the pre-trained Chinese BERT model, and N and D are positive integers.
Further, in the image coding layer, all characters in the same text box share the image feature of the text box.
Further, the coding layer sequence in the image coding layer is:
Length:            1    i+1    j         1
Sequence content:  vf   0……0   v0……vp    0
wherein v0……vp are the image feature vectors of the respective characters, and p ≥ 0.
Further, the coding layer sequence in the absolute position coding layer is:
Length:            1+i+1+j+1
Sequence content:  0……h
wherein h = 1+i+1+j+1.
Further, the relative position encoding layer includes:
an x1 coding layer, wherein x1 is the x-axis coordinate of the upper-left point of the character box of a given character;
an x2 coding layer, wherein x2 is the x-axis coordinate of the lower-right point of the character box of a given character;
a y1 coding layer, wherein y1 is the y-axis coordinate of the upper-left point of the character box of a given character;
a y2 coding layer, wherein y2 is the y-axis coordinate of the lower-right point of the character box of a given character.
Further, the coding layer sequence in the relative position coding layer is:
[Figure in the original: coding layer sequences of the x1/x2 and y1/y2 relative position coding layers]
wherein the x1 coding layer and the x2 coding layer share the same parameters, and the y1 coding layer and the y2 coding layer share the same parameters.
Further, in step 13, a 12-head self-attention mechanism is used to encode the features.
Further, in step 13, the calculation formula is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein the matrices Q, K and V represent Query, Key and Value respectively, i.e. the mappings of the text feature vector/image feature vector/absolute position feature vector/relative position feature vector input to the encoder into three different low-dimensional spaces, and d_k represents the dimension size of the corresponding feature vector.
Further, the step 14 specifically includes:
in the first stage: constructing a first binary classifier which, according to the coding feature vector of each character, divides all characters into meaningful characters related to the start/end positions and meaningless characters related to other information;
in the second stage: constructing a second binary classifier which, based on the coding feature vector of each meaningful character, classifies the meaningful characters related to the start/end positions into start position characters and end position characters,
thereby obtaining the retraining model.
Further, in step 2, the specified sample data set includes a custom question and an article composed of specified data.
Further, the step 3 specifically includes: inputting the document image for which custom information is to be determined into the final model, outputting the start position character and the end position character of the custom information, and thereby determining the custom information.
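A hypothetical inference helper in the spirit of step 3 is sketched below; the function name, the assumed (start logits, end logits) model outputs and the input preprocessing are illustrative only and are not prescribed by the application.

```python
import torch

def extract_custom_entity(final_model, inputs, characters):
    """Illustrative only: `inputs` is the already-encoded question+article sample,
    `characters` lists the character at each sequence position ('' for [CLS]/[SEP]/question)."""
    with torch.no_grad():
        start_logits, end_logits = final_model(**inputs)  # assumed output format of the final model
    start = int(start_logits.argmax(dim=-1))
    end = int(end_logits.argmax(dim=-1))
    if end < start:          # no valid answer span found
        return ""
    return "".join(characters[start:end + 1])
```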
According to a second aspect of the present invention, there is provided a system for determining custom information in a document image, the system comprising: a processor and a memory for storing executable instructions; wherein the processor is configured to execute the executable instructions to perform the method of determining customization information in a document image according to any one of the above aspects.
According to a third aspect of the present invention, there is provided a computer-readable storage medium, characterized in that a computer program is stored thereon, which when executed by a processor implements the method for determining customization information in a document image as described in any one of the above aspects.
The invention has the beneficial effects that:
1. through the combination of the pre-training model and the question-answering task, the extraction precision for trained entities is greatly improved, user-defined entities can also be extracted well, and the generalization capability of the model is strong;
2. before training on the specified samples, the pre-training model is retrained with a question-and-answer task training set composed of a large amount of general text, which greatly improves the model's ability to understand and solve question-answering tasks;
3. the model makes full and efficient use of document features, including the grammar and semantics of the text, the relations between texts within sentences, the image features of the document, and the position of the text on the image; these multi-modal features allow the model to understand the content and layout of the document more fully;
4. currently popular and highly effective pre-trained models such as ResNet-50 and BERT are applied, so that the information learned by the model is richer, learning is faster, and precision is higher.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 illustrates an exemplary shopping ticket.
FIG. 2 shows a diagram of a pre-trained model algorithm architecture according to an embodiment of the invention.
FIG. 3 shows an algorithm flow diagram according to an embodiment of the invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
The terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
"A plurality" means two or more.
The term "and/or" as used in this disclosure merely describes an association relationship between associated objects, meaning that three relationships may exist. For example, "A and/or B" may represent: A exists alone, A and B exist simultaneously, or B exists alone.
The invention relates to an accurate method for extracting custom information from document images. Custom information is information that one wishes to extract but that does not belong to the predefined entity classes on which the model was previously trained. Most neural network models do not handle this kind of custom information extraction well, for the following main reasons:
1. the model's learning of the images and text in documents is insufficient. In order to extract the entities predefined in the training set well, the model pays more attention during learning to the features around those entities and neglects the features of other text in the document as well as global features. Therefore, if a user-defined entity needs to be extracted, accurate extraction cannot be achieved because the features learned by the model are insufficient;
2. the design of the model's output layer. Most entity extraction models classify their output according to predefined categories, and information not in the predefined categories is uniformly classified as 'other'. Therefore, when user-defined categories need to be added later, the model will assign them to the 'other' category unless the output layer is modified and the model retrained, so accurate extraction cannot be achieved.
The invention aims to accurately extract custom entity information without retraining the model or modifying the output layer. Both of the above-mentioned problems are solved well by the pre-trained model and the question-answering task.
1. the language model trained with unsupervised tasks on a large-scale corpus not only has a good understanding of grammar but also a deeper understanding of the semantics within and between sentences, so the model is not limited to a certain type of text or information but focuses on overall understanding, laying a solid foundation for fine-tuning on downstream tasks;
2. regarding the design of the output layer, the technical scheme of the application does not use a classification task but a question-answering task. Each entity to be extracted is formulated as one or more questions. As shown in fig. 1, when the 'total amount' information on a shopping receipt is to be extracted, the question can be designed as "How much money was spent in total for this shopping?", and the article is all the text on the shopping receipt. The question and the article are input into the model, and the output is the start position and end position of the answer span for the question within the article. As long as the entity information appears in the article, a corresponding start and end position exist, which avoids the situation in a classification task where the output has no category corresponding to the user-defined entity.
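For illustration, one possible way to organize such question/article pairs is sketched below; the question wording, dictionary layout and example OCR lines are assumptions, not part of the claimed method.

```python
# Illustrative sketch: each custom entity maps to one or more natural-language questions,
# and the article is all OCR text on the document (with its boxes kept for later features).
ENTITY_QUESTIONS = {
    "total amount": ["How much money was spent in total for this shopping?"],
}

def build_qa_samples(entity, ocr_lines):
    article = "".join(text for text, _box in ocr_lines)
    boxes = [box for _text, box in ocr_lines]
    return [{"question": q, "article": article, "boxes": boxes}
            for q in ENTITY_QUESTIONS[entity]]

samples = build_qa_samples(
    "total amount",
    ocr_lines=[("Supermarket receipt", (10, 5, 180, 20)), ("Total: 36.50", (10, 60, 120, 75))],
)
# The model's answer is then the start/end span of "36.50" inside the article text.
```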
The pre-trained model is obtained by training tasks such as whole word masking on a large-scale corpus (Chinese Wikipedia, Baidu Encyclopedia, news and other data, totalling 5.4B words). With the pre-trained model, the subsequent work is divided into two stages: 1. retraining the pre-trained model on the question-answering task using collected MRC (machine reading comprehension) data covering fields such as medicine, education, entertainment, encyclopedia, military and law, mainly to enhance the model's ability to understand and solve question-answering tasks on general text; 2. training the question-answering task on the model retrained in stage 1 using the target samples to obtain the final model. It should be noted that when the pre-trained model is retrained in stage 1, an image coding layer and a relative position coding layer are added to the coding layers, and the image, text and position features are fused with one another, so that the model understands documents more comprehensively.
Examples
Pre-training model
1) Feature extraction module
The input of the model is the image of the whole sample, the characters in the text boxes recognized by OCR, the position coordinates of the text boxes, and the question formed from the entity information. The main task of this module is to encode these inputs and generate feature vectors that can be fed to the subsequent modules.
The embodiment mainly extracts features in three aspects: text, image, and location.
Text features are extracted by the character encoding layer in this module. The text input is divided into two parts: the question and the article. The question is composed of the entity information to be extracted, and its wording is not unique; the article is the entire textual content of the sample document. Each small square in the algorithm structure diagram represents a character: as shown in fig. 2, the question consists of three characters q1, q2 and q3, and the article consists of four characters t1, t2, t3 and t4. The question and the article are separated by the special character [SEP], and the special character [CLS] is added at the head. The weights of the character coding layer are initialized with a Chinese BERT model pre-trained on a large amount of text such as Chinese Wikipedia; after pre-training the model is able to analyze grammar and semantics, and the initialized coding layer makes training converge faster and more accurately. Assuming the model dictionary has N characters, the dimension of the character coding layer weight is N x D, where D is the input dimension size of the model. Assuming the number of input characters is M, each input character looks up its corresponding vector in the character coding layer according to its position in the dictionary, so an M x D text feature matrix is obtained after passing through the character coding layer.
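A minimal sketch of this character coding step with the Hugging Face transformers library follows; the checkpoint name bert-base-chinese is an assumption (the text only specifies a Chinese BERT model pre-trained on large corpora), and only the embedding lookup is shown.

```python
import torch
from transformers import BertTokenizer, BertModel

# Sketch: the question and article are joined as [CLS] q1...qi [SEP] t1...tj [SEP] and
# looked up in a Chinese-BERT-initialized embedding table, giving an M x D text feature
# matrix (D = 768 for a base model). Checkpoint name is an assumption.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

enc = tokenizer("本次购物一共花了多少钱?", "购物小票上的全部文字", return_tensors="pt")
text_features = bert.embeddings.word_embeddings(enc["input_ids"])  # shape (1, M, 768)
```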
For the input image, the most important preprocessing steps are size normalization with a constant aspect ratio and zero-padding of the borders, so that the image size supports the convolution and down-sampling operations required by the neural network in the encoding module while global and local feature information is preserved as much as possible. Image feature coding mainly uses a deep convolutional neural network to encode each text block and the image features around it. A ResNet-50 pre-trained on the massive ImageNet image set is used as the feature coding network; this model has strong representation ability for images and can extract and represent key image features well. The goal of this step is to output the image feature code corresponding to each text box, so ROIAlign is applied at the position of the text box on the feature map output by the network to obtain the corresponding image feature code. Characters in the same text box share the image features of that text box: as shown in fig. 2, t1 and t2 are assumed to be in different text boxes, so the features in the image coding layer corresponding to t1 and t2 are different, namely v0 and v1; t3 and t4 are in the same text box, so they correspond to the same image feature v2. The image feature of the entire sample document is placed at the position of the first character [CLS]. Since neither the question nor [SEP] appears in the document, their corresponding image features are uniformly set to a zero matrix.
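A rough sketch of this image coding step with torchvision is given below; the input resolution, feature-map stride and 1×1 pooled output are assumptions for illustration, and only a single text box is shown.

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Sketch: an ImageNet-pretrained ResNet-50 backbone produces a feature map, ROIAlign pools
# one vector per text box, and every character in that box shares the pooled vector.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights="IMAGENET1K_V1").children())[:-2]
).eval()

image = torch.rand(1, 3, 512, 512)                    # normalized, zero-padded document image
boxes = [torch.tensor([[32.0, 40.0, 180.0, 72.0]])]   # one text box: (x1, y1, x2, y2) in pixels

with torch.no_grad():
    feature_map = backbone(image)                     # (1, 2048, 16, 16) for a 512x512 input (stride 32)
    box_feats = roi_align(feature_map, boxes, output_size=(1, 1), spatial_scale=1 / 32)
box_vector = box_feats.flatten(1)                     # (num_boxes, 2048), shared by all characters in the box
```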
The position features are divided into absolute and relative parts. The absolute position is the index of a character in the input character string; as shown in fig. 2, the absolute position coding layer is numbered sequentially from 0 and can accept at most 512 characters. The relative position is obtained by converting the coordinates of the upper-left and lower-right corners of the character box recognized by OCR into vectors through an encoding layer; the relative positional relations between characters are then learned by the subsequent feature encoding module. Taking the character t1 in fig. 2 as an example, t1.x1 denotes the x-axis coordinate of the upper-left corner of the character box, t1.x2 the x-axis coordinate of the lower-right corner, t1.y1 the y-axis coordinate of the upper-left corner, and t1.y2 the y-axis coordinate of the lower-right corner. The x1 and x2 coding layers are the same, and the y1 and y2 coding layers are the same. Since the question and the special characters do not appear in the document, their relative position coordinates are uniformly set to 0.
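The following sketch shows one way the OCR box corners could be normalized into integer indices for the x1/x2 and y1/y2 coding layers; the bin count and clamping are assumptions, while the zero coordinates for question and special characters follow the description above.

```python
import torch

def bin_boxes(char_boxes, image_w, image_h, bins=1000):
    """Illustrative only: char_boxes is (M, 4) pixel coordinates (x1, y1, x2, y2) per character,
    with all-zero rows for [CLS], [SEP] and question characters."""
    scale = torch.tensor([bins / image_w, bins / image_h, bins / image_w, bins / image_h])
    return (char_boxes * scale).long().clamp(0, bins - 1)  # integer indices for the coordinate coding layers
```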
2) Feature encoding module
This module mainly consists of a Transformer encoder. The Transformer is a seq2seq model proposed by Google Brain in 2017 in the paper "Attention Is All You Need"; it essentially consists of an encoder and a decoder, and mainly the encoder is used here. The core of the encoder is the self-attention mechanism, which can be understood as computing correlations, the natural idea being to pay more attention where the correlation is large. First, the concepts of Query, Key and Value are introduced: Query denotes the query, Key is compared with Query to obtain a score (a correlation or similarity), and this score is then multiplied by Value to obtain the final result. Multiple sets of Query, Key and Value can be designed to extract different features, which is the multi-head self-attention mechanism; a 12-head self-attention mechanism is used in this module to encode the features, the number of heads being a parameter chosen experimentally. The feature vectors obtained by the feature encoding module contain not only the image features, the grammar and semantics of the characters and sentence features specific to the sample, but also the mutual positional relations between text boxes, so the model can learn a variety of information to better complete the subsequent tasks.
For example, in the multi-head attention mechanism, an input vector x passes through h sets of weight matrices to obtain h sets of query, key and value vectors, so each word has h sets of query, key and value vectors. The attention score of the current word with every word is calculated by multiplying the query vector of the word with the key vector of each of the other words. The attention score is divided by the square root of the first dimension d_k of the weight matrix, and a softmax operation is then performed to obtain a weight value for each word. The weight value obtained for each word is multiplied by that word's value vector and the results are summed, yielding h output matrices Z1, Z2, ..., Zh.
In this algorithm, Query, Key and Value are represented by the matrices Q, K and V, which are the mappings of the feature vectors input to the encoder into three different low-dimensional spaces. The inner product of Q and K is computed, and the result is used to weight V, giving the self-attention feature vector. The calculation formula is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein d_k represents the dimension size of the feature vector.
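The formula above corresponds to the following sketch of scaled dot-product attention; the comment mirrors the 12-head, 768-dimensional setup described in the text, while the code itself is a generic illustration rather than the exact module.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Self-attention as in the formula: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V: (batch, heads, seq_len, d_k) projections of the encoded features."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # pairwise attention scores
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # weighted sum of value vectors

# With 12 heads and a 768-dimensional feature, each head uses d_k = 64; the 12 outputs
# Z1..Z12 are concatenated and linearly projected back to 768 dimensions.
```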
3) Task reasoning module
After the feature encoding module, each character corresponds to a vector containing rich information. This module uses these vectors to find the start position (S) and end position (E) of the answer to the question. This is done in two stages. In the first stage, a binary classifier is constructed to determine whether a character belongs to the start/end positions or to other content (O); each character vector is thus classified into one of two categories, meaningful characters (start or end) and meaningless characters. In the second stage, another binary classifier is constructed to distinguish the start position from the end position among the meaningful characters. The two-stage classification makes each classifier's task more precise; splitting a complex task into several simple tasks improves both inference speed and accuracy. With the start and end positions of the answer, the entity corresponding to the question is the text content from the start position to the end position.
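A schematic of this two-stage task reasoning head is sketched below; the linear layer sizes and the exact way the two binary decisions are combined are assumptions, with only the two-stage binary classification itself taken from the description.

```python
import torch
import torch.nn as nn

class TwoStageSpanHead(nn.Module):
    """Illustrative sketch: stage 1 marks each character as meaningful (start or end) vs other;
    stage 2 decides start vs end, used only where stage 1 predicts 'meaningful'."""
    def __init__(self, hidden=768):
        super().__init__()
        self.meaningful = nn.Linear(hidden, 2)   # stage 1: start/end vs other (O)
        self.start_end = nn.Linear(hidden, 2)    # stage 2: start (S) vs end (E)

    def forward(self, char_vectors):                    # (B, M, hidden) from the feature encoding module
        stage1_logits = self.meaningful(char_vectors)   # (B, M, 2)
        stage2_logits = self.start_end(char_vectors)    # (B, M, 2)
        return stage1_logits, stage2_logits
```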
As shown in fig. 3, the technical solution of the present application has the following flow:
In the first stage, the pre-trained model is retrained using the collected MRC data. The data set at this stage covers a variety of fields such as medicine, education, entertainment, encyclopedia, military and law, and the aim is for the model to learn, from various types of text, the abilities required to solve question-answering tasks, such as the relevance between the question and the article, how to locate the answer in the article through the question, and how to quickly attend to the relevant part of the article through the keywords in the question and the self-attention mechanism.
In the second stage, the model is trained on the specified sample set to extract the desired entity types. Experimental comparison shows that the question-answering task retraining on general text in the first stage effectively improves the extraction precision of the second-stage task.
In the third stage, after carefully analyzing the samples, some custom entities are proposed to check the generalization ability of the model. Experiments show that the model can extract custom entities that it has never been trained on very well, which is something that could not be achieved without the pre-trained model and the first-stage retraining.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above implementation method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation method. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the particular illustrative embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications, equivalent arrangements, and equivalents thereof, which may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (16)

1. A method for determining custom information in a document image is characterized by comprising the following steps:
step 1: performing first question-and-answer task training on a pre-training model by using a machine reading comprehension data set combined with the text feature, image feature, absolute position feature and relative position feature of each character, to obtain a retraining model;
step 2: performing second question-answering task training on the retraining model by using the specified sample data set to obtain a final model;
and step 3: inputting a document image for which custom information is to be determined, and outputting the text content of the custom information.
2. The determination method according to claim 1, wherein the step 1 specifically comprises:
step 11: selecting an initial sample from the machine reading comprehension data set, and inputting the initial sample image, the text in the text boxes identified from the initial sample image, the position coordinates of the text boxes, and a question formed from the entity information;
step 12: performing feature extraction on the text information and the image information based on a pre-training model, dividing the position information into an absolute position and a relative position, and performing feature extraction respectively to obtain a text feature, an image feature, an absolute position feature and a relative position feature;
step 13: based on a multi-head self-attention mechanism, performing feature coding on the text features, the image features, the absolute position features and the relative position features to obtain a coding feature vector of each character;
step 14: and performing two-stage task reasoning on the coding feature vector of each character to obtain a retraining model.
3. The method of claim 2, wherein in step 12, the pre-trained models include a pre-trained Chinese BERT model and a pre-trained ResNet-50 network.
4. The determination method according to claim 3, wherein the step 12 specifically includes:
character coding layer: inputting text information including problems and articles, and extracting text features through a pre-trained Chinese BERT model to obtain text features;
image coding layer: inputting image information, and extracting image features by combining a pre-trained ResNet-50 network with ROIAlign to obtain image features;
absolute position encoding layer: numbering each character in the input character string sequentially from 0 to obtain the absolute position feature;
relative position encoding layer: and converting the coordinates of the upper left vertex and the lower right vertex of the character frame into feature vectors to obtain relative position features.
5. The method according to claim 4, wherein in the character encoding layer, the question is a question composed of the entity information: q1……qi, and the article is the text within the text boxes identified from the initial sample image: t1……tj, wherein i and j are positive integers.
6. The method according to claim 5, wherein in the character encoding layer, the text features are a text feature matrix of M x D, where M is the number of input characters and M = i + j +3; d is the input dimension size of the pre-trained Chinese BERT model.
7. The method of claim 4, wherein the dimension of the character encoding layer weights is N x D, where D is an input dimension of the pre-trained Chinese BERT model, N is a number of characters in a dictionary of the pre-trained Chinese BERT model, and N and D are positive integers.
8. The method according to claim 4, wherein all characters in a same text box in the image coding layer share the image feature of the text box.
9. The determination method according to claim 4, wherein the relative position-coding layer comprises:
x1 coding layer, wherein x1 is a coordinate value on an x axis of a left upper point of a character frame of a certain character;
an x2 coding layer, wherein x2 is a coordinate value on an x axis of a lower right point of a character frame of a certain character;
y1 coding layer, wherein y1 is a coordinate value on the y axis of the upper left point of a character frame of a certain character;
y2 coding layer, y2 is the coordinate value of the lower right point of the character frame of a certain character on the y axis.
10. The method of claim 2, wherein in step 13, a 12-headed self-attention mechanism is used to encode the features.
11. The method of claim 10, wherein in step 13, the formula is calculated as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein the matrices Q, K and V represent Query, Key and Value respectively, i.e. the mappings of the text feature vector/image feature vector/absolute position feature vector/relative position feature vector input to the encoder into three different low-dimensional spaces, and d_k represents the dimension size of the corresponding feature vector.
12. The method according to claim 2, wherein the step 14 specifically includes:
in the first stage: constructing a first binary classifier which, according to the coding feature vector of each character, divides all characters into meaningful characters related to the start/end positions and meaningless characters related to other information;
in the second stage: constructing a second binary classifier which, based on the coding feature vector of each meaningful character, classifies the meaningful characters related to the start/end positions into start position characters and end position characters,
thereby obtaining the retraining model.
13. The method according to claim 1, wherein in step 2, the set of specified sample data includes a custom question and an article composed of specified data.
14. The method according to claim 1, wherein the step 3 specifically includes: inputting a document image of the custom information to be determined to a final model, outputting a starting position character and an ending position character of the custom information, and further determining the custom information.
15. A system for determining custom information in a document image, the system comprising: a processor and a memory for storing executable instructions; wherein the processor is configured to execute the executable instructions to perform a method of determining customization information in a document image in accordance with any one of claims 1 to 14.
16. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements a method of determining customization information in a document image according to any one of claims 1 to 14.
CN202210853880.1A 2022-07-12 2022-07-12 Method and system for determining custom information in document image Pending CN115359486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210853880.1A CN115359486A (en) 2022-07-12 2022-07-12 Method and system for determining custom information in document image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210853880.1A CN115359486A (en) 2022-07-12 2022-07-12 Method and system for determining custom information in document image

Publications (1)

Publication Number Publication Date
CN115359486A true CN115359486A (en) 2022-11-18

Family

ID=84031442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210853880.1A Pending CN115359486A (en) 2022-07-12 2022-07-12 Method and system for determining custom information in document image

Country Status (1)

Country Link
CN (1) CN115359486A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824609A (en) * 2023-06-29 2023-09-29 北京百度网讯科技有限公司 Document format detection method and device and electronic equipment


Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN109271537B (en) Text-to-image generation method and system based on distillation learning
CN110413783B (en) Attention mechanism-based judicial text classification method and system
CN110795543A (en) Unstructured data extraction method and device based on deep learning and storage medium
CN111831789B (en) Question-answering text matching method based on multi-layer semantic feature extraction structure
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN108073576A (en) Intelligent search method, searcher and search engine system
CN114419642A (en) Method, device and system for extracting key value pair information in document image
CN112650845B (en) Question-answering system and method based on BERT and knowledge representation learning
Shao et al. Collaborative learning for answer selection in question answering
CN115858750A (en) Power grid technical standard intelligent question-answering method and system based on natural language processing
CN111597815A (en) Multi-embedded named entity identification method, device, equipment and storage medium
CN115359486A (en) Method and system for determining custom information in document image
CN110674293B (en) Text classification method based on semantic migration
CN114021550A (en) News trend prediction system and method based on graph convolution neural network
Beltr et al. Semantic text recognition via visual question answering
CN113821610A (en) Information matching method, device, equipment and storage medium
Malhotra et al. End-to-End Historical Handwritten Ethiopic Text Recognition using Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination