CN116822634A - Document visual language reasoning method based on layout perception prompt - Google Patents

Document visual language reasoning method based on layout perception prompt Download PDF

Info

Publication number
CN116822634A
Authority
CN
China
Prior art keywords
layout
document
sample
visual
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310817907.6A
Other languages
Chinese (zh)
Inventor
姚俊豪
徐行
王磊
沈复民
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202310817907.6A priority Critical patent/CN116822634A/en
Publication of CN116822634A publication Critical patent/CN116822634A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/046 Forward inferencing; Production systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a document visual language reasoning method based on layout-aware prompting, which uses a large language model to perform reasoning over documents rich in visual information. The method integrates the text information of a document image with its visual information, introduces layout information through prompt learning, and guides the large language model to understand the relationship between the text and the visual content in a question, using this information to improve in-context learning when generating an answer. This enables a single-modality large language model to handle multi-modal document visual question-answering tasks, helps the large language model achieve strong results in few-shot learning, and generalizes across tests on 3 different document visual question-answering datasets.

Description

Document visual language reasoning method based on layout perception prompt
Technical Field
The invention belongs to the field of document understanding within cross-modal understanding and prompt learning, and particularly relates to a document visual language reasoning method based on layout-aware prompting. It studies the robustness of the existing large language model GPT-3 with respect to the training samples available under few-sample conditions, solves the document visual-language multi-modal reasoning task, and examines how to use in-context examples to learn the reasoning process.
Background
Documents are essential to humans because they have long been used to store historical knowledge and information. For this reason, a great deal of research effort has been devoted to improving machine understanding of documents. The research field of document analysis and recognition aims at automatically extracting information presented on paper, originally intended for human understanding. Visual question answering (VQA) is a multi-modal deep learning task that answers text-based questions about an image. A family of visual question-answering tasks has been defined for various application scenarios, including statistical charts, everyday photographs, and born-digital documents. Among these, document visual question answering (Document Visual Question Answering), which aims to extract information from documents and answer natural language questions, is particularly challenging. Given an input image and an associated natural language question, the document visual question-answering task is intended to provide a natural language answer. In recent years, document visual question answering has become an important topic spanning computer vision, natural language understanding, and artificial intelligence. Many pre-training techniques for document analysis and recognition have been proposed and shown to be effective on a variety of documents.
While these models have achieved encouraging results, research in this area has largely been limited to retrieving documents through the lexical content of recognized words rather than from a semantic viewpoint, ignoring the higher-level information that could be extracted from these collections. On the other hand, over the last few years visual question answering (VQA) has been one of the main tasks bridging vision and language.
To accurately identify key text fields, it is necessary to exploit the cross-modal nature of visually rich documents, in which text, visual and layout information should be jointly modeled and learned end-to-end within a single framework. Through the pre-training/fine-tuning paradigm, a pre-trained model absorbs cross-modal knowledge from different document types, in which local invariances of layout and style are preserved. However, when the model must be transferred to another domain with a different document format, it remains unclear whether only a few labeled samples are sufficient to reach state-of-the-art accuracy.
To answer the above question about robustly estimating model capability on document images with few samples, a prompt-learning method for document images, namely Layout-Aware Prompting, has been studied. The design of Layout-Aware Prompting adheres to the following criteria: (1) under few-sample conditions, limited training data can cause a large drop in the performance of pre-trained models; (2) a mapping relationship between text information and layout information is needed, since a large language model cannot otherwise understand the relationship between the text information and the layout information well; (3) the datasets employed should be parsable, easily extensible, and human readable.
In summary, the invention provides a document visual language reasoning method based on layout-aware prompting, which aims to solve the document visual-language multi-modal reasoning task and provides a learning method based on in-context examples for learning the reasoning process under few-sample conditions. The method can improve model performance on multi-modal reasoning tasks, has broad application prospects, and can be applied to fields such as natural language processing, computer vision, and machine learning. The prompt-learning method for few-sample document images is developed from the following three aspects: 1) a prompt-learning method named Layout-Aware Prompting is developed to guide the large language model GPT-3 in visual-language reasoning tasks; 2) the visual information of the document image is converted into text information by Layout-Aware Prompting, and a mapping relationship between the text information and the visual information is designed so that GPT-3 can better understand the information conveyed by the data, covering 3 document visual question-answering datasets; 3) pre-trained visual document understanding models are evaluated.
Disclosure of the invention
The invention aims to study a processing approach different from that of existing document pre-training models: guiding a large language model to solve the visual-language reasoning task through in-context learning with the help of prompt learning, and examining the robustness of the large language model under few-sample data conditions.
The invention relates to a document visual language reasoning method based on layout-aware prompting, named Layout-Aware Prompting, which can help a large language model reason over documents rich in visual information. Layout-Aware Prompting is a prompt that integrates the text information and the visual information of a document image, guides the large language model to understand the relationship between text and visual information in a question, and uses this information to improve in-context learning when generating an answer. The method helps the large language model surpass existing document pre-training models in few-shot learning, and generalizes across tests on 3 different document visual question-answering datasets.
The document visual question-answering datasets are DocVQA (document visual question answering), InfographicVQA (infographic visual question answering), and VisualMRC (visual machine reading comprehension), all of which concern answering questions about visual content. However, there are some important differences between them:
(1) DocVQA: DocVQA is a typical VQA-style task in which natural language questions are posed on a single-page document and answers must be produced by interpreting the document image. No list of predefined responses is given, so the problem cannot easily be treated as an n-way classification task. The focus is on answering questions about a document, such as a PDF file or a scanned text image. These questions are typically about the document content, e.g. "What is the name of the author?" or "What is the meaning of the second paragraph?". DocVQA systems typically use text recognition and layout analysis to extract relevant information from documents and answer questions. The original DocVQA data consists of 10,194/1,286/1,287 images containing 39,463/5,349/5,188 questions for training/validation/testing respectively;
(2) InfographicVQA: InfographicVQA is similar to DocVQA but focuses on answering questions about infographics, which are visual representations of information, data, or knowledge. These questions typically relate to the content of the infographic, such as "What percentage of people like apples but not oranges?". InfographicVQA systems typically use computer vision techniques to analyze the visual elements of the infographic and extract relevant information. The original InfographicVQA data consists of 4,406/500/579 images containing 23,946/2,801/3,288 questions for training/validation/testing respectively;
(3) VisualMRC: VisualMRC is a more general task that involves answering questions about any type of visual content, including photographs, images, and video. These questions may concern anything visible in the visual content, such as "What color is the car in the picture?" or "How many people are in the crowd?". VisualMRC systems typically use a combination of computer vision and natural language processing techniques to analyze visual content and answer questions. The original VisualMRC data consists of 9,574/956/2,237 images containing 21,015/2,839/6,708 questions for training/validation/testing respectively;
in summary, docVQA and infograpicvqa are more specific visual mrc types, focusing on answering questions about documents and information graphics, respectively, while visual mrc is a more general task that can be applied to any type of visual content.
The visual-language reasoning task of document visual question answering focuses on a specific type of visual question-answering task in which visual understanding of the information on the document image is necessary to provide an answer. This goes beyond merely transcribing document images with optical character recognition (OCR); it also includes understanding all types of information conveyed by the document. Document visual issues refer to the visual elements and design considerations involved in document design and layout. These issues may include the following:
(1) Layout and typesetting: this includes how content is organized in the document, which fonts and font sizes are chosen, how paragraphs and headings are arranged, and so on. Good layout and typesetting help readers understand and absorb the contents of a document more easily.
(2) Images and charts: images and charts in documents should support the text content and convey information clearly. The designer needs to consider how to select the most appropriate image and chart types and how to integrate them into the overall design of the document.
(3) Colors and fonts: colors and fonts affect the readability and legibility of the document. The designer needs to select colors and fonts appropriate to the document theme and target readership to ensure that the document is easy to read and understand.
(4) White space and spacing: appropriate white space and spacing help the content of the document to be presented better and make the document easier to read. The designer needs to consider how to balance text and white space, and how to use spacing to help the reader distinguish different paragraphs and sections.
(5) Headings and sections: good heading and section structures make it easier for readers to find the desired information and help them better understand the organization of the document. The designer needs to consider how to select the best heading and section structure and ensure that these elements are coordinated with the overall design of the document.
In summary, document visual issues are the various visual elements and design considerations that need to be taken into account during document design and layout. By considering these issues carefully, a designer can create a high-quality document that is easy to read and understand.
For large language models such as GPT-3, the in-context learning scenario can be seen as a conditional text generation problem. Specifically, the probability of generating the target text y is conditioned on the text fields associated with the target text in the input data, comprising a context C of k examples and the text x to be predicted. Thus, the predicted target text y corresponding to the input text x can be expressed as:
$$p(y \mid C, x)=\prod_{t=1}^{T} p_{LM}\left(y_{t} \mid C, x, y_{<t}\right)$$

where LM denotes the parameters of the language model, y_t (t = 1, 2, ..., T) is the t-th token of the target text y, which contains T tokens in total, and y_{<t} denotes the tokens predicted before the current token y_t; the full input string is organized in the (C, x, y) format, with the context C = {x_1, y_1, x_2, y_2, ..., x_k, y_k} being a string in which x_i, y_i (i = 1, 2, ..., k) are respectively the input and output of the i-th context example, with a text format similar to x and y. In GPT-3, C is created by concatenating k training instances and their corresponding texts.
A prompt refers to using specific text or language cues to guide the model to generate a specific output. In the invention, such a specific text or language cue is called an example and is used as input to the large language model. The example selection method adopts the all-mpnet-base-v2 model from the sentence-transformers library, which maps sentences and paragraphs into a 768-dimensional dense vector space; the retrieval sample whose question has the highest semantic cosine similarity to the test-sample question is selected as the example sample in the prompt design:
$$\operatorname{cosine\_sim}(A, B)=\frac{\operatorname{dot}(A, B)}{\lVert A\rVert\,\lVert B\rVert}$$

where A and B are the sentence vectors to be compared, corresponding to the test-sample question and a retrieval-sample question; cosine_sim(A, B) is the semantic cosine similarity of sentences A and B; dot(A, B) is the dot product of the sentence vectors A and B; and ‖A‖ and ‖B‖ are the Euclidean norms (lengths) of the vectors A and B. The difference in effect between selection by semantic similarity and random selection is compared in subsequent experiments.
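As an illustration of this example-selection step, the following sketch uses the sentence-transformers package and the all-mpnet-base-v2 checkpoint named above to embed questions and pick the retrieval question with the highest cosine similarity to a test question; the function and variable names are illustrative assumptions rather than part of the patent.

```python
# Minimal sketch of example selection by semantic cosine similarity, assuming
# the sentence-transformers package; names here are illustrative, not from the patent.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # maps sentences/paragraphs to 768-d vectors

def select_example(test_question, retrieval_questions):
    """Return the index of the retrieval-set question most similar to the test question."""
    q_emb = model.encode(test_question, convert_to_tensor=True)
    r_emb = model.encode(retrieval_questions, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, r_emb)  # cosine_sim(A, B) = dot(A, B) / (||A|| * ||B||)
    return int(scores.argmax())
```

The retrieval sample at the returned index would then serve as the in-context example in the prompt design described below.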
The document pre-training models include text-only pre-trained models: BERT and RoBERTa; a text-and-layout model: LiLT; and text, layout, and image pre-trained models: LayoutLM, LayoutLMv2, LayoutLMv3, and ERNIE-Layout. The benchmark experiments evaluate the robustness of pre-trained visual document understanding models when fine-tuned on the downstream task, from the full-sample setting down to as few as one sample.
The invention provides a document visual language reasoning method based on layout-aware prompting, which studies the vulnerability of the existing large language model GPT-3 with respect to the training samples available under few-sample conditions, and solves the document visual-language multi-modal reasoning task (step 4) by using in-context examples to learn the reasoning process. The method comprises the following steps:
step 1: data preprocessing, namely selecting three data sets to perform experiments, namely DocVQA, infographicVQA and visual MRC, performing preprocessing operations, such as denoising, binarization, rotation correction, inclination correction and the like, on a document image in the data sets through an Optical Character Recognition (OCR) algorithm on any one of the 3 data sets to improve the accuracy and efficiency of subsequent processing, extracting characteristic information, such as outline, edge, projection and the like, of characters from the document image, and performing character recognition, wherein the character recognition is to match the characteristic information with a trained Optical Character Recognition (OCR) model to determine characters in the document image, and finally outputting recognition results into text formats which can be edited and processed by a computer, such as TXT, JSON and the like. The method comprises the steps of executing the operations on all 3 data sets, respectively extracting text information in data in the three data sets, sorting layout information and information about questions, question numbers, answers and the like in the data sets into data texts of corresponding JSON format files, and dividing the data texts into a retrieval data set and a test data set according to preset proportion;
step 2: an example sample is chosen that is used to help the large language model understand the data format and question-and-answer form required for the task. The method comprises the steps of forming a set A by the aid of the data text of a JSON format file obtained in the step 1, extracting the problem of any one test sample in a test data set by the aid of the problems of all search samples in the search data set, searching in the set A by the aid of an all-mpnet-base-v2 model in a sense-transformers algorithm, and respectively calculating the semantic cosine similarity of the problem of the test sample and each problem in the set A, so that the problem of a search sample with the highest semantic similarity to the problem of the test sample is searched, wherein the search sample corresponding to the problem of the search sample is used as an example sample in prompt design, and is used for prompt design in the step 3, and the all-mpnet-base-v2 model maps sentences and paragraphs to 768-dimensional dense vector space;
step 3: a prompt (prompt) is designed to guide the model to generate a particular output using a particular text or language prompt. There are 3 kinds of cues: plain text cues, text and layout discrete cues, and layout aware cues. (1) Plain text prompts, i.e. only text information of the document data has no layout information; (2) The text and layout discrete prompt is a format in which the text and the layout are respectively added into the prompt and the prompt header is designed to inform the model of text data and layout data, and a simple corresponding relation between the text and the layout data; (3) The layout-aware hint is to sort the data of the sample and the test sample obtained in the step 2 into a hint, and design the hint into a data stream format (hint header, context sample, test sample), wherein the hint header (prompting head) is used for informing the GPT-3 context sample (in-context demonstration) and the test sample (testing demonstration) how the data format is, and the hint answers the question according to the text information and the layout information, and the specific format is "the data format is { text: boxes } (i.e., { text: corresponding OCR boxes }), wherein the OCR box corresponding to each text is defined by four coordinates: [ x1, y1, x2, y2]. Where x1 and y1 refer to The abscissa of The upper left corner of The OCR frame, x2 and y2 refer to The abscissa of The lower right corner of The OCR frame, respectively, x1, y1, x2, y2 are used to represent The position of The OCR frame in The document, please answer questions according to The data format described above, i.e., "The data form is { text: boxes }, where each boxes is defined by four coordinates: [ [ x1, y1, x2, y2] ]. The x1 and y1 refer to The horizontal and vertical coordinates of The upper-left corner of The OCR boxes, and The x2 and y2 refer to The horizontal and vertical coordinates of The lower-right corner of The OCR boxes which indicate The position of The OCR boxes within The document, please answer The question according to The above data form ]. The text information and the layout information extracted in the step 1 are designed into a mapping relation { text: corresponding OCR boxes }, namely { text: boxes }, which is used as a data format of a context sample and a test sample, and the format is called as a 'layout perception prompt'. The context sample is that in step 2, a large language model, such as GPT-3, is guided according to the example data selected by the closest similarity to the problem semantics of the test sample, and the understanding is that the question-answer reasoning task data patterns are to be processed, and the context sample comprises a context layout perception prompt, a context sample problem and a context sample answer; the test sample comprises a test sample layout perception prompt and a test sample problem. Finally, comparing test results of three prompting modes, namely a plain text prompting mode, a text and layout discrete prompting mode and a layout perception prompting mode through experiments;
step 4: the designed hints are passed to GPT-3 for a few sample document visual question-answer reasoning task, using the average normalized Levenstein distance (Average Normalized Levenshtein distance) criterion to evaluate the accuracy of generating the answer:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i \neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$

where lev_{a,b}(i, j) denotes the Levenshtein distance between the first i characters of string a and the first j characters of string b; 1_{(a_i ≠ b_j)} is an indicator function that equals 0 when a_i = b_j and 1 otherwise; a_i denotes the i-th character of string a and b_j the j-th character of string b. Within the min operation, the first term lev_{a,b}(i-1, j)+1 corresponds to deleting a character from string a to reach b; the second term lev_{a,b}(i, j-1)+1 corresponds to inserting a character; and the third term lev_{a,b}(i-1, j-1)+1_{(a_i ≠ b_j)} corresponds to substituting a character (depending on whether the current characters are identical); min and max denote the operations of taking the minimum and maximum values, respectively. In linguistics, the Levenshtein distance is used as a measure for quantifying the distance between texts, i.e., the difference between two texts. It relates to mutual intelligibility: the higher the text distance, the lower the mutual intelligibility, and the lower the text distance, the higher the mutual intelligibility. Randomly selected example samples are compared with semantically similar ones in subsequent experiments;
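For illustration, the sketch below computes the Levenshtein recurrence above and an averaged normalized score over predicted and reference answers; the 0.5 threshold and the single-reference simplification are common ANLS conventions assumed here, not stated in the patent.

```python
# Minimal sketch of the Levenshtein distance recurrence and an averaged
# normalized score; the 0.5 threshold and one reference answer per question
# are assumptions made only for illustration.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))                     # lev(0, j) = j
    for i, ca in enumerate(a, start=1):
        cur = [i]                                      # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # substitution (indicator term)
        prev = cur
    return prev[-1]

def average_normalized_levenshtein(predictions, answers, threshold=0.5):
    scores = []
    for pred, ans in zip(predictions, answers):
        nl = levenshtein(pred.lower(), ans.lower()) / max(len(pred), len(ans), 1)
        sim = 1.0 - nl                                 # normalized similarity in [0, 1]
        scores.append(sim if sim >= threshold else 0.0)
    return sum(scores) / len(scores)
```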
step 5: the robustness of the different models with few samples (few-shot) was investigated. Different document pre-training models are adopted, wherein the document pre-training models comprise text pre-training models: BERT and RoBEATA; text and layout model LiLT; text, layout, and image model pre-training models: layoutLM, layoutLMv2, layoutlmv3, ERNIELayout. The parameters set for the 3 different sets of document visual question-answer data are the same: setting the learning rate to 2×e -5 The training epoch was set to 40 and all input images were 224x 224 pixels in resolution. The batch in training during full sample training is set to 4, the batch in testing is set to 1, the batch in training during few samples training is set to 1, the batch in testing is set to 1, and the reasoning effect of the models under the condition of few samples is compared. Furthermore, cases of different numbers of example samples in the case of a small number of samples, and different numbers of problems in the different example samples were explored.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) Existing multi-modal models solve the document visual question-answering task in the pre-training/fine-tuning paradigm, but this approach is quite time-consuming and places high demands on machine configuration. To address these problems, a large language model is guided to perform the reasoning task through prompt learning, which is simple and convenient: a task can generally be evaluated in 3-6 hours, whereas the pre-training/fine-tuning paradigm usually needs several A100 GPUs and several hours to complete the evaluation, and the reasoning accuracy on specific tasks is higher than that of the best pre-training/fine-tuning model;
(2) Layout information is introduced through prompt learning to guide the large language model to understand the relationship between the text and the visual content in the question, and this information is used to improve in-context learning when generating an answer, so that a single-modality large language model can also handle the multi-modal document visual question-answering task, with higher answer accuracy than ordinary prompting;
(3) Multi-modal models are evaluated and compared with Layout-Aware Prompting under few-sample conditions on different document visual question-answering datasets; the proposed way of guiding a large language model through prompt learning, together with the demonstration study and in-depth analysis, is expected to benefit future research and improve model accuracy in answering document visual reasoning tasks.
Drawings
FIG. 1 is a sample view of a document image;
FIG. 2 is a flow chart of a method of implementing the present invention;
FIG. 3 is a schematic diagram of a design hint learning method of the present invention;
FIG. 4 is a schematic diagram of the example selection method of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the embodiments and the accompanying drawings, so that those skilled in the relevant art can better understand the present invention. It should be particularly noted that the described embodiments are some, but not all embodiments of the invention and are not intended to limit the scope of the invention as claimed. All other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present invention.
Considering that existing fine-tuned document pre-training models solve the document visual question-answering task in the pre-training/fine-tuning paradigm, which is quite time-consuming, places high demands on machine configuration, and performs poorly on few-shot data, the invention provides a document visual language reasoning method based on layout-aware prompting, which relies on strong in-context learning capability to guide a large language model to understand the information conveyed by the data and to solve the document visual-language reasoning task. The guided large language model is able to understand the relationship between text and visual content in the question and to use this information to improve in-context learning when generating answers. The method helps the large language model achieve good results in few-shot learning and generalizes across tests on 3 different document visual question-answering datasets.
The document visual question-answering datasets are DocVQA (document visual question answering), InfographicVQA (infographic visual question answering), and VisualMRC (visual machine reading comprehension), all of which concern answering questions about visual content. However, there are some important differences between them:
(1) DocVQA: DocVQA is a typical VQA-style task in which natural language questions are posed on a single-page document and answers must be produced by interpreting the document image. No list of predefined responses is given, so the problem cannot easily be treated as an n-way classification task. The focus is on answering questions about a document, such as a PDF file or a scanned text image. These questions are typically about the document content, e.g. "What is the name of the author?" or "What is the meaning of the second paragraph?". DocVQA systems typically use text recognition and layout analysis to extract relevant information from documents and answer questions. The original DocVQA data consists of 10,194/1,286/1,287 images containing 39,463/5,349/5,188 questions for training/validation/testing respectively;
(2) InfographicVQA: InfographicVQA is similar to DocVQA but focuses on answering questions about infographics, which are visual representations of information, data, or knowledge. These questions typically relate to the content of the infographic, such as "What percentage of people like apples but not oranges?". InfographicVQA systems typically use computer vision techniques to analyze the visual elements of the infographic and extract relevant information. The original InfographicVQA data consists of 4,406/500/579 images containing 23,946/2,801/3,288 questions for training/validation/testing respectively;
(3) VisualMRC: VisualMRC is a more general task that involves answering questions about any type of visual content, including photographs, images, and video. These questions may concern anything visible in the visual content, such as "What color is the car in the picture?" or "How many people are in the crowd?". VisualMRC systems typically use a combination of computer vision and natural language processing techniques to analyze visual content and answer questions. The original VisualMRC data consists of 9,574/956/2,237 images containing 21,015/2,839/6,708 questions for training/validation/testing respectively;
in summary, docVQA and infograpicvqa are more specific visual mrc types, focusing on answering questions about documents and information graphics, respectively, while visual mrc is a more general task that can be applied to any type of visual content.
The visual-language reasoning task of document visual question answering (DocVQA) focuses on a specific type of visual question-answering task in which visual understanding of the information on the document image is necessary to provide an answer. This goes beyond merely transcribing document images with OCR; it also includes understanding all types of information conveyed by the document: text content (handwritten or typed), non-text elements (labels, checkmarks, separators, charts), layout (page structure, forms, tables), style (fonts, colors, highlighting), and so on, all of which may be needed when answering the question at hand. Document visual issues refer to the visual elements and design considerations involved in document design and layout. These issues may include the following:
(1) Layout and typesetting: this includes how content is organized in the document, which fonts and font sizes are chosen, how paragraphs and headings are arranged, and so on. Good layout and typesetting help readers understand and absorb the contents of a document more easily.
(2) Images and charts: images and charts in documents should support the text content and convey information clearly. The designer needs to consider how to select the most appropriate image and chart types and how to integrate them into the overall design of the document.
(3) Colors and fonts: colors and fonts affect the readability and legibility of the document. The designer needs to select colors and fonts appropriate to the document theme and target readership to ensure that the document is easy to read and understand.
(4) White space and spacing: appropriate white space and spacing help the content of the document to be presented better and make the document easier to read. The designer needs to consider how to balance text and white space, and how to use spacing to help the reader distinguish different paragraphs and sections.
(5) Headings and sections: good heading and section structures make it easier for readers to find the desired information and help them better understand the organization of the document. The designer needs to consider how to select the best heading and section structure and ensure that these elements are coordinated with the overall design of the document.
In summary, document visual issues are the various visual elements and design considerations that need to be taken into account during document design and layout. By considering these issues carefully, a designer can create a high-quality document that is easy to read and understand.
For large language models such as GPT-3, the in-context learning scenario can be seen as a conditional text generation problem. Specifically, the probability of generating the target text y is conditioned on the text fields associated with the target text in the input data, comprising a context C of k examples and the text x to be predicted. Thus, the predicted target text y corresponding to the input text x can be expressed as:
$$p(y \mid C, x)=\prod_{t=1}^{T} p_{LM}\left(y_{t} \mid C, x, y_{<t}\right)$$

where LM denotes the parameters of the language model, y_t (t = 1, 2, ..., T) is the t-th token of the target text y, which contains T tokens in total, and y_{<t} denotes the tokens predicted before the current token y_t; the full input string is organized in the (C, x, y) format, with the context C = {x_1, y_1, x_2, y_2, ..., x_k, y_k} being a string in which x_i, y_i (i = 1, 2, ..., k) are respectively the input and output of the i-th context example, with a text format similar to x and y. In GPT-3, C is created by concatenating k training instances and their corresponding texts.
A prompt refers to using specific text or language cues to guide the model to generate a specific output. In the invention, such a specific text or language cue is called an example and is used as input to the large language model. The example selection method adopts the all-mpnet-base-v2 model from the sentence-transformers library, which maps sentences and paragraphs into a 768-dimensional dense vector space; the retrieval sample whose question has the highest semantic cosine similarity to the test-sample question is selected as the example sample in the prompt design:
$$\operatorname{cosine\_sim}(A, B)=\frac{\operatorname{dot}(A, B)}{\lVert A\rVert\,\lVert B\rVert}$$

where A and B are the sentence vectors to be compared, corresponding to the test-sample question and a retrieval-sample question; cosine_sim(A, B) is the semantic cosine similarity of sentences A and B; dot(A, B) is the dot product of the sentence vectors A and B; and ‖A‖ and ‖B‖ are the Euclidean norms (lengths) of the vectors A and B. The difference in effect between selection by semantic similarity and random selection is compared in the subsequent ablation experiments.
The document pre-training models include text-only pre-trained models: BERT and RoBERTa; a text-and-layout model: LiLT; and text, layout, and image pre-trained models: LayoutLM, LayoutLMv2, LayoutLMv3, and ERNIE-Layout. The benchmark experiments evaluate the robustness of pre-trained visual document understanding models when fine-tuned on the downstream task, from the full-sample setting down to as few as one sample.
Step 1: data preprocessing. Three datasets are selected for experiments, namely DocVQA, InfographicVQA and VisualMRC. For any one of the 3 datasets, preprocessing operations such as denoising, binarization, rotation correction and skew correction are applied to the document images as part of an optical character recognition (OCR) pipeline to improve the accuracy and efficiency of subsequent processing; character feature information such as contours, edges and projections is extracted from the document images, and character recognition is performed, in which the feature information is matched against a trained OCR model to determine the characters in the image; finally, the recognition results are output in text formats that a computer can edit and process, such as TXT and JSON. Through these operations, the text information in the three datasets is extracted, and the layout information together with the questions, question numbers, answers and other information in the datasets is organized into the data text of corresponding JSON-format files; the data text obtained from each dataset is divided into a retrieval dataset and a test dataset;
step 2: an example sample (sample) is chosen that is used to help the large language model understand the data format and question-answer form required for the task. And (3) forming a set A by the data text of the JSON format file obtained in the step (1) from the problems of all the search samples in the search data set, extracting the problem of any one test sample in the test data set, searching in the set A by an all-mpnet-base-v2 model in a sense-transformers algorithm, and respectively calculating the semantic cosine similarity of the problem of the test sample and each problem in the set A so as to search the problem of the search sample with the highest semantic similarity with the problem of the test sample, wherein the search sample corresponding to the problem of the search sample is used as an example sample in the prompt design for the prompt design of the step (3), and the all-mpnet-base-v2 model maps sentences and paragraphs to a 768-dimensional dense vector space. In a different alternative, the effect exhibited by Layout-Aware sampling is also affected, and table 1 shows that the use of semantic similar sampling examples in DocVQA is better than the random sampling examples, where ANLS ∈represents the average normalized levenstein distance, the higher the text distance, the lower the mutual intelligibility, the lower the text distance, and the higher the mutual intelligibility, so the better the ANLS value;
table 1 test results for different alternatives
Step 3: a prompt is designed, i.e., specific text or language cues are used to guide the model to generate a specific output. There are 3 kinds of prompts: the plain-text prompt, the text-and-layout discrete prompt, and the layout-aware prompt. (1) The plain-text prompt contains only the text information of the document data, without layout information; (2) the text-and-layout discrete prompt adds the text and the layout to the prompt separately, with a prompt head designed to tell the model which data are text and which are layout data, together with a simple correspondence between them; (3) the layout-aware prompt organizes the data of the example sample and the test sample obtained in step 2 into a prompt designed in the data-stream format (prompt head, context sample, test sample). The prompt head (prompting head) tells GPT-3 what the data format of the context sample (in-context demonstration) and the test sample (testing demonstration) is, and asks it to answer the question according to the text information and the layout information. The specific format states that the data form is {text: boxes} (i.e., {text: corresponding OCR box}), where each box is defined by four coordinates [x1, y1, x2, y2], with x1 and y1 the horizontal and vertical coordinates of the upper-left corner of the OCR box and x2 and y2 those of the lower-right corner, indicating the position of the OCR box within the document, and the question should be answered according to this data format; in English: "The data form is {text: boxes}, where each box is defined by four coordinates: [x1, y1, x2, y2]. The x1 and y1 refer to the horizontal and vertical coordinates of the upper-left corner of the OCR boxes, and the x2 and y2 refer to the horizontal and vertical coordinates of the lower-right corner of the OCR boxes, which indicate the position of the OCR boxes within the document. Please answer the question according to the above data form." Then, the text information and the layout information extracted in step 1 are arranged into the mapping relation {text: corresponding OCR boxes}, i.e., {text: boxes}, which serves as the data format of the context sample and the test sample; this format is called the "layout-aware prompt". The context sample is the example data selected in step 2 as semantically closest to the test-sample question, and it guides GPT-3 to understand what kind of question-answer reasoning data it is expected to process; it comprises a context layout-aware prompt, a context sample question and a context sample answer. The test sample comprises a test-sample layout-aware prompt and a test-sample question. Table 2 compares the effects with and without designing the text information and the layout information into a mapping relation, showing that the mapping between text and layout information guides GPT-3 better; it reports the effect of the mapping with and without layout information and the average accuracy of the model on different datasets with different numbers of samples for both cases.
The results show that omitting layout information gives poor results, and that designing the text information and the layout information into a mapping relation guides GPT-3 better. Specifically, compared with the case where no mapping relation is designed, designing the mapping relation can significantly improve the accuracy of the model when answering questions by reasoning. This shows that, by designing the text information and the layout information into a mapping relation, the layout information can be better used to guide the model's predictions, thereby improving the accuracy of the model's reasoning and question answering.
Table 2 Test results for prompts with different sample formats
Note: w/o boxes indicates no position information; w/ boxes indicates that position information is included; split indicates that the text information and the position information are given separately; mapping indicates that the text information is mapped to the position information; ANLS↑ denotes the average normalized Levenshtein score, where a higher value indicates a better result.
Step 4: the designed prompt is passed to GPT-3 to perform the few-sample document visual question-answer reasoning task, and the average normalized Levenshtein distance (Average Normalized Levenshtein distance) criterion is used to evaluate the accuracy of the generated answers:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i \neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$

where lev_{a,b}(i, j) denotes the Levenshtein distance between the first i characters of string a and the first j characters of string b; 1_{(a_i ≠ b_j)} is an indicator function that equals 0 when a_i = b_j and 1 otherwise; a_i denotes the i-th character of string a and b_j the j-th character of string b. Within the min operation, the first term lev_{a,b}(i-1, j)+1 corresponds to deleting a character from string a to reach b; the second term lev_{a,b}(i, j-1)+1 corresponds to inserting a character; and the third term lev_{a,b}(i-1, j-1)+1_{(a_i ≠ b_j)} corresponds to substituting a character (depending on whether the current characters are identical); min and max denote the minimum and maximum operations, respectively. In linguistics, the Levenshtein distance is used as a measure for quantifying the distance between texts, i.e., the difference between two texts. It relates to mutual intelligibility: the higher the text distance, the lower the mutual intelligibility, and the lower the text distance, the higher the mutual intelligibility. The number of example samples and the number of questions within a sample affect the final result, but more is not always better; Table 3 shows the average accuracy of the method of the invention with different numbers of example samples and different numbers of questions per sample. The results show that on the DocVQA dataset, the best case is 1 text passage with 4 questions in the 1-shot setting. This may be because the questions in the example sample and the test sample are semantically similar, illustrating that step 2 plays a significant role in designing the in-context learning example. Because step 2 is designed to relate the context information of the example sample closely to the question, the model can infer and generalize better. Therefore, the results show that in few-sample learning, a reasonable design of the context information of the example sample can significantly improve the model's few-shot reasoning, thereby improving the accuracy of question answering;
TABLE 3 test results for different sample numbers and number of questions in a sample on a document visual question-answer dataset
Step 5: the few-shot robustness of different models is investigated. Different document pre-training models are adopted, including text-only pre-trained models: BERT and RoBERTa; a text-and-layout model: LiLT; and text, layout, and image pre-trained models: LayoutLM, LayoutLMv2, LayoutLMv3, and ERNIE-Layout. The few-shot reasoning performance of these models is compared. For the document visual question-answering task, the parameters set for the 3 different datasets are the same: the learning rate is set to 2e-5, the number of training epochs is set to 40, and all input images have a resolution of 224x224 pixels. The training batch size is set to 4 and the test batch size to 1 for full-sample training, while both are set to 1 for few-sample training. Table 4 compares the reasoning performance of the different models with full samples and with few samples. The results show that the performance of existing document pre-training models differs greatly between the full-sample and few-sample settings, indicating that they generalize poorly with few samples. In contrast, Layout-Aware Prompting ("Ours" in Table 4) performs better under few-sample conditions, although there remains a gap to existing document pre-training models under full-sample conditions. This may be because a single-modality large language model does not yet perceive vision as well as the vision modules provided in existing document pre-training models. These results therefore indicate that existing document pre-training models may have certain limitations when dealing with small-sample data, and that Layout-Aware Prompting can be an effective solution, though further improvements are still needed to increase its effectiveness.
Table 4 benchmark experimental results
Note: Ours denotes the method of the invention; full-sample denotes the full-sample setting; few-shot denotes the few-sample setting; ANLS↑ denotes the average normalized Levenshtein score, where a higher value indicates a better result.
While the invention has been described in terms of specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the equivalent or similar purpose, unless expressly stated otherwise; all of the features disclosed, or all of the steps in a method or process, except for mutually exclusive features and/or steps, may be combined in any manner.

Claims (7)

1. A document visual language reasoning method based on layout perception prompt is characterized by comprising the following steps:
step 1: preprocessing data;
acquiring a data set, performing preprocessing operation on a document image in the data set through an optical character recognition algorithm to improve the accuracy and efficiency of subsequent processing, extracting character characteristic information from the document image for character recognition, wherein the character recognition is to match the characteristic information with a trained optical character recognition model to determine characters in the document image, and finally outputting a recognition result into a text format which can be edited and processed by a computer, so as to extract text information of data in the data set, layout information and information about questions, question numbers and answers in the data set, and sort the information into data texts of corresponding JSON format files, and dividing the data texts into a retrieval data set and a test data set according to a preset proportion;
step 2: selecting an example sample, wherein the example sample is used for helping a large language model to understand a data format and a question-answer form required by a task;
the method comprises the steps of forming a set A of problems of all search samples in a search data set, extracting problems of any one test sample in the test data set, searching in the set A through an all-mpnet-base-v2 model in a content-converters algorithm, and respectively calculating semantic cosine similarity of the problems of the test sample and each problem in the set A so as to search out the problem of the search sample with the highest semantic similarity to the problems of the test sample, wherein the search sample corresponding to the problem of the search sample is used as an example sample in a prompt design, and the all-mpnet-base-v2 model maps sentences and paragraphs to a 768-dimensional dense vector space;
step 3: designing a layout perception prompt;
the prompt means that a specific text or language prompt is used for guiding a large language model to generate specific output, the data of the example sample and the test sample obtained in the step 2 are arranged in the prompt, the prompt is designed into a data flow format of a prompt head, a context sample and a test sample, and the text information and the layout information extracted in the step 1 are designed into a mapping relation { text: corresponding OCR box }, and are used as the data formats of the context sample and the test sample;
step 4: and transmitting the designed prompt to a large language model to perform a small sample document visual question-answer reasoning task, and generating an answer.
2. The method for visual language reasoning of a document based on layout-aware cues as defined in claim 1, wherein the preprocessing operation comprises denoising, binarization, rotation correction, tilt correction.
3. The method for visual language reasoning of a document based on layout-aware cues as claimed in claim 2, wherein the characteristic information of the character comprises outline, edge, projection.
4. A method of visual language reasoning for documents based on layout aware cues as claimed in claim 3, characterised in that the text formats that the computer can edit and process include TXT, JSON.
5. The method for visual language reasoning of a document based on layout awareness cues as defined in claim 4, in which the context sample comprises a context layout awareness cue, a context sample question, and a context sample answer; the test sample comprises a test sample layout perception prompt and a test sample problem.
6. The method for visual language reasoning of a document based on layout-aware cues as defined in claim 5, in which the large language model selects GPT-3.
7. The method for visual language reasoning about documents based on layout aware cues as defined in claim 6, wherein the dataset is DocVQA, infographicVQA or visual mrc.
CN202310817907.6A 2023-07-05 2023-07-05 Document visual language reasoning method based on layout perception prompt Pending CN116822634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310817907.6A CN116822634A (en) 2023-07-05 2023-07-05 Document visual language reasoning method based on layout perception prompt

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310817907.6A CN116822634A (en) 2023-07-05 2023-07-05 Document visual language reasoning method based on layout perception prompt

Publications (1)

Publication Number Publication Date
CN116822634A true CN116822634A (en) 2023-09-29

Family

ID=88116502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310817907.6A Pending CN116822634A (en) 2023-07-05 2023-07-05 Document visual language reasoning method based on layout perception prompt

Country Status (1)

Country Link
CN (1) CN116822634A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827638A (en) * 2023-11-16 2024-04-05 中国人民银行数字货币研究所 Test data generation method and device
CN117573839A (en) * 2024-01-12 2024-02-20 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117573839B (en) * 2024-01-12 2024-04-19 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
Coquenet et al. Dan: a segmentation-free document attention network for handwritten document recognition
Singh et al. Full page handwriting recognition via image to sequence extraction
AU2020279921B2 (en) Representative document hierarchy generation
Tito et al. Hierarchical multimodal transformers for multipage docvqa
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN116822634A (en) Document visual language reasoning method based on layout perception prompt
Boillet et al. Robust text line detection in historical documents: learning and evaluation methods
Almutairi et al. Instance segmentation of newspaper elements using mask R-CNN
Cheng et al. M6doc: A large-scale multi-format, multi-type, multi-layout, multi-language, multi-annotation category dataset for modern document layout analysis
US20240249545A1 (en) Visual Structure of Documents in Question Answering
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN118035416A (en) Method and system for streaming question-answer map
CN113569112A (en) Tutoring strategy providing method, system, device and medium based on question
Al Ghamdi A novel approach to printed Arabic optical character recognition
CN113673294B (en) Method, device, computer equipment and storage medium for extracting document key information
Kang et al. Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism
Nguyen et al. Handwriting recognition and automatic scoring for descriptive answers in Japanese language tests
CN113362026A (en) Text processing method and device
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
CN114332476B (en) Method, device, electronic equipment, storage medium and product for recognizing wiki
CN114579796B (en) Machine reading understanding method and device
Islam et al. Line extraction in handwritten documents via instance segmentation
CN116030469A (en) Processing method, processing device, processing equipment and computer readable storage medium
Desai et al. A Survey On Automatic Subjective Answer Evaluation
JP2006309347A (en) Method, system, and program for extracting keyword from object document

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination