CN116822634A - Document visual language reasoning method based on layout perception prompt - Google Patents

Document visual language reasoning method based on layout perception prompt Download PDF

Info

Publication number
CN116822634A
Authority
CN
China
Prior art keywords
layout
document
sample
visual
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310817907.6A
Other languages
Chinese (zh)
Inventor
姚俊豪
徐行
王磊
沈复民
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202310817907.6A priority Critical patent/CN116822634A/en
Publication of CN116822634A publication Critical patent/CN116822634A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/046 Forward inferencing; Production systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a document visual language reasoning method based on layout-aware prompting, which uses a large language model to perform reasoning over documents rich in visual information. The method integrates the text information of a document image with its visual information, introduces layout information through prompt learning, and guides the large language model to understand the relationship between the text and the visual content in a question, using this information to improve in-context learning when generating an answer. This enables a single-modality large language model to handle multi-modal document visual question-answering tasks, helps the large language model achieve strong results in few-shot learning, and generalizes across tests on 3 different document visual question-answering datasets.

Description

Document visual language reasoning method based on layout perception prompt
Technical Field
The invention belongs to the field of document understanding within cross-modal understanding and prompt learning, and particularly relates to a document visual language reasoning method based on layout-aware prompting. It studies the robustness of the existing large language model GPT-3 with respect to the training samples available under few-sample conditions, solves the document visual-language multi-modal reasoning task, and examines how to use in-context examples to learn the reasoning process.
Background
Documents are essential to humans because they have long been used to store historical knowledge and information. For this reason, a great deal of research effort has been devoted to improving machine understanding of documents. The research field of document analysis and recognition aims at automatically extracting information presented on paper, originally intended for human understanding. Visual question answering (VQA) is a multi-modal deep learning task that answers text-based questions about an image. A family of visual question-answering tasks has been defined for various application scenarios, including statistical charts, everyday photographs, and born-digital documents. Among these, document visual question answering (Document Visual Question Answering), which aims to extract information from documents and answer natural language questions, is particularly challenging. Given an input image and an associated natural language question, the document visual question-answering task is intended to provide a natural language answer. In recent years, document visual question answering has become an important topic spanning computer vision, natural language understanding, and artificial intelligence. Many pre-training techniques for document analysis and recognition have been proposed and shown to be effective on a variety of documents.
While these models have achieved encouraging results, research in this area has largely been limited to retrieving documents through the lexical content of recognized words rather than from a semantic viewpoint, ignoring the higher-level information that could be extracted from these collections. On the other hand, over the last few years visual question answering (VQA) has been one of the main tasks bridging vision and language.
To accurately identify key text fields, it is necessary to exploit the cross-modal nature of visually rich documents, in which text, visual and layout information should be jointly modeled and learned end-to-end within a single framework. Through the pre-training/fine-tuning paradigm, a pre-trained model absorbs cross-modal knowledge from different document types, in which local invariances of layout and style are preserved. However, when the model must be transferred to another domain with a different document format, it remains unclear whether only a few labeled samples are sufficient to reach state-of-the-art accuracy.
To answer the above question about robustly estimating model capability on document images with few samples, a prompt-learning method for document images, namely Layout-Aware Prompting, has been studied. The design of Layout-Aware Prompting adheres to the following criteria: (1) under few-sample conditions, limited training data can cause a large drop in the performance of pre-trained models; (2) a mapping relationship between text information and layout information is needed, since a large language model cannot otherwise understand the relationship between the text information and the layout information well; (3) the datasets employed should be parsable, easily extensible, and human readable.
In summary, the invention provides a document visual language reasoning method based on layout-aware prompting, which aims to solve the document visual-language multi-modal reasoning task and provides a learning method based on in-context examples for learning the reasoning process under few-sample conditions. The method can improve model performance on multi-modal reasoning tasks, has broad application prospects, and can be applied to fields such as natural language processing, computer vision, and machine learning. The prompt-learning method for few-sample document images is developed from the following three aspects: 1) a prompt-learning method named Layout-Aware Prompting is developed to guide the large language model GPT-3 in visual-language reasoning tasks; 2) the visual information of the document image is converted into text information by Layout-Aware Prompting, and a mapping relationship between the text information and the visual information is designed so that GPT-3 can better understand the information conveyed by the data, covering 3 document visual question-answering datasets; 3) pre-trained visual document understanding models are evaluated.
Disclosure of the invention
The invention aims to study a processing approach different from that of existing document pre-training models: guiding a large language model to solve the visual-language reasoning task through in-context learning with the help of prompt learning, and examining the robustness of the large language model under few-sample data conditions.
The invention relates to a document visual language reasoning method based on layout-aware prompting, named Layout-Aware Prompting, which can help a large language model reason over documents rich in visual information. Layout-Aware Prompting is a prompt that integrates the text information and the visual information of a document image, guides the large language model to understand the relationship between text and visual information in a question, and uses this information to improve in-context learning when generating an answer. The method helps the large language model surpass existing document pre-training models in few-shot learning, and generalizes across tests on 3 different document visual question-answering datasets.
The document visual question-answering datasets are DocVQA (document visual question answering), InfographicVQA (infographic visual question answering), and VisualMRC (visual machine reading comprehension), all of which concern answering questions about visual content. However, there are some important differences between them:
(1) DocVQA: DocVQA is a typical VQA-style task in which natural language questions are posed on a single-page document and answers must be produced by interpreting the document image. No list of predefined responses is given, so the problem cannot easily be treated as an n-way classification task. The focus is on answering questions about a document, such as a PDF file or a scanned text image. These questions are typically about the document content, e.g. "What is the name of the author?" or "What is the meaning of the second paragraph?". DocVQA systems typically use text recognition and layout analysis to extract relevant information from documents and answer questions. The original DocVQA data consists of 10,194/1,286/1,287 images containing 39,463/5,349/5,188 questions for training/validation/testing respectively;
(2) InfographicVQA: InfographicVQA is similar to DocVQA but focuses on answering questions about infographics, which are visual representations of information, data, or knowledge. These questions typically relate to the content of the infographic, such as "What percentage of people like apples but not oranges?". InfographicVQA systems typically use computer vision techniques to analyze the visual elements of the infographic and extract relevant information. The original InfographicVQA data consists of 4,406/500/579 images containing 23,946/2,801/3,288 questions for training/validation/testing respectively;
(3) VisualMRC: VisualMRC is a more general task that involves answering questions about any type of visual content, including photographs, images, and video. These questions may concern anything visible in the visual content, such as "What color is the car in the picture?" or "How many people are in the crowd?". VisualMRC systems typically use a combination of computer vision and natural language processing techniques to analyze visual content and answer questions. The original VisualMRC data consists of 9,574/956/2,237 images containing 21,015/2,839/6,708 questions for training/validation/testing respectively;
in summary, docVQA and infograpicvqa are more specific visual mrc types, focusing on answering questions about documents and information graphics, respectively, while visual mrc is a more general task that can be applied to any type of visual content.
The visual-language reasoning task of document visual question answering focuses on a specific type of visual question-answering task in which visual understanding of the information on the document image is necessary to provide an answer. This goes beyond merely transcribing document images with optical character recognition (OCR); it also includes understanding all types of information conveyed by the document. Document visual issues refer to the visual elements and design considerations involved in document design and layout. These issues may include the following:
(1) Layout and typesetting: this includes how content is organized in the document, which fonts and font sizes are chosen, how paragraphs and headings are arranged, and so on. Good layout and typesetting help readers understand and absorb the contents of a document more easily.
(2) Images and charts: images and charts in documents should support the text content and convey information clearly. The designer needs to consider how to select the most appropriate image and chart types and how to integrate them into the overall design of the document.
(3) Colors and fonts: colors and fonts affect the readability and legibility of the document. The designer needs to select colors and fonts appropriate to the document theme and target readership to ensure that the document is easy to read and understand.
(4) White space and spacing: appropriate white space and spacing help the content of the document to be presented better and make the document easier to read. The designer needs to consider how to balance text and white space, and how to use spacing to help the reader distinguish different paragraphs and sections.
(5) Headings and sections: good heading and section structures make it easier for readers to find the desired information and help them better understand the organization of the document. The designer needs to consider how to select the best heading and section structure and ensure that these elements are coordinated with the overall design of the document.
In summary, document visual issues are the various visual elements and design considerations that need to be taken into account during document design and layout. By considering these issues carefully, a designer can create a high-quality document that is easy to read and understand.
For large language models such as GPT-3, the in-context learning scenario can be seen as a conditional text generation problem. Specifically, the probability of generating the target text y is conditioned on the text fields associated with the target text in the input data, comprising a context C of k examples and the text x to be predicted. Thus, the predicted target text y corresponding to the input text x can be expressed as:
$$p(y \mid C, x)=\prod_{t=1}^{T} p_{LM}\left(y_{t} \mid C, x, y_{<t}\right)$$

where LM denotes the parameters of the language model, y_t (t = 1, 2, ..., T) is the t-th token of the target text y, which contains T tokens in total, and y_{<t} denotes the tokens predicted before the current token y_t; the full input string is organized in the (C, x, y) format, with the context C = {x_1, y_1, x_2, y_2, ..., x_k, y_k} being a string in which x_i, y_i (i = 1, 2, ..., k) are respectively the input and output of the i-th context example, with a text format similar to x and y. In GPT-3, C is created by concatenating k training instances and their corresponding texts.
A prompt refers to using specific text or language cues to guide the model to generate a specific output. In the invention, such a specific text or language cue is called an example and is used as input to the large language model. The example selection method adopts the all-mpnet-base-v2 model from the sentence-transformers library, which maps sentences and paragraphs into a 768-dimensional dense vector space; the retrieval sample whose question has the highest semantic cosine similarity to the test-sample question is selected as the example sample in the prompt design:
$$\operatorname{cosine\_sim}(A, B)=\frac{\operatorname{dot}(A, B)}{\lVert A\rVert\,\lVert B\rVert}$$

where A and B are the sentence vectors to be compared, corresponding to the test-sample question and a retrieval-sample question; cosine_sim(A, B) is the semantic cosine similarity of sentences A and B; dot(A, B) is the dot product of the sentence vectors A and B; and ‖A‖ and ‖B‖ are the Euclidean norms (lengths) of the vectors A and B. The difference in effect between selection by semantic similarity and random selection is compared in subsequent experiments.
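As an illustration of this example-selection step, the following sketch uses the sentence-transformers package and the all-mpnet-base-v2 checkpoint named above to embed questions and pick the retrieval question with the highest cosine similarity to a test question; the function and variable names are illustrative assumptions rather than part of the patent.

```python
# Minimal sketch of example selection by semantic cosine similarity, assuming
# the sentence-transformers package; names here are illustrative, not from the patent.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # maps sentences/paragraphs to 768-d vectors

def select_example(test_question, retrieval_questions):
    """Return the index of the retrieval-set question most similar to the test question."""
    q_emb = model.encode(test_question, convert_to_tensor=True)
    r_emb = model.encode(retrieval_questions, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, r_emb)  # cosine_sim(A, B) = dot(A, B) / (||A|| * ||B||)
    return int(scores.argmax())
```

The retrieval sample at the returned index would then serve as the in-context example in the prompt design described below.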
The document pre-training models include text-only pre-trained models: BERT and RoBERTa; a text-and-layout model: LiLT; and text, layout, and image pre-trained models: LayoutLM, LayoutLMv2, LayoutLMv3, and ERNIE-Layout. The benchmark experiments evaluate the robustness of pre-trained visual document understanding models when fine-tuned on the downstream task, from the full-sample setting down to as few as one sample.
The invention provides a document visual language reasoning method based on layout-aware prompting, which studies the vulnerability of the existing large language model GPT-3 with respect to the training samples available under few-sample conditions, and solves the document visual-language multi-modal reasoning task (step 4) by using in-context examples to learn the reasoning process. The method comprises the following steps:
step 1: data preprocessing, namely selecting three data sets to perform experiments, namely DocVQA, infographicVQA and visual MRC, performing preprocessing operations, such as denoising, binarization, rotation correction, inclination correction and the like, on a document image in the data sets through an Optical Character Recognition (OCR) algorithm on any one of the 3 data sets to improve the accuracy and efficiency of subsequent processing, extracting characteristic information, such as outline, edge, projection and the like, of characters from the document image, and performing character recognition, wherein the character recognition is to match the characteristic information with a trained Optical Character Recognition (OCR) model to determine characters in the document image, and finally outputting recognition results into text formats which can be edited and processed by a computer, such as TXT, JSON and the like. The method comprises the steps of executing the operations on all 3 data sets, respectively extracting text information in data in the three data sets, sorting layout information and information about questions, question numbers, answers and the like in the data sets into data texts of corresponding JSON format files, and dividing the data texts into a retrieval data set and a test data set according to preset proportion;
step 2: an example sample is chosen that is used to help the large language model understand the data format and question-and-answer form required for the task. The method comprises the steps of forming a set A by the aid of the data text of a JSON format file obtained in the step 1, extracting the problem of any one test sample in a test data set by the aid of the problems of all search samples in the search data set, searching in the set A by the aid of an all-mpnet-base-v2 model in a sense-transformers algorithm, and respectively calculating the semantic cosine similarity of the problem of the test sample and each problem in the set A, so that the problem of a search sample with the highest semantic similarity to the problem of the test sample is searched, wherein the search sample corresponding to the problem of the search sample is used as an example sample in prompt design, and is used for prompt design in the step 3, and the all-mpnet-base-v2 model maps sentences and paragraphs to 768-dimensional dense vector space;
step 3: a prompt (prompt) is designed to guide the model to generate a particular output using a particular text or language prompt. There are 3 kinds of cues: plain text cues, text and layout discrete cues, and layout aware cues. (1) Plain text prompts, i.e. only text information of the document data has no layout information; (2) The text and layout discrete prompt is a format in which the text and the layout are respectively added into the prompt and the prompt header is designed to inform the model of text data and layout data, and a simple corresponding relation between the text and the layout data; (3) The layout-aware hint is to sort the data of the sample and the test sample obtained in the step 2 into a hint, and design the hint into a data stream format (hint header, context sample, test sample), wherein the hint header (prompting head) is used for informing the GPT-3 context sample (in-context demonstration) and the test sample (testing demonstration) how the data format is, and the hint answers the question according to the text information and the layout information, and the specific format is "the data format is { text: boxes } (i.e., { text: corresponding OCR boxes }), wherein the OCR box corresponding to each text is defined by four coordinates: [ x1, y1, x2, y2]. Where x1 and y1 refer to The abscissa of The upper left corner of The OCR frame, x2 and y2 refer to The abscissa of The lower right corner of The OCR frame, respectively, x1, y1, x2, y2 are used to represent The position of The OCR frame in The document, please answer questions according to The data format described above, i.e., "The data form is { text: boxes }, where each boxes is defined by four coordinates: [ [ x1, y1, x2, y2] ]. The x1 and y1 refer to The horizontal and vertical coordinates of The upper-left corner of The OCR boxes, and The x2 and y2 refer to The horizontal and vertical coordinates of The lower-right corner of The OCR boxes which indicate The position of The OCR boxes within The document, please answer The question according to The above data form ]. The text information and the layout information extracted in the step 1 are designed into a mapping relation { text: corresponding OCR boxes }, namely { text: boxes }, which is used as a data format of a context sample and a test sample, and the format is called as a 'layout perception prompt'. The context sample is that in step 2, a large language model, such as GPT-3, is guided according to the example data selected by the closest similarity to the problem semantics of the test sample, and the understanding is that the question-answer reasoning task data patterns are to be processed, and the context sample comprises a context layout perception prompt, a context sample problem and a context sample answer; the test sample comprises a test sample layout perception prompt and a test sample problem. Finally, comparing test results of three prompting modes, namely a plain text prompting mode, a text and layout discrete prompting mode and a layout perception prompting mode through experiments;
step 4: the designed hints are passed to GPT-3 for a few sample document visual question-answer reasoning task, using the average normalized Levenstein distance (Average Normalized Levenshtein distance) criterion to evaluate the accuracy of generating the answer:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i \neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$

where lev_{a,b}(i, j) denotes the Levenshtein distance between the first i characters of string a and the first j characters of string b; 1_{(a_i ≠ b_j)} is an indicator function that equals 0 when a_i = b_j and 1 otherwise; a_i denotes the i-th character of string a and b_j the j-th character of string b. Within the min operation, the first term lev_{a,b}(i-1, j)+1 corresponds to deleting a character from string a to reach b; the second term lev_{a,b}(i, j-1)+1 corresponds to inserting a character; and the third term lev_{a,b}(i-1, j-1)+1_{(a_i ≠ b_j)} corresponds to substituting a character (depending on whether the current characters are identical); min and max denote the operations of taking the minimum and maximum values, respectively. In linguistics, the Levenshtein distance is used as a measure for quantifying the distance between texts, i.e., the difference between two texts. It relates to mutual intelligibility: the higher the text distance, the lower the mutual intelligibility, and the lower the text distance, the higher the mutual intelligibility. Randomly selected example samples are compared with semantically similar ones in subsequent experiments;
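For illustration, the sketch below computes the Levenshtein recurrence above and an averaged normalized score over predicted and reference answers; the 0.5 threshold and the single-reference simplification are common ANLS conventions assumed here, not stated in the patent.

```python
# Minimal sketch of the Levenshtein distance recurrence and an averaged
# normalized score; the 0.5 threshold and one reference answer per question
# are assumptions made only for illustration.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))                     # lev(0, j) = j
    for i, ca in enumerate(a, start=1):
        cur = [i]                                      # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # substitution (indicator term)
        prev = cur
    return prev[-1]

def average_normalized_levenshtein(predictions, answers, threshold=0.5):
    scores = []
    for pred, ans in zip(predictions, answers):
        nl = levenshtein(pred.lower(), ans.lower()) / max(len(pred), len(ans), 1)
        sim = 1.0 - nl                                 # normalized similarity in [0, 1]
        scores.append(sim if sim >= threshold else 0.0)
    return sum(scores) / len(scores)
```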
step 5: the robustness of the different models with few samples (few-shot) was investigated. Different document pre-training models are adopted, wherein the document pre-training models comprise text pre-training models: BERT and RoBEATA; text and layout model LiLT; text, layout, and image model pre-training models: layoutLM, layoutLMv2, layoutlmv3, ERNIELayout. The parameters set for the 3 different sets of document visual question-answer data are the same: setting the learning rate to 2×e -5 The training epoch was set to 40 and all input images were 224x 224 pixels in resolution. The batch in training during full sample training is set to 4, the batch in testing is set to 1, the batch in training during few samples training is set to 1, the batch in testing is set to 1, and the reasoning effect of the models under the condition of few samples is compared. Furthermore, cases of different numbers of example samples in the case of a small number of samples, and different numbers of problems in the different example samples were explored.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) Existing multi-modal models solve the document visual question-answering task in the pre-training/fine-tuning paradigm, but this approach is quite time-consuming and places high demands on machine configuration. To address these problems, a large language model is guided to perform the reasoning task through prompt learning, which is simple and convenient: a task can generally be evaluated in 3-6 hours, whereas the pre-training/fine-tuning paradigm usually needs several A100 GPUs and several hours to complete the evaluation, and the reasoning accuracy on specific tasks is higher than that of the best pre-training/fine-tuning model;
(2) Layout information is introduced through prompt learning to guide the large language model to understand the relationship between the text and the visual content in the question, and this information is used to improve in-context learning when generating an answer, so that a single-modality large language model can also handle the multi-modal document visual question-answering task, with higher answer accuracy than ordinary prompting;
(3) Multi-modal models are evaluated and compared with Layout-Aware Prompting under few-sample conditions on different document visual question-answering datasets; the proposed way of guiding a large language model through prompt learning, together with the demonstration study and in-depth analysis, is expected to benefit future research and improve model accuracy in answering document visual reasoning tasks.
Drawings
FIG. 1 is a sample view of a document image;
FIG. 2 is a flow chart of a method of implementing the present invention;
FIG. 3 is a schematic diagram of a design hint learning method of the present invention;
FIG. 4 is a schematic diagram of the example selection method of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the embodiments and the accompanying drawings, so that those skilled in the relevant art can better understand the present invention. It should be particularly noted that the described embodiments are some, but not all embodiments of the invention and are not intended to limit the scope of the invention as claimed. All other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present invention.
Considering that existing fine-tuned document pre-training models solve the document visual question-answering task in the pre-training/fine-tuning paradigm, which is quite time-consuming, places high demands on machine configuration, and performs poorly on few-shot data, the invention provides a document visual language reasoning method based on layout-aware prompting, which relies on strong in-context learning capability to guide a large language model to understand the information conveyed by the data and to solve the document visual-language reasoning task. The guided large language model is able to understand the relationship between text and visual content in the question and to use this information to improve in-context learning when generating answers. The method helps the large language model achieve good results in few-shot learning and generalizes across tests on 3 different document visual question-answering datasets.
The document visual question-answering datasets are DocVQA (document visual question answering), InfographicVQA (infographic visual question answering), and VisualMRC (visual machine reading comprehension), all of which concern answering questions about visual content. However, there are some important differences between them:
(1) DocVQA: DocVQA is a typical VQA-style task in which natural language questions are posed on a single-page document and answers must be produced by interpreting the document image. No list of predefined responses is given, so the problem cannot easily be treated as an n-way classification task. The focus is on answering questions about a document, such as a PDF file or a scanned text image. These questions are typically about the document content, e.g. "What is the name of the author?" or "What is the meaning of the second paragraph?". DocVQA systems typically use text recognition and layout analysis to extract relevant information from documents and answer questions. The original DocVQA data consists of 10,194/1,286/1,287 images containing 39,463/5,349/5,188 questions for training/validation/testing respectively;
(2) InfographicVQA: InfographicVQA is similar to DocVQA but focuses on answering questions about infographics, which are visual representations of information, data, or knowledge. These questions typically relate to the content of the infographic, such as "What percentage of people like apples but not oranges?". InfographicVQA systems typically use computer vision techniques to analyze the visual elements of the infographic and extract relevant information. The original InfographicVQA data consists of 4,406/500/579 images containing 23,946/2,801/3,288 questions for training/validation/testing respectively;
(3) VisualMRC: VisualMRC is a more general task that involves answering questions about any type of visual content, including photographs, images, and video. These questions may concern anything visible in the visual content, such as "What color is the car in the picture?" or "How many people are in the crowd?". VisualMRC systems typically use a combination of computer vision and natural language processing techniques to analyze visual content and answer questions. The original VisualMRC data consists of 9,574/956/2,237 images containing 21,015/2,839/6,708 questions for training/validation/testing respectively;
in summary, docVQA and infograpicvqa are more specific visual mrc types, focusing on answering questions about documents and information graphics, respectively, while visual mrc is a more general task that can be applied to any type of visual content.
The visual-language reasoning task of document visual question answering (DocVQA) focuses on a specific type of visual question-answering task in which visual understanding of the information on the document image is necessary to provide an answer. This goes beyond merely transcribing document images with OCR; it also includes understanding all types of information conveyed by the document: text content (handwritten or typed), non-text elements (labels, checkmarks, separators, charts), layout (page structure, forms, tables), style (fonts, colors, highlighting), and so on, all of which may be needed when answering the question at hand. Document visual issues refer to the visual elements and design considerations involved in document design and layout. These issues may include the following:
(1) Layout and typesetting: this includes how content is organized in the document, which fonts and font sizes are chosen, how paragraphs and headings are arranged, and so on. Good layout and typesetting help readers understand and absorb the contents of a document more easily.
(2) Images and charts: images and charts in documents should support the text content and convey information clearly. The designer needs to consider how to select the most appropriate image and chart types and how to integrate them into the overall design of the document.
(3) Colors and fonts: colors and fonts affect the readability and legibility of the document. The designer needs to select colors and fonts appropriate to the document theme and target readership to ensure that the document is easy to read and understand.
(4) White space and spacing: appropriate white space and spacing help the content of the document to be presented better and make the document easier to read. The designer needs to consider how to balance text and white space, and how to use spacing to help the reader distinguish different paragraphs and sections.
(5) Headings and sections: good heading and section structures make it easier for readers to find the desired information and help them better understand the organization of the document. The designer needs to consider how to select the best heading and section structure and ensure that these elements are coordinated with the overall design of the document.
In summary, document visual issues are the various visual elements and design considerations that need to be taken into account during document design and layout. By considering these issues carefully, a designer can create a high-quality document that is easy to read and understand.
For large language models such as GPT-3, the in-context learning scenario can be seen as a conditional text generation problem. Specifically, the probability of generating the target text y is conditioned on the text fields associated with the target text in the input data, comprising a context C of k examples and the text x to be predicted. Thus, the predicted target text y corresponding to the input text x can be expressed as:
$$p(y \mid C, x)=\prod_{t=1}^{T} p_{LM}\left(y_{t} \mid C, x, y_{<t}\right)$$

where LM denotes the parameters of the language model, y_t (t = 1, 2, ..., T) is the t-th token of the target text y, which contains T tokens in total, and y_{<t} denotes the tokens predicted before the current token y_t; the full input string is organized in the (C, x, y) format, with the context C = {x_1, y_1, x_2, y_2, ..., x_k, y_k} being a string in which x_i, y_i (i = 1, 2, ..., k) are respectively the input and output of the i-th context example, with a text format similar to x and y. In GPT-3, C is created by concatenating k training instances and their corresponding texts.
A prompt refers to using specific text or language cues to guide the model to generate a specific output. In the invention, such a specific text or language cue is called an example and is used as input to the large language model. The example selection method adopts the all-mpnet-base-v2 model from the sentence-transformers library, which maps sentences and paragraphs into a 768-dimensional dense vector space; the retrieval sample whose question has the highest semantic cosine similarity to the test-sample question is selected as the example sample in the prompt design:
$$\operatorname{cosine\_sim}(A, B)=\frac{\operatorname{dot}(A, B)}{\lVert A\rVert\,\lVert B\rVert}$$

where A and B are the sentence vectors to be compared, corresponding to the test-sample question and a retrieval-sample question; cosine_sim(A, B) is the semantic cosine similarity of sentences A and B; dot(A, B) is the dot product of the sentence vectors A and B; and ‖A‖ and ‖B‖ are the Euclidean norms (lengths) of the vectors A and B. The difference in effect between selection by semantic similarity and random selection is compared in the subsequent ablation experiments.
The document pre-training models include text-only pre-trained models: BERT and RoBERTa; a text-and-layout model: LiLT; and text, layout, and image pre-trained models: LayoutLM, LayoutLMv2, LayoutLMv3, and ERNIE-Layout. The benchmark experiments evaluate the robustness of pre-trained visual document understanding models when fine-tuned on the downstream task, from the full-sample setting down to as few as one sample.
Step 1: data preprocessing. Three datasets are selected for experiments, namely DocVQA, InfographicVQA and VisualMRC. For any one of the 3 datasets, preprocessing operations such as denoising, binarization, rotation correction and skew correction are applied to the document images as part of an optical character recognition (OCR) pipeline to improve the accuracy and efficiency of subsequent processing; character feature information such as contours, edges and projections is extracted from the document images, and character recognition is performed, in which the feature information is matched against a trained OCR model to determine the characters in the image; finally, the recognition results are output in text formats that a computer can edit and process, such as TXT and JSON. Through these operations, the text information in the three datasets is extracted, and the layout information together with the questions, question numbers, answers and other information in the datasets is organized into the data text of corresponding JSON-format files; the data text obtained from each dataset is divided into a retrieval dataset and a test dataset;
step 2: an example sample (sample) is chosen that is used to help the large language model understand the data format and question-answer form required for the task. And (3) forming a set A by the data text of the JSON format file obtained in the step (1) from the problems of all the search samples in the search data set, extracting the problem of any one test sample in the test data set, searching in the set A by an all-mpnet-base-v2 model in a sense-transformers algorithm, and respectively calculating the semantic cosine similarity of the problem of the test sample and each problem in the set A so as to search the problem of the search sample with the highest semantic similarity with the problem of the test sample, wherein the search sample corresponding to the problem of the search sample is used as an example sample in the prompt design for the prompt design of the step (3), and the all-mpnet-base-v2 model maps sentences and paragraphs to a 768-dimensional dense vector space. In a different alternative, the effect exhibited by Layout-Aware sampling is also affected, and table 1 shows that the use of semantic similar sampling examples in DocVQA is better than the random sampling examples, where ANLS ∈represents the average normalized levenstein distance, the higher the text distance, the lower the mutual intelligibility, the lower the text distance, and the higher the mutual intelligibility, so the better the ANLS value;
table 1 test results for different alternatives
Step 3: a prompt is designed, i.e., specific text or language cues are used to guide the model to generate a specific output. There are 3 kinds of prompts: the plain-text prompt, the text-and-layout discrete prompt, and the layout-aware prompt. (1) The plain-text prompt contains only the text information of the document data, without layout information; (2) the text-and-layout discrete prompt adds the text and the layout to the prompt separately, with a prompt head designed to tell the model which data are text and which are layout data, together with a simple correspondence between them; (3) the layout-aware prompt organizes the data of the example sample and the test sample obtained in step 2 into a prompt designed in the data-stream format (prompt head, context sample, test sample). The prompt head (prompting head) tells GPT-3 what the data format of the context sample (in-context demonstration) and the test sample (testing demonstration) is, and asks it to answer the question according to the text information and the layout information. The specific format states that the data form is {text: boxes} (i.e., {text: corresponding OCR box}), where each box is defined by four coordinates [x1, y1, x2, y2], with x1 and y1 the horizontal and vertical coordinates of the upper-left corner of the OCR box and x2 and y2 those of the lower-right corner, indicating the position of the OCR box within the document, and the question should be answered according to this data format; in English: "The data form is {text: boxes}, where each box is defined by four coordinates: [x1, y1, x2, y2]. The x1 and y1 refer to the horizontal and vertical coordinates of the upper-left corner of the OCR boxes, and the x2 and y2 refer to the horizontal and vertical coordinates of the lower-right corner of the OCR boxes, which indicate the position of the OCR boxes within the document. Please answer the question according to the above data form." Then, the text information and the layout information extracted in step 1 are arranged into the mapping relation {text: corresponding OCR boxes}, i.e., {text: boxes}, which serves as the data format of the context sample and the test sample; this format is called the "layout-aware prompt". The context sample is the example data selected in step 2 as semantically closest to the test-sample question, and it guides GPT-3 to understand what kind of question-answer reasoning data it is expected to process; it comprises a context layout-aware prompt, a context sample question and a context sample answer. The test sample comprises a test-sample layout-aware prompt and a test-sample question. Table 2 compares the effects with and without designing the text information and the layout information into a mapping relation, showing that the mapping between text and layout information guides GPT-3 better; it reports the effect of the mapping with and without layout information and the average accuracy of the model on different datasets with different numbers of samples for both cases.
The results show that omitting layout information gives poor results, and that designing the text information and the layout information into a mapping relation guides GPT-3 better. Specifically, compared with the case where no mapping relation is designed, designing the mapping relation can significantly improve the accuracy of the model when answering questions by reasoning. This shows that, by designing the text information and the layout information into a mapping relation, the layout information can be better used to guide the model's predictions, thereby improving the accuracy of the model's reasoning and question answering.
Table 2 Test results for prompts with different sample formats
Note: w/o boxes indicates no position information; w/ boxes indicates that position information is included; split indicates that the text information and the position information are given separately; mapping indicates that the text information is mapped to the position information; ANLS↑ denotes the average normalized Levenshtein score, where a higher value indicates a better result.
Step 4: the designed prompt is passed to GPT-3 to perform the few-sample document visual question-answer reasoning task, and the average normalized Levenshtein distance (Average Normalized Levenshtein distance) criterion is used to evaluate the accuracy of the generated answers:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i \neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$

where lev_{a,b}(i, j) denotes the Levenshtein distance between the first i characters of string a and the first j characters of string b; 1_{(a_i ≠ b_j)} is an indicator function that equals 0 when a_i = b_j and 1 otherwise; a_i denotes the i-th character of string a and b_j the j-th character of string b. Within the min operation, the first term lev_{a,b}(i-1, j)+1 corresponds to deleting a character from string a to reach b; the second term lev_{a,b}(i, j-1)+1 corresponds to inserting a character; and the third term lev_{a,b}(i-1, j-1)+1_{(a_i ≠ b_j)} corresponds to substituting a character (depending on whether the current characters are identical); min and max denote the minimum and maximum operations, respectively. In linguistics, the Levenshtein distance is used as a measure for quantifying the distance between texts, i.e., the difference between two texts. It relates to mutual intelligibility: the higher the text distance, the lower the mutual intelligibility, and the lower the text distance, the higher the mutual intelligibility. The number of example samples and the number of questions within a sample affect the final result, but more is not always better; Table 3 shows the average accuracy of the method of the invention with different numbers of example samples and different numbers of questions per sample. The results show that on the DocVQA dataset, the best case is 1 text passage with 4 questions in the 1-shot setting. This may be because the questions in the example sample and the test sample are semantically similar, illustrating that step 2 plays a significant role in designing the in-context learning example. Because step 2 is designed to relate the context information of the example sample closely to the question, the model can infer and generalize better. Therefore, the results show that in few-sample learning, a reasonable design of the context information of the example sample can significantly improve the model's few-shot reasoning, thereby improving the accuracy of question answering;
TABLE 3 test results for different sample numbers and number of questions in a sample on a document visual question-answer dataset
Step 5: the few-shot robustness of different models is investigated. Different document pre-training models are adopted, including text-only pre-trained models: BERT and RoBERTa; a text-and-layout model: LiLT; and text, layout, and image pre-trained models: LayoutLM, LayoutLMv2, LayoutLMv3, and ERNIE-Layout. The few-shot reasoning performance of these models is compared. For the document visual question-answering task, the parameters set for the 3 different datasets are the same: the learning rate is set to 2e-5, the number of training epochs is set to 40, and all input images have a resolution of 224x224 pixels. The training batch size is set to 4 and the test batch size to 1 for full-sample training, while both are set to 1 for few-sample training. Table 4 compares the reasoning performance of the different models with full samples and with few samples. The results show that the performance of existing document pre-training models differs greatly between the full-sample and few-sample settings, indicating that they generalize poorly with few samples. In contrast, Layout-Aware Prompting ("Ours" in Table 4) performs better under few-sample conditions, although there remains a gap to existing document pre-training models under full-sample conditions. This may be because a single-modality large language model does not yet perceive vision as well as the vision modules provided in existing document pre-training models. These results therefore indicate that existing document pre-training models may have certain limitations when dealing with small-sample data, and that Layout-Aware Prompting can be an effective solution, though further improvements are still needed to increase its effectiveness.
Table 4 benchmark experimental results
Note: Ours denotes the method of the invention; full-sample denotes the full-sample setting; few-shot denotes the few-sample setting; ANLS↑ denotes the average normalized Levenshtein score, where a higher value indicates a better result.
While the invention has been described in terms of specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the equivalent or similar purpose, unless expressly stated otherwise; all of the features disclosed, or all of the steps in a method or process, except for mutually exclusive features and/or steps, may be combined in any manner.

Claims (7)

1. A document visual language reasoning method based on layout perception prompt is characterized by comprising the following steps:
step 1: preprocessing data;
acquiring a data set, performing preprocessing operation on a document image in the data set through an optical character recognition algorithm to improve the accuracy and efficiency of subsequent processing, extracting character characteristic information from the document image for character recognition, wherein the character recognition is to match the characteristic information with a trained optical character recognition model to determine characters in the document image, and finally outputting a recognition result into a text format which can be edited and processed by a computer, so as to extract text information of data in the data set, layout information and information about questions, question numbers and answers in the data set, and sort the information into data texts of corresponding JSON format files, and dividing the data texts into a retrieval data set and a test data set according to a preset proportion;
step 2: selecting an example sample, wherein the example sample is used for helping a large language model to understand a data format and a question-answer form required by a task;
the method comprises the steps of forming a set A of problems of all search samples in a search data set, extracting problems of any one test sample in the test data set, searching in the set A through an all-mpnet-base-v2 model in a content-converters algorithm, and respectively calculating semantic cosine similarity of the problems of the test sample and each problem in the set A so as to search out the problem of the search sample with the highest semantic similarity to the problems of the test sample, wherein the search sample corresponding to the problem of the search sample is used as an example sample in a prompt design, and the all-mpnet-base-v2 model maps sentences and paragraphs to a 768-dimensional dense vector space;
step 3: designing a layout perception prompt;
the prompt means that a specific text or language prompt is used for guiding a large language model to generate specific output, the data of the example sample and the test sample obtained in the step 2 are arranged in the prompt, the prompt is designed into a data flow format of a prompt head, a context sample and a test sample, and the text information and the layout information extracted in the step 1 are designed into a mapping relation { text: corresponding OCR box }, and are used as the data formats of the context sample and the test sample;
step 4: and transmitting the designed prompt to a large language model to perform a small sample document visual question-answer reasoning task, and generating an answer.
2. The method for visual language reasoning of a document based on layout-aware cues as defined in claim 1, wherein the preprocessing operation comprises denoising, binarization, rotation correction, tilt correction.
3. The method for visual language reasoning of a document based on layout-aware cues as claimed in claim 2, wherein the characteristic information of the character comprises outline, edge, projection.
4. A method of visual language reasoning for documents based on layout aware cues as claimed in claim 3, characterised in that the text formats that the computer can edit and process include TXT, JSON.
5. The method for visual language reasoning of a document based on layout awareness cues as defined in claim 4, in which the context sample comprises a context layout awareness cue, a context sample question, and a context sample answer; the test sample comprises a test sample layout perception prompt and a test sample problem.
6. The method for visual language reasoning of a document based on layout-aware cues as defined in claim 5, in which the large language model selects GPT-3.
7. The method for visual language reasoning about documents based on layout aware cues as defined in claim 6, wherein the dataset is DocVQA, infographicVQA or visual mrc.
CN202310817907.6A 2023-07-05 2023-07-05 Document visual language reasoning method based on layout perception prompt Pending CN116822634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310817907.6A CN116822634A (en) 2023-07-05 2023-07-05 Document visual language reasoning method based on layout perception prompt

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310817907.6A CN116822634A (en) 2023-07-05 2023-07-05 Document visual language reasoning method based on layout perception prompt

Publications (1)

Publication Number Publication Date
CN116822634A true CN116822634A (en) 2023-09-29

Family

ID=88116502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310817907.6A Pending CN116822634A (en) 2023-07-05 2023-07-05 Document visual language reasoning method based on layout perception prompt

Country Status (1)

Country Link
CN (1) CN116822634A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827638A (en) * 2023-11-16 2024-04-05 中国人民银行数字货币研究所 Test data generation method and device
CN117573839A (en) * 2024-01-12 2024-02-20 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117573839B (en) * 2024-01-12 2024-04-19 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
Coquenet et al. Dan: a segmentation-free document attention network for handwritten document recognition
Singh et al. Full page handwriting recognition via image to sequence extraction
AU2020279921B2 (en) Representative document hierarchy generation
Tito et al. Hierarchical multimodal transformers for multipage docvqa
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN116822634A (en) Document visual language reasoning method based on layout perception prompt
Boillet et al. Robust text line detection in historical documents: learning and evaluation methods
Almutairi et al. Instance segmentation of newspaper elements using mask R-CNN
Cheng et al. M6doc: A large-scale multi-format, multi-type, multi-layout, multi-language, multi-annotation category dataset for modern document layout analysis
US20240249545A1 (en) Visual Structure of Documents in Question Answering
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN118035416A (en) Method and system for streaming question-answer map
CN113569112A (en) Tutoring strategy providing method, system, device and medium based on question
Al Ghamdi A novel approach to printed Arabic optical character recognition
CN113673294B (en) Method, device, computer equipment and storage medium for extracting document key information
Kang et al. Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism
Nguyen et al. Handwriting recognition and automatic scoring for descriptive answers in Japanese language tests
CN113362026A (en) Text processing method and device
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
CN114332476B (en) Method, device, electronic equipment, storage medium and product for recognizing wiki
CN114579796B (en) Machine reading understanding method and device
Islam et al. Line extraction in handwritten documents via instance segmentation
CN116030469A (en) Processing method, processing device, processing equipment and computer readable storage medium
Desai et al. A Survey On Automatic Subjective Answer Evaluation
JP2006309347A (en) Method, system, and program for extracting keyword from object document

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination