CN117396899A - System and method for extracting fields from unlabeled data - Google Patents
- Publication number
- CN117396899A
- Application number
- CN202280036060.1A
- Authority
- CN
- China
- Prior art keywords
- field
- PLE
- key
- words
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2178—Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor
- G06F18/2185—Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor the supervisor being an automated module, e.g. intelligent oracle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Abstract
Embodiments describe a field extraction system that does not require field-level annotations for training. Specifically, the training process is bootstrapped by mining pseudo-labels from unlabeled forms using simple rules. A transformer-based structure is then used to model the interactions between the text tokens in an input form and to predict a field tag for each token accordingly. The pseudo-labels are used to supervise the transformer training. Since the pseudo-labels are noisy, a refinement module containing a sequence of branches is used to refine them. Each refinement branch performs field tagging and generates refined labels. At each stage, a branch is optimized by ensembling the labels from all previous branches to reduce label noise.
Description
Cross Reference
This application claims priority to U.S. Non-Provisional Application Nos. 17/484,618 and 17/484,623, both filed on September 24, 2021, each of which is a non-provisional of U.S. Provisional Application No. 63/189,579, filed on May 17, 2021, and claims priority thereto under 35 U.S.C. § 119.
All of the above applications are expressly incorporated herein by reference in their entirety.
Technical Field
Embodiments relate generally to machine learning systems and computer vision, and more specifically to mechanisms for extracting fields from forms with unlabeled data.
Background
Form-like documents, such as bills, pay stubs, and patient referral forms, are commonly used in daily business workflows. Extracting fields from various forms is often a challenging task. For example, even for the same form type, the document layout and text representation can differ when the forms are issued by different vendors: bills from different companies may have significantly different designs, pay stubs from different systems (e.g., ADP and Workday) may represent similar information with different text, and so on. Traditionally, extracting information from such form documents requires substantial human labor. For example, workers are typically given a list of expected form fields, e.g., purchase order, bill number, and total amount, and extract the corresponding values based on their understanding of the form.
Therefore, there is a need for an efficient system for extracting information from form documents.
Brief Description of the Drawings
Figure 1 is a simplified diagram illustrating an example of extracting fields from a bill, according to one embodiment described herein.
Figure 2 is a simplified diagram illustrating the overall self-supervised training framework of the field extraction system, according to embodiments described herein.
Figure 3 is a block diagram illustrating an example framework for refining the field extraction framework described in Figure 2 using progressive pseudo-label ensembles (PLEs), according to embodiments described herein.
Figure 4 is a simplified diagram of a computing device implementing the field extraction framework, according to some embodiments described herein.
Figure 5 is a simplified diagram of a method for field extraction from a form with unlabeled data by a field extraction model, according to some embodiments.
Figure 6 is a simplified diagram of a method for label refinement in field extraction from a form with unlabeled data by a field extraction model, according to some embodiments.
Figure 7 is a data table providing example key lists and data types for a training dataset of unlabeled form data, according to some embodiments.
Figures 8A and 8B are diagrams illustrating exemplary unlabeled forms, according to some embodiments.
Figures 9 through 16 provide exemplary results of data experiments with the field extraction model described in Figures 1 through 6, according to some embodiments.
In the drawings, elements with the same designation have the same or similar functions.
Detailed Description
Machine learning systems have been widely used in computer vision, for example, in pattern recognition and object localization. Some recent machine learning methods formulate form field extraction as field-value pairing or field tagging. For example, some existing systems adopt a representation learning approach that takes field and value candidates as input and applies metric learning techniques to enforce high pairing scores for positive field-value pairs and low scores for negative pairs. Another system uses a pretrained transformer that takes text and its location as input. However, these existing methods usually require large-scale field-level annotations for training. Acquiring field-level annotations of forms can be quite costly and labor-intensive, and sometimes even impossible, because (1) forms usually contain sensitive information, so public data available for training purposes are limited; and (2) given the risk of exposing private information, using external annotators is also infeasible.
Given the need for an efficient system for extracting information from form documents, embodiments describe a field extraction system that does not require field-level annotations for training. Specifically, the training process is bootstrapped by mining pseudo-labels from unlabeled forms using simple rules. A transformer-based structure is then used to model the interactions between the text tokens in an input form and to predict a field tag for each token accordingly. The pseudo-labels are used to supervise the transformer training. Since the pseudo-labels are noisy, a refinement module containing a sequence of branches is used to refine them. Each refinement branch performs field tagging and generates refined labels. At each stage, a branch is optimized by ensembling the labels from all previous branches to reduce label noise.
For example, the field extraction system is trained on self-supervised pseudo-labels derived from unlabeled data. Specifically, the field extraction system detects a set of words and their locations in a form and identifies field values based on geometric rules between words; for example, a field key and its field value are often horizontally aligned and separated by a colon. The identified field values can then be used as pseudo-labels to train a transformer network that encodes the detected words and locations for classification.
In some embodiments, multiple progressive pseudo-label ensemble (PLE) branches may be used to refine the pseudo-labels used for training. Specifically, the PLE branches operate in parallel to generate predicted classifications from the encoded representations of the detected words and locations. At each branch, a loss component is computed by comparing the refined labels at that branch with the predicted labels generated by the previous PLE branches, which serve as pseudo-labels. The loss components across the PLE branches are then summed to jointly update the PLEs.
As used herein, the term "network" may include any hardware- or software-based framework that includes any artificial intelligence network or system, neural network or system, and/or any training or learning model implemented thereon or therewith.
As used herein, the term "module" may include a hardware- or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
Figure 1 is a simplified diagram 100 illustrating an example of extracting fields from a bill, according to one embodiment described herein. Traditionally, in form processing, workers are given a list of expected form fields, for example, purchase order, bill number, and total amount, and the goal is to extract the corresponding values based on an understanding of the form. Keys, e.g., "Bill #", "PO Number", and "Total", are the specific text representations of fields in a form and are important indicators for value localization; the key is often the single most important feature for locating a value. A field extraction system therefore aims to automatically extract field values from among the irrelevant information in a form, which is crucial for improving processing efficiency and reducing human labor.
As shown in diagram 100, the form contains various phrases, such as "Bill #", "1234", "PO Number", and "0000001". The field extraction system can identify "PO Number" 102 as the located key and then determine whether any of the values "1234" 104, "0000001" 103, or "100.00" 105 matches the located key. This match can be determined based on the geometric relationship between the located key 102 and the values 103 through 105. For example, a rule-based algorithm can be applied to determine the match: the value "0000001" 103 is more likely to be the value corresponding to the located key 102 because the location of value 103 is vertically aligned with the location of the located key 102.
Unlike previous methods that assume access to large-scale labeled forms, a rule-based method can be used to generate noisy pseudo-labels (e.g., fields and values) from unlabeled data. The rule-based algorithm is built on the following observations: (1) a field value (e.g., 103 in Figure 1) is usually presented in a form together with a key (e.g., 102 in Figure 1), and the key is the specific text representation of that field; (2) keys and their corresponding values have strong geometric relationships (as shown in Figure 1, keys are mostly right next to their values, vertically or horizontally); (3) although form layouts are highly diverse, a small set of key texts is commonly reused across different form instances (e.g., the key text of the field purchase order can be "PO Number", "PO #", etc.); and (4) a field value is always associated with a certain data type (e.g., the data type of the value of "bill date" is a date, and the data type of the value of "total amount" is an amount or number).
Therefore, the rule-based method can be used to generate useful pseudo-labels for each field of interest from large-scale forms. As shown in Figure 1, key localization 102 is performed first, based on string matching between the text in the form and the possible key strings of a field. Values 103 through 105 are then estimated based on the data type of the text and its geometric relationship to the located key 102.
Figure 2 is a simplified diagram illustrating the overall self-supervised training framework 200 of the field extraction system according to embodiments described herein. Framework 200 includes an optical character recognition (OCR) module 205, a transformer network 210, and a classifier 220. An unlabeled form 202, e.g., a check, bill, pay stub, and/or the like, may include field information from a predefined list {fd_1, fd_2, ..., fd_N}. Given a form as input, a general OCR detection and recognition module 205 is applied to the unlabeled form 202 to obtain a set of words {w_1, w_2, ..., w_M} whose locations are represented as bounding boxes {b_1, b_2, ..., b_M}. The goal of the field extraction method is then to automatically extract, from the large pool of word candidates {w_1, w_2, ..., w_M}, the target value v_i matching field fd_i, if information for that field exists in the input form.
The word and bounding-box location pairs {w_i, b_i} can then be input to the transformer encoder 210 to be encoded into feature representations. The pairs {w_i, b_i} can also be sent to a pseudo-label inference module 215 configured to perform key localization and value estimation: key localization identifies the location of the key corresponding to each predefined field, and value estimation determines the field value corresponding to the located key.
For example, since keys and values can contain multiple words, upon receiving the word and bounding-box location pairs {w_i, b_i}, the pseudo-label inference module 215 can use the DBSCAN algorithm (Ester et al., 1996) to group nearby detected words based on their locations, obtaining phrase candidates [ph_i^1, ph_i^2, ..., ph_i^T] and their locations [b_i^1, b_i^2, ..., b_i^T].
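A minimal Python sketch of this phrase-grouping step is shown below, clustering word-box centers with scikit-learn's DBSCAN; the eps threshold and the use of box centers as the clustering features are illustrative assumptions.

```python
# A minimal sketch of phrase grouping with DBSCAN; `eps` (in pixels) and the
# use of box centers as clustering features are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def group_words_into_phrases(words, boxes, eps=20.0):
    """words: M strings; boxes: M (x0, y0, x1, y1) tuples in pixels.
    Returns a list of (phrase_text, phrase_box) candidates."""
    centers = np.array([[(x0 + x1) / 2, (y0 + y1) / 2] for x0, y0, x1, y1 in boxes])
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(centers)
    phrases = []
    for cluster in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cluster]
        idx.sort(key=lambda i: (boxes[i][1], boxes[i][0]))  # rough reading order
        text = " ".join(words[i] for i in idx)
        box = (min(boxes[i][0] for i in idx), min(boxes[i][1] for i in idx),
               max(boxes[i][2] for i in idx), max(boxes[i][3] for i in idx))
        phrases.append((text, box))
    return phrases
```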
For each field of interest fd_i, a list of commonly used keys [k_i^1, k_i^2, ..., k_i^L] is determined based on domain knowledge. For example, the field name can be used as the only key in the list. Module 215 can then measure the string distance between a phrase candidate ph_i^j and each key k_i^r in the list as d(ph_i^j, k_i^r), and compute, for each phrase candidate, a key score indicating the likelihood of that candidate being the key of the field, per equation (1).
The key is then located by finding the candidate with the largest key score, per equation (2).
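Because the figures for equations (1) and (2) are not reproduced in the text, the following sketch approximates the key score as a normalized string similarity; the acceptance threshold theta_k is an assumption.

```python
# Sketch of key localization. The key score of equation (1) is approximated
# here by a normalized string similarity; `theta_k` is an assumed threshold.
import difflib

def key_score(phrase, key_list):
    # Likelihood that `phrase` is the key of the field whose common key
    # strings are `key_list`; 1.0 corresponds to an exact string match.
    return max(difflib.SequenceMatcher(None, phrase.lower(), k.lower()).ratio()
               for k in key_list)

def locate_key(phrases, key_list, theta_k=0.8):
    # Equation (2): pick the phrase candidate with the largest key score.
    scored = [(key_score(text, key_list), text, box) for text, box in phrases]
    best = max(scored, key=lambda s: s[0])
    return best if best[0] >= theta_k else None  # (score, text, box) or None
```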
The pseudo-label inference module 215 can then determine the value (or one or more values, if applicable) of the located key. Specifically, values are estimated according to two criteria: first, their data type should be consistent with their field; second, their location should be highly consistent with the located key. For each field, a list of qualified data types can be predetermined; for example, for the field "bill number", the data types can include string and integer. A pretrained BERT-based model can be used to predict the data type of each phrase candidate, and only candidates ph_i^j with a correct data type are retained.
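For illustration only, the data-type filter below replaces the pretrained BERT-based type predictor with simple date-parsing and regular-expression heuristics; the type names and patterns are assumptions, not the patent's classifier.

```python
# Illustrative stand-in for the data-type filter; the patent uses a pretrained
# BERT-based model, approximated here by simple heuristics for illustration.
import re
from datetime import datetime

def infer_data_type(text):
    text = text.strip()
    for fmt in ("%m/%d/%Y", "%m/%d/%y", "%d %b %Y", "%B %d, %Y"):
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            pass
    if re.fullmatch(r"\d+", text):
        return "integer"
    if re.fullmatch(r"[$€£]?\s?\d[\d,]*(\.\d+)?", text):
        return "amount"
    return "string"

def filter_by_type(candidates, allowed_types):
    # Keep only phrase candidates whose predicted type matches the field.
    return [(t, b) for t, b in candidates if infer_data_type(t) in allowed_types]
```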
In one embodiment, a value score is determined for each qualified candidate ph_i^j per equation (3), in which key_score denotes the key score of the located key and the remaining term denotes a geometric-relationship score between the candidate and the located key. A key (e.g., 102 in Figure 1) and its value (e.g., 103 in Figure 1) are usually close to each other, and the value is likely to be directly below the key or to its right. Therefore, geometric relationships, such as distance and angle, are determined to measure the key-value relationship, per equation (4),
in which the distance term measures the distance between the two phrases, the angle term measures the angle from ph_i^j to the located key, and φ(·|μ, σ) denotes a Gaussian function with mean μ and standard deviation σ. Here, μ_a is set to 0, and σ_b and σ_a are fixed to 0.5. To reward candidates whose angle relative to the key is close to 0 or π/2, the angle score is taken as the maximum of the scores toward these two options.
Therefore, per equation (5), if the value score of a candidate is the largest among all candidates and exceeds a threshold, e.g., θ_v = 0.1, that candidate is determined to be the predicted value of the field.
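The sketch below illustrates the value scoring of equations (3) through (5) under stated assumptions: the normalization of distance by the key height with α = 4.0, the multiplicative combination of the key score with the distance and angle scores, and the image-coordinate angle convention are all assumptions; μ = 0, σ = 0.5, and θ_v = 0.1 follow the text.

```python
# Sketch of value scoring (equations (3)-(5)). The multiplicative combination
# and the distance normalization are assumptions; mu = 0, sigma = 0.5, and
# theta_v = 0.1 follow the text, and alpha = 4.0 follows the experiments.
import math

def gaussian(x, mu=0.0, sigma=0.5):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def value_score(cand_box, key_box, key_score, alpha=4.0):
    (cx, cy), (kx, ky) = center(cand_box), center(key_box)
    key_h = max(key_box[3] - key_box[1], 1e-6)
    dist = math.hypot(cx - kx, cy - ky) / (alpha * key_h)  # normalized distance
    angle = math.atan2(cy - ky, cx - kx)  # image coords: 0 = right, pi/2 = below
    # Equation (4): reward candidates right of (angle ~ 0) or below
    # (angle ~ pi/2) the located key.
    angle_score = max(gaussian(angle), gaussian(angle - math.pi / 2))
    return key_score * gaussian(dist) * angle_score

def pick_value(candidates, key_box, key_score, theta_v=0.1):
    # Equation (5): keep the top-scoring candidate only if it exceeds theta_v.
    scored = [(value_score(box, key_box, key_score), text) for text, box in candidates]
    best = max(scored, default=(0.0, None))
    return best[1] if best[0] > theta_v else None
```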
In one embodiment, the output of the pseudo-label inference module 215, e.g., the estimated value of a field serving as a pseudo-label, can be used as a standalone field extraction output. In another embodiment, the estimated values of fields can be used as pseudo-labels to bootstrap training and further improve field extraction performance. Specifically, to predict the target label of a word, the model needs to learn the meaning of that word and its interactions with the surrounding context. A transformer-based architecture (e.g., LayoutLM, described in Xu et al., 2020) can be used to learn word representations because of its strong ability to model contextual information. Besides the semantic representation, the locations of words and the overall layout of the input form are also important and can be used to capture the distinguishing features of a word. The transformer encoder 210 can extract features from the input pairs {w_i, b_i}:
[f_1, f_2, ..., f_M] = T([(w_1, b_1), (w_2, b_2), ..., (w_M, b_M)]), (6)
where T(·) denotes the transformer-based feature extractor and f_i denotes the feature of word i.
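A sketch of this feature extraction with a LayoutLM backbone via the Hugging Face transformers library is shown below; the sample words and boxes are illustrative, and boxes are assumed to be normalized to the 0-1000 range per the LayoutLM convention.

```python
# Sketch of equation (6) with a LayoutLM backbone; words and boxes are
# illustrative, with boxes assumed normalized to 0-1000 per LayoutLM.
import torch
from transformers import LayoutLMModel, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["PO", "Number", "0000001"]
word_boxes = [[110, 60, 140, 75], [145, 60, 210, 75], [110, 80, 180, 95]]

# Expand each word's box to all of its word pieces so boxes align with tokens.
input_ids, token_boxes = [tokenizer.cls_token_id], [[0, 0, 0, 0]]
for word, box in zip(words, word_boxes):
    ids = tokenizer.encode(word, add_special_tokens=False)
    input_ids.extend(ids)
    token_boxes.extend([box] * len(ids))
input_ids.append(tokenizer.sep_token_id)
token_boxes.append([1000, 1000, 1000, 1000])

outputs = model(input_ids=torch.tensor([input_ids]), bbox=torch.tensor([token_boxes]))
features = outputs.last_hidden_state  # token features [f_1, ..., f_M]
```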
The classifier 220 for token classification can receive as input the encoded feature representations from the transformer encoder 210, which generates, from the original unlabeled form 202, a predicted field (including a background class) for each token. Specifically, the classifier 220 generates field prediction scores s_k by projecting the features into the field space {background, fd_1, fd_2, ..., fd_N} via a fully connected (FC) layer. The predicted field scores from the classifier 220 and the pseudo-labels generated by the pseudo-label inference module 215 can then be compared at a loss module 230 to produce a training objective. The training objective can further be used to update the transformer 210 and the classifier 220 via a backpropagation path (shown by the dashed lines).
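A minimal sketch of the token classifier and its cross-entropy objective against the mined pseudo-labels follows; the hidden size of 768 and the seven-field setup are illustrative.

```python
# Minimal sketch of the token classifier 220 and the loss at module 230: an
# FC layer projects token features into {background, fd_1, ..., fd_N}, and a
# cross-entropy loss against the mined pseudo-labels drives backpropagation.
import torch
import torch.nn as nn

num_fields = 7
classifier = nn.Linear(768, num_fields + 1)        # +1 for the background class

features = torch.randn(1, 12, 768, requires_grad=True)  # stand-in for f_1..f_M
pseudo_labels = torch.randint(0, num_fields + 1, (1, 12))

scores = classifier(features)                      # field prediction scores
loss = nn.CrossEntropyLoss()(scores.view(-1, num_fields + 1),
                             pseudo_labels.view(-1))
loss.backward()  # gradients reach the classifier and, in the full model,
                 # flow back through the transformer encoder as well
```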
In one embodiment, as further described in Figure 3, multiple progressive pseudo-label ensembles (PLEs) can be used to bootstrap training.
Figure 3 is a block diagram illustrating an example framework 300 for refining the field extraction framework described in Figure 2 using PLEs, according to embodiments described herein. As described in Figure 2, the transformer 210 receives an input 302 of the words extracted from the unlabeled form 202 and the locations of the bounding boxes surrounding the words, (w_1, b_1), (w_2, b_2), ..., (w_M, b_M), based on which the initial word-level field labels ỹ_0 (also referred to as bootstrap labels) are obtained from the pseudo-labels estimated at the pseudo-label inference module 215. The transformer network 210 can therefore be optimized with a cross-entropy loss L(s_k, ỹ_0), computed based on the field prediction scores from the classifier 220 and the generated bootstrap labels.
However, using only the noisy bootstrap labels as ground truth in training may degrade model performance. A refinement module 304 including multiple PLEs, each serving as a classification branch, is therefore employed after the transformer 210. Specifically, at each branch j, the PLE independently performs field classification and refines the pseudo-labels based on its predictions ỹ_j, and later branches are optimized using the refined labels obtained from previous branches.
For example, at branch k, refined labels are generated according to the following steps: (1) find the predicted field label of each word via argmax over the scores s_k; and (2) for each field, retain a word only when its prediction score is the highest among all words and greater than a threshold (fixed to 0.1). For example, assume the PLE module 304 includes branches 304a through 304n. The first PLE branch 304a can receive the pseudo-labels ỹ_0 generated by the pseudo-label inference module 215, based on which its FC layer generates field classification scores s_1, which are then converted into pseudo-labels ỹ_1. The bootstrap labels ỹ_0 and the output pseudo-labels ỹ_1 are then fed to the second PLE branch 304b, based on which its FC layer generates field classification scores s_2, which are converted into pseudo-labels ỹ_2. Following a similar process, the k-th PLE branch receives the bootstrap labels ỹ_0 and all previously generated pseudo-labels ỹ_1, ..., ỹ_{k-1}, based on which its FC layer generates field classification scores s_k and converts them into pseudo-labels ỹ_k.
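A sketch of this per-branch refinement rule follows; the background class index 0 and the per-form score layout are assumptions.

```python
# Sketch of the per-branch refinement rule: argmax field per word, then for
# each field keep only the top-scoring word if its score exceeds 0.1.
import torch

def refine_labels(scores, background=0, threshold=0.1):
    """scores: [num_words, num_classes] per-word class scores for one form."""
    probs, preds = scores.max(dim=-1)            # step (1): argmax per word
    refined = torch.full_like(preds, background)
    for field in range(scores.shape[1]):
        if field == background:
            continue
        field_probs = torch.where(preds == field, probs, torch.zeros_like(probs))
        best = field_probs.argmax()
        if field_probs[best] > threshold:        # step (2): keep top word only
            refined[best] = field
    return refined
```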
Therefore, the final loss aggregates all of the losses and is computed per equation (7), where β is a hyperparameter controlling the contribution of the initial pseudo-labels.
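The exact form of equation (7) is not shown in the text, so the sketch below assumes each branch k is supervised by a β-weighted cross-entropy on the bootstrap labels plus cross-entropy terms on the refined labels of all previous branches; β = 1.0 follows the experiments.

```python
# Assumed aggregation for equation (7): per branch, a beta-weighted term on
# the bootstrap labels plus terms on the refined labels of earlier branches.
import torch.nn.functional as F

def ple_loss(branch_scores, bootstrap_labels, refined_labels, beta=1.0):
    """branch_scores: list of [num_words, C] logits, one tensor per branch.
    refined_labels[j]: [num_words] labels produced by branch j."""
    total = 0.0
    for k, scores in enumerate(branch_scores):
        total = total + beta * F.cross_entropy(scores, bootstrap_labels)
        for j in range(k):   # ensemble the labels of all previous branches
            total = total + F.cross_entropy(scores, refined_labels[j])
    return total
```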
In this way, the progressive refinement of labels reduces label noise. However, using only the refined labels at each stage yields limited performance improvement: although the labels become more precise after refinement, some low-confidence values are filtered out, which results in lower recall. To alleviate this problem, each branch is improved using the ensembled labels of all previous stages. The ensembled labels not only strike a better balance between precision and recall, but are also more diverse and can serve as a regularizer for model optimization. During inference, the average of the scores predicted by all branches can be used, and a procedure similar to generating the refined labels can be applied to obtain the final field values.
Computer Environment
Figure 4 is a simplified diagram of a computing device 400 implementing the field extraction framework, according to some embodiments described herein. As shown in Figure 4, computing device 400 includes a processor 410 coupled to a memory 420. Operation of computing device 400 is controlled by processor 410. And although computing device 400 is shown with only one processor 410, it should be understood that processor 410 may be representative of one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), graphics processing units (GPUs), and/or the like in computing device 400. Computing device 400 may be implemented as a standalone subsystem, as a board added to a computing device, and/or as a virtual machine.
Memory 420 may be used to store software executed by computing device 400 and/or one or more data structures used during operation of computing device 400. Memory 420 may include one or more types of machine-readable media. Some common forms of machine-readable media may include a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 410 and/or memory 420 may be arranged in any suitable physical arrangement. In some embodiments, processor 410 and/or memory 420 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 410 and/or memory 420 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 410 and/or memory 420 may be located in one or more data centers and/or cloud computing facilities.
In some examples, memory 420 may include non-transitory, tangible, machine-readable media that include executable code that, when run by one or more processors (e.g., processor 410), may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 420 includes instructions for a field extraction module 430 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. In some examples, the field extraction module 430 may receive an input 440, e.g., an unlabeled image instance such as a form, via a data interface 415. The data interface 415 may be any of a user interface that receives a form image instance uploaded by a user, or a communication interface that receives or retrieves a previously stored form image instance from a database. The field extraction module 430 may generate an output 450, such as the extracted fields of the input 440.
In some embodiments, the field extraction module 430 may further include a pseudo-label inference module 431 and a PLE module 432. The pseudo-label inference module 431 mines noisy pseudo-labels from forms using the rule-based method, e.g., as described in Figure 2. The PLE module 432 (similar to the refinement module 304 in Figure 3) can use the estimated values of fields as pseudo-labels during training to learn a data-driven model, implemented as a token classification task with an input of a set of tokens extracted from a form and an output of a predicted field (including a background class) for each token. Further details of the PLE module 432 are discussed in relation to Figure 3.
Field Extraction Workflow
Figure 5 is a simplified diagram of a method 500 for field extraction from a form with unlabeled data by a field extraction model, according to some embodiments. One or more of the processes of method 500 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that, when run by one or more processors, may cause the one or more processors to perform one or more of the processes. In some embodiments, method 500 corresponds to the operation of the field extraction module 430 (Figure 4) to perform field extraction or a method of training the field extraction model. As illustrated, method 500 includes a number of enumerated steps, but aspects of method 500 may include additional steps before, after, and in between the enumerated steps. In some aspects, one or more of the enumerated steps may be omitted or performed in a different order.
At step 502, an unlabeled form including a plurality of fields and a plurality of field values is received via a data interface (e.g., 415 in Figure 4). For example, the unlabeled form may take a form similar to the forms shown in Figures 8A and 8B.
At step 504, a set of words and a set of locations of the set of words are detected within the unlabeled form. For example, the words and locations can be detected by the OCR module 205 in Figure 2.
At step 506, a field value of a field is identified from the set of words and the set of locations, based at least in part on geometric relationships between the words. For example, the field value may be identified by applying a first rule that one or more words in the form of a key relate to the field name of the field. As another example, the field value may be identified by applying a second rule that a horizontally or vertically aligned pair of words is a key of the field and the field value. As another example, the field value may be identified by applying a third rule that a word from the set of words matching a predefined key text is the key of the field.
In one implementation, a key location corresponding to the field is determined. For example, a set of phrase candidates is determined from the set of words by grouping nearby detected words, and a corresponding set of phrase locations is determined from the set of locations. A key score indicating the likelihood that the respective phrase candidate is the key of the field is computed for each phrase candidate. The key score is computed based on the string distance between the respective phrase candidate and a predefined key, e.g., per equation (1). The key of the field is then determined based on the largest key score among the set of phrase candidates, e.g., per equation (2).
Specifically, to determine the field value, a neural model can be used to predict the respective data type of each phrase candidate. A subset of the phrase candidates having data types matching the predefined data type of the field is then determined. A value score indicating the likelihood that the respective phrase candidate is the field value of the field is computed for each phrase candidate in the subset. The value score is computed based on the key score of the located key corresponding to the field and a geometric-relationship metric between the respective phrase candidate and the located key, e.g., per equation (3). The geometric-relationship metric is computed based on the distance and the angle between the respective phrase candidate and the located key, e.g., per equation (4). The field value is then determined based on the largest value score among the subset of phrase candidates.
At step 508, an encoder (e.g., transformer encoder 210 in Figure 2) may encode a pair of a first word and a first location corresponding to the field value into a first representation.
At step 510, a classifier (e.g., classifier 220 in Figure 2) may generate a field classification distribution from the first representation.
At step 512, a first loss objective is computed by comparing the field classification distribution with the field value serving as a pseudo-label.
At step 514, the encoder is updated via backpropagation based on the first loss objective.
Figure 6 is a simplified diagram of a method 600 for label refinement in field extraction from a form with unlabeled data by a field extraction model, according to some embodiments. One or more of the processes of method 600 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that, when run by one or more processors, may cause the one or more processors to perform one or more of the processes. In some embodiments, method 600 corresponds to the operation of the field extraction module 430 (Figure 4) to perform field extraction or a method of training the field extraction model. As illustrated, method 600 includes a number of enumerated steps, but aspects of method 600 may include additional steps before, after, and in between the enumerated steps. In some aspects, one or more of the enumerated steps may be omitted or performed in a different order.
At step 602, an unlabeled form including a plurality of fields and a plurality of field values is received via a data interface (e.g., 415 in Figure 4). For example, the unlabeled form may take a form similar to the forms shown in Figures 8A and 8B.
At step 604, a first word and a first location of the first word are detected within the unlabeled form. For example, the word and location can be detected by the OCR module 205 in Figure 2.
At step 606, an encoder (e.g., transformer encoder 210 in Figure 2) encodes the pair of the first word and the first location into a first representation, e.g., per equation (6).
At step 608, a plurality of progressive pseudo-label ensemble (PLE) branches (e.g., see 304a through 304n in Figure 3) generate, in parallel, a plurality of predicted labels based on the first representation, respectively. Each of the plurality of PLE branches includes a respective classifier that generates a respective predicted label based on the first representation. A predicted label at one PLE branch is generated by projecting the first representation into a set of field prediction scores via one or more fully connected layers and generating the predicted label based on the largest field prediction score among the set of words. When the largest field prediction score is greater than a predefined threshold, for a field of the plurality of fields, the word corresponding to the largest field prediction score is selected from the set of words.
At step 610, the one PLE branch computes a loss component by comparing the predicted label at the one PLE branch with the predicted labels from previous PLE branches serving as pseudo-labels.
At step 612, a loss objective is computed as the sum of the loss components over the plurality of PLE branches, e.g., per equation (7).
At step 614, the plurality of PLE branches are updated via backpropagation based on the loss objective. In one embodiment, a first PLE branch of the plurality of PLE branches uses the field value identified at step 506 in Figure 5 as the first pseudo-label. A joint loss objective is computed by adding the loss objective to the first loss objective computed at step 512 in Figure 5. The encoder and the plurality of PLE branches are then jointly updated based on the joint loss objective.
Example Performance
An example training dataset may include real bills collected from different vendors. For example, the training set contains 7,664 unlabeled bill forms from 2,711 templates, the validation set contains 348 labeled bills from 222 templates, and the test set contains 339 labeled bills from 222 templates. Each template has at most five images in each set. Seven frequently used fields are considered: bill number, purchase order, bill date, due date, amount due, total amount, and total tax.
For the tobacco test set, 350 bills were collected for public release from the tobacco collection of an industry documents library. The validation and test sets of the in-house bill dataset have similar statistical distributions of fields, while the public tobacco test set differs. For example, bills from the tobacco set (as shown in Figure 8A) can have lower resolution and more cluttered backgrounds than the other bills in the training dataset (as shown in Figure 8B).
The end-to-end macro-average F1 score over fields is used as the metric for evaluating models. Specifically, exact string matching between the predicted values and the ground-truth values is used to count true positives, false positives, and false negatives, and the precision, recall, and F1 score of each field are obtained accordingly. Reported scores are averaged over five runs to reduce the effect of randomness.
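A sketch of this metric follows: exact string matching per field, then macro-averaging the per-field F1 scores; the dictionary-based prediction format is an assumption.

```python
# Sketch of the evaluation metric: exact string match per field, then
# macro-averaged F1 over fields.
def macro_f1(predictions, ground_truth, fields):
    """predictions / ground_truth: {doc_id: {field: value_string}}."""
    f1_scores = []
    for field in fields:
        tp = fp = fn = 0
        for doc_id, truth in ground_truth.items():
            pred = predictions.get(doc_id, {}).get(field)
            true = truth.get(field)
            if pred is not None and pred == true:
                tp += 1
            elif pred is not None:
                fp += 1                      # predicted, but wrong or spurious
            if true is not None and pred != true:
                fn += 1                      # annotated value was missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)
```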
Since there is no existing method that performs field extraction using only unlabeled data, the following baselines are constructed to validate the approach. Bootstrap labels (B-Labels): the initial pseudo-labels inferred with the proposed simple rules can be used for field extraction directly, without any training data. Transformers trained with B-Labels: since a transformer is used as the backbone to extract word features, transformer models are trained with B-Labels as baselines to evaluate the performance gains from (1) the data-driven models in the pipeline and (2) the refinement module. Both the content of a text and its location are important for field prediction. An example of a transformer backbone is LayoutLM, which takes both text and location as input. In addition, two popular transformer models, BERT and RoBERTa, which accept only text as input, are used.
An OCR engine is used to detect words and their locations, and the words are then sorted in reading order. Example key lists and data types for each dataset are shown in Table 1 of Figure 7. The key lists and data types are quite general. α in equation (4) is set to 4.0. To further remove false positives, a candidate value is removed if the located key is not within its neighboring region. Specifically, the neighboring region around a candidate value extends all the way to the left side of the image, four candidate heights above it, and one candidate height below it. The number of refinement branches is k = 3 for all experiments. When the number of stages is greater than 1, a hidden FC layer with 768 units is added before classification. For all bill experiments, β in equation (7) is set to 1.0, except that β = 5.0 for the BERT-based refinement in Table 4 of Figure 11 because of its better performance on the validation set. For the field extraction models and baselines described herein, the model with the best F1 score on the validation set is selected. To prevent overfitting, a two-step training strategy is adopted, in which the first branch of the model is trained using the pseudo-labels and is then fixed, together with the feature extractor, during refinement. The batch size is set to 8, and the Adam optimizer is used with a learning rate of 5e-5.
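A sketch of the reported optimization settings and the two-step strategy follows; the nn.Linear modules are placeholders for the LayoutLM backbone and the PLE branches.

```python
# Sketch of the training setup: batch size 8, Adam with learning rate 5e-5,
# and the two-step strategy (train backbone + first branch, then freeze them
# while training the refinement branches). Modules here are stand-ins.
import torch
import torch.nn as nn

backbone = nn.Linear(768, 768)                                 # stand-in
branches = nn.ModuleList(nn.Linear(768, 8) for _ in range(3))  # k = 3 branches
batch_size = 8

# Step 1: train the backbone together with the first branch on the B-Labels.
step1_opt = torch.optim.Adam(
    list(backbone.parameters()) + list(branches[0].parameters()), lr=5e-5)

# Step 2: freeze both, and train only the later refinement branches.
for p in list(backbone.parameters()) + list(branches[0].parameters()):
    p.requires_grad = False
step2_opt = torch.optim.Adam(
    [p for b in branches[1:] for p in b.parameters()], lr=5e-5)
```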
The proposed model is then validated using the in-house bill dataset, since it contains large-scale unlabeled training data and a sufficient amount of validation/test data, which better suits the experimental setup. LayoutLM is first used as the backbone to validate the proposed training method. Comparison results are shown in Table 2 of Figure 9 and Table 3 of Figure 10. The bootstrap-label (B-Labels) baseline reaches F1 scores of 43.8% and 44.1% on the validation and test sets, respectively, indicating that the B-Labels have reasonable accuracy but are still noisy. When the B-Labels are used to train a LayoutLM transformer, a significant performance improvement is obtained: about 15% on the validation set and about 17% on the test set. Adding the PLE refinement module markedly improves model precision, by about 6% on the validation set and about 7% on the test set, while slightly reducing recall, by about 2.5% on the validation set and about 3% on the test set. This is because the refined labels become increasingly confident in later stages, leading to higher model precision; however, the refinement stages also remove some low-confidence false negatives, which leads to the lower recall. Overall, the PLE refinement module further improves performance by 3% in F1 score.
LayoutLM is then used as the default feature backbone, since both text and its location are important for this task. Furthermore, to understand the effect of different transformer models as the backbone, two additional models, BERT and RoBERTa, which use only text as input, are evaluated. Comparison results are shown in Table 4 of Figure 11 and Table 5 of Figure 12. It is observed that large improvements are achieved when BERT and RoBERTa are trained directly with the B-Labels and the PLE refinement module, consistently improving the baseline results for different transformer choices with different numbers of parameters (base or large). However, LayoutLM still yields higher results than the other two backbones, suggesting that text location is indeed very important for obtaining good performance on this task.
The proposed model is then tested using the tobacco test set introduced in Table 6 of Figure 13. The simple rule-based method obtains an F1 score of 25.1%, which is reasonable but much lower than the results on the in-house bill dataset. The reason is that the tobacco test set contains visual noise, which leads to more text recognition errors. When the B-Labels are used, the LayoutLM baseline obtains a large improvement, and the PLE refinement module further improves the F1 score by about 2%. The results show that the proposed method generalizes well to different scenarios. As shown in Figures 8A and 8B, the proposed method achieves good performance even though the sample bills from different templates are highly diverse, with cluttered backgrounds and low resolution.
以基于LayoutLM为主干,在账单数据集上进一步进行消融研究。级数的影响:所提出的模型在k个级中被细化,同时在所有实验中固定k=3。它是用不同的级数来评估的。图15显示了当级数k增加时,模型通常在有效集和测试集上都执行得更好。多级的性能总是高于单级模型(我们的变换器基线)。当k=3时,模型性能达到最高。如图16中所示,在模型细化期间,当召回率下降时,精确度提高。当k=3时,精确度和召回率之间获得最佳平衡。当k>3时,召回率下降大于精确度提高,因此观察到更差的F1分数。Using LayoutLM as the backbone, further ablation research was conducted on the bill data set. Effect of the number of stages: The proposed model is refined in k stages while fixing k = 3 in all experiments. It is evaluated using different series. Figure 15 shows that when the number of stages k increases, the model generally performs better on both the valid set and the test set. The performance of multi-stage is always higher than that of single-stage model (our converter baseline). When k=3, the model performance reaches the highest level. As shown in Figure 16, during model refinement, while recall decreases, precision increases. When k=3, the best balance between precision and recall is obtained. When k > 3, the recall decreases more than the precision increase, so worse F1 scores are observed.
细化标签(R-Labels)的影响:为了分析这种设计的影响,在最终损失中移除所有细化标签,并且仅使用B-Labels来独立训练三个分支,并在推理期间集合预测。如图14的表7中所示,在有效集和测试集中,去除细化标签分别导致F1分数下降2.2%和2.6%。Impact of Refinement Labels (R-Labels): To analyze the impact of this design, all refinement labels were removed from the final loss and only B-Labels were used to train the three branches independently and ensemble predictions during inference. As shown in Table 7 of Figure 14, removing the refinement labels leads to a decrease in F1 score of 2.2% and 2.6% in the valid set and test set, respectively.
Effect of B-Labels regularization: at each stage, the B-Labels are used as a form of regularization that prevents the model from overfitting to over-confident refined labels. This regularization is removed by setting β = 0 in equation (7). As shown in Table 7 of Figure 14, without it, model performance drops by about 2% in F1 score.
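Equation (7) itself is not reproduced in this section; assuming it takes the common form of a convex combination of the two label losses, the regularizer can be sketched as below, with beta = 0 recovering the un-regularized ablation. The function name and the exact weighting are assumptions.

```python
import torch.nn.functional as F

def refinement_loss(logits, refined_labels, b_labels, beta=0.5):
    """Assumed form: a convex mix of the refined-label and B-Label losses.

    Shapes: logits (num_tokens, num_classes); each label tensor (num_tokens,).
    beta = 0 drops the B-Label term, matching the ablation described above.
    """
    return ((1 - beta) * F.cross_entropy(logits, refined_labels)
            + beta * F.cross_entropy(logits, b_labels))
```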
Effect of the two-step training strategy: to avoid overfitting noisy labels, a two-step training strategy is adopted in which the backbone, together with the first branch, is trained using the B-Labels and then kept fixed during refinement. This effect is analyzed by instead training the model in a single step: single-step training lowers the F1 score by 1.8% on the validation set and 1.4% on the test set.
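A minimal PyTorch sketch of this schedule follows; the fit and fit_heads training loops are hypothetical placeholders standing in for the actual optimization.

```python
import torch

def two_step_training(backbone, first_branch, refine_branches, fit, fit_heads):
    """Hypothetical two-step schedule (illustrative only)."""
    # Step 1: fit the backbone and the first branch on the noisy B-Labels.
    fit(backbone, first_branch)
    # Step 2: freeze the backbone so refinement cannot drag the shared
    # features toward over-confident pseudo-labels.
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()
    for branch in refine_branches:
        fit_heads(branch, backbone)  # only the refinement branch is updated
```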
Some embodiments of a computing device, such as computing device 400, may include non-transitory, tangible, machine-readable media that include executable code which, when run by one or more processors (e.g., processor 410), may cause the one or more processors to perform the processes of method 400. Some common forms of machine-readable media that may include the processes of method 400 are, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative rather than limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
This application is further described with respect to the attached document in Appendix I, entitled "Field Extraction from Forms with Unlabeled Data," 9 pages, which is considered part of this disclosure and is hereby incorporated by reference in its entirety.
Although illustrative embodiments have been shown and described, a wide range of modification, change, and substitution is contemplated in the foregoing disclosure, and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
Claims (40)
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163189579P | 2021-05-17 | 2021-05-17 | |
US63/189,579 | 2021-05-17 | ||
US17/484,618 US12086698B2 (en) | 2021-05-17 | 2021-09-24 | Systems and methods for field extraction from unlabeled data |
US17/484,623 | 2021-09-24 | ||
US17/484,623 US20220366317A1 (en) | 2021-05-17 | 2021-09-24 | Systems and methods for field extraction from unlabeled data |
US17/484,618 | 2021-09-24 | ||
PCT/US2022/014013 WO2022245407A1 (en) | 2021-05-17 | 2022-01-27 | Systems and methods for field extraction from unlabeled data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117396899A (en) | 2024-01-12
Family
ID=89473672
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202280036060.1A Pending CN117396899A (en) | 2021-05-17 | 2022-01-27 | System and method for extracting fields from unlabeled data |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4341872A1 (en) |
JP (1) | JP2024522063A (en) |
CN (1) | CN117396899A (en) |
- 2022
- 2022-01-27 CN CN202280036060.1A patent/CN117396899A/en active Pending
- 2022-01-27 JP JP2023571264A patent/JP2024522063A/en active Pending
- 2022-01-27 EP EP22706709.7A patent/EP4341872A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
JP2024522063A (en) | 2024-06-11 |
EP4341872A1 (en) | 2024-03-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111126069B (en) | Social media short text named entity identification method based on visual object guidance | |
CN109583468B (en) | Training sample acquisition method, sample prediction method and corresponding device | |
CN111738172B (en) | Cross-domain object re-identification method based on feature adversarial learning and self-similarity clustering | |
Arshad et al. | Aiding intra-text representations with visual context for multimodal named entity recognition | |
Xie et al. | Detecting duplicate bug reports with convolutional neural networks | |
CN108520343A (en) | Risk model training method, Risk Identification Method, device, equipment and medium | |
WO2024109619A1 (en) | Sensitive data identification method and apparatus, device, and computer storage medium | |
CN110175851B (en) | Cheating behavior detection method and device | |
WO2020259280A1 (en) | Log management method and apparatus, network device and readable storage medium | |
US20230075290A1 (en) | Method for linking a cve with at least one synthetic cpe | |
CN112883990A (en) | Data classification method and device, computer storage medium and electronic equipment | |
CN113627151B (en) | Cross-modal data matching method, device, equipment and medium | |
CN113742733A (en) | Reading comprehension vulnerability event trigger word extraction and vulnerability type identification method and device | |
CN114495113B (en) | Text classification method and text classification model training method and device | |
CN108595568A (en) | A kind of text sentiment classification method based on very big unrelated multivariate logistic regression | |
Sheng et al. | Semantic-preserving abstractive text summarization with Siamese generative adversarial net | |
Murugesan et al. | ESTIMATION OF PRECISION IN FAKE NEWS DETECTION USING NOVEL BERT ALGORITHM AND COMPARISON WITH RANDOM FOREST. | |
CN117396899A (en) | System and method for extracting fields from unlabeled data | |
US12086698B2 (en) | Systems and methods for field extraction from unlabeled data | |
Matrane et al. | WeVoTe: A weighted voting technique for automatic sentiment annotation of Moroccan dialect comments | |
CN115080735A (en) | Relation extraction model optimization method and device and electronic equipment | |
CN113988059A (en) | Session data type identification method, system, equipment and storage medium | |
WO2022245407A1 (en) | Systems and methods for field extraction from unlabeled data | |
Paul et al. | Multi-facet universal schema | |
CN116028880B (en) | Method for training behavior intention recognition model, behavior intention recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||