WO2023035332A1 - Date extraction method and apparatus, computer device and storage medium - Google Patents

Date extraction method and apparatus, computer device and storage medium

Info

Publication number
WO2023035332A1
Authority
WO
WIPO (PCT)
Prior art keywords
date
text segment
extracted
text
target
Prior art date
Application number
PCT/CN2021/120040
Other languages
English (en)
French (fr)
Inventor
程佳宇
陈永红
张军涛
王国鹏
Original Assignee
深圳前海环融联易信息科技服务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海环融联易信息科技服务有限公司
Publication of WO2023035332A1 publication Critical patent/WO2023035332A1/zh


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Definitions

  • the present application relates to the field of computer technology, in particular to a date extraction method, device, computer equipment and storage medium.
  • In the review of various contracts, the materials to be processed manually often have two distinctive features: (1) the contract type and the elements it covers vary by industry, including but not limited to real estate, medical care, manufacturing, and procurement, which raises the threshold for manual review of the relevant materials and increases the difficulty of the review work; (2) there are many similar elements, mixed with handwriting, other seals, watermarks, and other interfering information, which makes precise element extraction harder. Methods for extracting the various dates in a contract generally fall into two types:
  • The first sorts out positioning rules for keywords or key sentences based on business logic, then matches the required date formats with regular expressions and similar techniques to obtain the final candidate dates; when there are multiple candidates, the final target element value is selected according to the relevant business rules.
  • The second, more widely used approach combines deep learning for date element extraction, i.e., the target value of the date is predicted by a deep learning model.
  • Embodiments of the present application provide a date extraction method, device, computer equipment, and storage medium, aiming at improving the accuracy and efficiency of date extraction.
  • the embodiment of the present application provides a date extraction method, including:
  • the target element of the date to be extracted is obtained, and the date is extracted according to the target element.
  • the embodiment of the present application provides a date extraction device, including:
  • a preprocessing unit configured to acquire a file image containing a date to be extracted, and perform preprocessing on the file image
  • the first acquisition unit is used to perform OCR recognition on the preprocessed file image, and obtain the target text segment containing the date to be extracted in combination with the associated information of the date to be extracted;
  • a tag marking unit configured to use NER technology to tag the target text segment, and output and obtain the date text segment;
  • a post-processing unit configured to classify and predict the date text segment through a classification model, and correct and post-process the date text segment based on the classification prediction result;
  • the date extracting unit is used to obtain the target elements of the date to be extracted according to the correction and post-processing results, and extract the date according to the target elements.
  • an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the date extraction method described in the first aspect is implemented.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the date extraction method described in the first aspect is implemented.
  • the embodiment of the present application provides a date extraction method, device, computer equipment, and storage medium.
  • the method includes: acquiring a document image containing the date to be extracted, and preprocessing the document image; performing OCR recognition on the preprocessed document image, and obtaining the target text segment containing the date to be extracted in combination with the associated information of that date; labeling the target text segment with NER technology, and outputting the date text segment; performing classification prediction on the date text segment with a classification model, and correcting and post-processing the date text segment based on the classification prediction results; obtaining the target element of the date to be extracted from the correction and post-processing results, and extracting the date according to the target element.
  • the embodiment of the present application locates the text segment containing the date to be extracted in combination with the associated information of that date, and uses OCR recognition and NER technology to recognize and label the document image or text segments, so that the target element of the date to be extracted can be obtained accurately; extracting the date from this element improves the precision and efficiency of date extraction.
  • Fig. 1 is a schematic flow chart of a date extraction method provided in the embodiment of the present application.
  • Fig. 2 is a schematic subflow diagram of a date extraction method provided in the embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a date extraction device provided in an embodiment of the present application.
  • Fig. 4 is a sub-schematic block diagram of a date extracting device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a date extraction method provided in the embodiment of the present application, which specifically includes steps S101 to S105.
  • S104 Perform classification prediction on the date text segment by using a classification model, and correct and post-process the date text segment based on the classification prediction result;
  • In this embodiment, preprocessing the document image containing the date to be extracted removes interfering factors such as noise; OCR recognition then identifies the document data in the image, and the target text segment containing the date to be extracted is extracted from the recognized data in combination with the associated information of that date.
  • NER (Named Entity Recognition) technology then labels the target text segment to further obtain the date text segment; on this basis, a classification model is used to perform classification prediction on the date text segment, and the classification prediction results are then corrected and post-processed; in this way,
  • the target element corresponding to the date to be extracted can be obtained. According to the target element, the corresponding date to be extracted can be extracted.
  • This embodiment locates the text segment containing the date to be extracted in combination with the associated information of that date, which reduces interference from the document's other date elements and improves accuracy. OCR recognition and NER technology are used to recognize and label the document image or text segments, so that the target element of the date to be extracted can be obtained accurately; extracting the date from it improves both the precision and the efficiency of date extraction.
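To make the overall flow of steps S101 to S105 concrete, the sketch below wires the five stages (preprocess, OCR, locate, NER labeling, post-process) into one pipeline. It is illustrative only: every stage is passed in as a stand-in callable, and none of the names or stub behaviors come from the patent itself.

```python
from typing import Callable, List

def extract_date(image: bytes,
                 preprocess: Callable[[bytes], bytes],
                 ocr: Callable[[bytes], str],
                 locate: Callable[[str], str],
                 ner_label: Callable[[str], List[str]],
                 postprocess: Callable[[List[str]], str]) -> str:
    """Chain the five stages S101-S105 into a single call."""
    cleaned = preprocess(image)       # S101: denoise, deskew, remove seals/watermarks
    text = ocr(cleaned)               # S102: printed-text OCR
    segment = locate(text)            # S102: keep the segment near the date's keywords
    candidates = ner_label(segment)   # S103: candidate date text segments
    return postprocess(candidates)    # S104/S105: classify, correct, pick target value

# Toy stand-ins so the pipeline runs end to end (hypothetical, for illustration):
result = extract_date(
    b"fake-image",
    preprocess=lambda img: img,
    ocr=lambda img: "Signing date: 2021-09-08. Effective date: 2022-01-01.",
    locate=lambda t: t.split(".")[0],
    ner_label=lambda seg: [seg.split(": ")[1]],
    postprocess=lambda cands: cands[0],
)
```

In a real system each lambda would be replaced by the corresponding module (image preprocessing, OCR engine, NER model, rule engine); the point here is only the ordering of the stages.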
  • the date to be extracted in this embodiment may be the signing date in a contract document, or another date such as the commencement/completion date or effective date, depending on the actual scenario.
  • the step S101 includes:
  • the detected seals and watermarks are removed with a generative adversarial network (GAN).
  • In the image-preprocessing stage, the orientation of the document image is corrected, and seals in the document image (mainly square seals, round seals, paging seals, stamp-tax seals, and other seals) and watermarks are detected; a GAN (generative adversarial network) then removes the detected seals or watermarks, reducing the interference of such noise with the accuracy of element extraction.
  • In addition, a detected seal can also serve as a feature for identifying the signature page.
  • the step S102 includes:
  • Since preprocessing greatly improves the accuracy and recognizability of the document image, text recognition can be performed on it with printed-text OCR.
  • During text recognition, the target text segment corresponding to the date to be extracted is obtained according to the associated information of that date (e.g., page information or keywords). Besides the date to be extracted, the document image may contain other interfering dates; for example, if the date to be extracted is the signing date, the interfering dates may be the commencement/completion date, the effective date, and so on. Meanwhile, the date to be extracted usually appears in a fixed position; the signing date, for instance, usually appears on three types of pages: the cover page, the first page, and the signature page. This embodiment therefore uses the page information of the date to be extracted as auxiliary information for locating, which improves the positioning accuracy of the target text segment.
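The keyword-based locating step described above can be sketched as follows. The keyword list and the window size are illustrative assumptions, not values from the patent; a real system would also filter by page type (cover page, first page, signature page).

```python
import re
from typing import Optional

# Hypothetical anchor keywords for the signing date (illustrative only).
KEYWORDS = ["签约日期", "签订日期", "signing date"]

def locate_target_segment(text: str, window: int = 30) -> Optional[str]:
    """Return the text window around the first date-associated keyword,
    or None if no keyword is found."""
    for kw in KEYWORDS:
        m = re.search(re.escape(kw), text, flags=re.IGNORECASE)
        if m:
            start = max(0, m.start() - window)
            return text[start:m.end() + window]   # slicing clamps past the end
    return None

seg = locate_target_segment("……甲方盖章。签约日期:2021年9月8日。乙方……")
```

The returned window is what would be handed to the NER stage as the target text segment.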
  • the step S103 includes:
  • the target feature is decoded by using a conditional random field to obtain a corresponding labeling sequence, and the labeling sequence is output as the date text segment.
  • the target text segment is marked based on NER technology to obtain the date text segment.
  • Specifically, the Bert pre-training model first extracts text features from the target text segment and builds feature vectors; a Bi-LSTM (bidirectional long short-term memory) network then extracts from them the target features required for entity recognition, and decoding based on a CRF (conditional random field) yields the date text segment.
  • the training sample set can be used to train and optimize the NER technology to improve labeling efficiency and accuracy.
  • For example, when labeling the signing date, 3,000 real contract samples are selected and the extracted text is augmented to about 300,000 samples; the whole NER training process is completed based on the Bert pre-training model + Bi-LSTM network + CRF, where the labels corresponding to the signing date are B_signdate (the initial character of the signing date) and I_signdate (the remaining characters of the signing date).
  • The trained NER model then yields the label for each text token, and the text segments whose predicted labels are B_signdate and I_signdate are extracted and returned as candidate signing-date values.
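Turning per-token B_signdate/I_signdate labels into candidate values is a plain BIO-style span decoding, sketched below. The trained model that produces the labels is assumed; only the decoding is shown.

```python
from typing import List

def extract_candidates(tokens: List[str], labels: List[str]) -> List[str]:
    """Collect the character spans labeled B_signdate / I_signdate."""
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B_signdate":            # start of a new signing-date span
            if current:
                spans.append("".join(current))
            current = [tok]
        elif lab == "I_signdate" and current:
            current.append(tok)            # continuation of the current span
        else:                              # any other label ends the span
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans

tokens = list("签约日期2021年9月8日生效")
labels = ["O"] * 4 + ["B_signdate"] + ["I_signdate"] * 8 + ["O"] * 2
candidates = extract_candidates(tokens, labels)
```

Each returned span is one candidate signing-date value for the classification and post-processing stage.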
  • the step S104 includes: steps S201-S204.
  • After the date text segment is recognized by printed-text OCR, the text boxes it contains are obtained, and a support vector machine (SVM) marks whether the image region corresponding to each text box is a handwritten image; if so, the text data in the handwritten image is recognized with handwritten OCR. Note that the recognition targets of this handwritten OCR and of the aforementioned printed-text OCR differ: handwritten OCR targets handwritten data, while printed-text OCR targets printed data.
  • step S104 also includes:
  • the date text is reviewed based on the scene of the date to be extracted.
  • Post-processing includes date-format verification, text error correction (mainly for visually similar characters), format unification, and so on.
  • Business rules are configured from customized requirements, e.g., when there are multiple signing dates, taking the latest one as the final target value.
  • Customized requirements mainly come from audit requirements, and different audit scenarios impose different requirements on contracts: for example, among multiple signing dates, take the latest date as the target value of the final signing-date field, or take the date on the cover page as that target value, and so on.
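A hedged sketch of this post-processing and business-rule step: candidate strings are first normalized to one date format (with invalid dates rejected, i.e., date-format verification), then the "take the latest date" rule is applied. The recognized formats and the rule itself are illustrative; real configurations vary per audit scenario.

```python
import re
from datetime import date
from typing import List, Optional

def normalize(text: str) -> Optional[date]:
    """Unify forms like '2021年9月8日', '2021-9-8', '2021/09/08' into a date."""
    m = re.search(r"(\d{4})\s*[年/.-]\s*(\d{1,2})\s*[月/.-]\s*(\d{1,2})", text)
    if not m:
        return None
    try:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:          # e.g. month 13: fails date-format verification
        return None

def pick_latest(candidates: List[str]) -> Optional[date]:
    """Business rule: among multiple signing dates, keep the latest one."""
    dates = [d for d in (normalize(c) for c in candidates) if d]
    return max(dates) if dates else None

target = pick_latest(["2021年9月8日", "2021-09-24", "not a date"])
```

Swapping `pick_latest` for a "take the cover-page date" rule would only change the selection function, not the normalization.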
  • performing date format verification, text error correction and unified format processing on the date text segment includes:
  • the error correction score probability value of the date text is calculated by using the N-Grams model, and the date text is corrected based on the error correction score probability value.
  • N-gram is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the content of the date text, byte by byte, forming a sequence of byte fragments of length N.
  • Each byte fragment is called a gram; the frequency of every gram is counted and filtered against a preset threshold to form a list of key grams, which constitutes the vector feature space of the date text, each gram in the list being one feature-vector dimension.
  • Under the model's assumption, the occurrence of the N-th word depends only on the preceding N-1 words and on no other word, so the probability of the whole sentence is the product of the occurrence probabilities of its words; these probabilities can be obtained by directly counting co-occurrences of N words in the corpus.
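A minimal sketch of such N-gram scoring under the Markov assumption just described, here with N = 2 at the character level. The toy corpus and the add-one smoothing are assumptions for illustration, not details from the patent.

```python
from collections import Counter
from typing import List

class BigramScorer:
    """Score a character sequence as the product of P(c_i | c_{i-1}),
    with probabilities estimated by counting pairs in a corpus (N = 2)."""
    def __init__(self, corpus: List[str]):
        self.unigrams = Counter(c for s in corpus for c in s)
        self.bigrams = Counter(s[i:i + 2] for s in corpus
                               for i in range(len(s) - 1))
        self.vocab = len(self.unigrams) or 1

    def prob(self, text: str) -> float:
        p = 1.0
        for i in range(len(text) - 1):
            # add-one smoothing so unseen pairs get a small nonzero probability
            num = self.bigrams[text[i:i + 2]] + 1
            den = self.unigrams[text[i]] + self.vocab
            p *= num / den
        return p

scorer = BigramScorer(["2021年9月8日", "2021年9月24日"])
# A well-formed date scores higher than one with a confused character
# ('O' instead of '0'), which is the basis for error-correction scoring.
good = scorer.prob("2021年9月8日")
bad = scorer.prob("2O21年9月8日")
```

In an error-correction setting, candidate substitutions for visually similar characters would be generated and the highest-scoring variant kept.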
  • FIG. 3 is a schematic block diagram of a date extraction device 300 provided in an embodiment of the present application.
  • the device 300 includes:
  • a preprocessing unit 301 configured to acquire a file image containing a date to be extracted, and perform preprocessing on the file image;
  • the first obtaining unit 302 is configured to carry out OCR recognition on the preprocessed document image, and obtain the target text segment containing the date to be extracted in combination with the associated information of the date to be extracted;
  • the labeling unit 303 is configured to use NER technology to label the target text segment, and output the date text segment;
  • a post-processing unit 304 configured to classify and predict the date text segment through a classification model, and correct and post-process the date text segment based on the classification prediction result;
  • the date extraction unit 305 is configured to obtain the target element of the date to be extracted according to the correction and post-processing results, and extract the date according to the target element.
  • the preprocessing unit 301 includes:
  • a correction unit configured to perform direction correction processing on the document image
  • a detection unit configured to detect a seal or watermark in the document image using Yolov5 technology
  • the removal unit is configured to remove the detected seals and watermarks with a generative adversarial network.
  • the first acquiring unit 302 includes:
  • a character recognition unit configured to perform character recognition on the document image through printed OCR technology
  • the positioning unit is configured to locate the associated information of the date to be extracted based on the text-recognition result and use the located result as the target text segment, wherein the associated information is the page information corresponding to the date to be extracted or keyword information associated with that date.
  • the labeling unit 303 includes:
  • the first extraction unit is configured to extract text features from the target text segment with the Bert pre-training model;
  • the second extraction unit is used to extract the target features required for entity recognition in the text features through the Bi-LSTM network;
  • the decoding output unit is configured to decode the target feature by using a conditional random field to obtain a corresponding labeling sequence, and output the labeling sequence as the date text segment.
  • the post-processing unit 304 includes:
  • the second acquiring unit 401 is configured to acquire a corresponding text box in the date text segment
  • a judging unit 402 configured to perform binary classification processing on each text box using a support vector machine, to judge whether the text box is a handwritten image
  • the handwriting recognition unit 403 is used to recognize the handwriting image through handwriting OCR technology if the text box is determined to be a handwriting image, and correct and post-process the recognition result;
  • the correction unit 404 is configured to continue to correct and post-process the date text segment if it is determined that the text box is not a handwritten image.
  • the post-processing unit 304 further includes:
  • a verification processing unit configured to perform date format verification, text error correction and unified format processing on the date text segment
  • An auditing unit configured to audit the date text based on the scene where the date to be extracted is located.
  • the verification processing unit includes:
  • a probability calculation unit configured to calculate the error correction score probability value of the date text by using the N-Grams model, and correct the date text based on the error correction score probability value.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed, the steps provided in the above-mentioned embodiments can be realized.
  • the storage medium may include: a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
  • the embodiment of the present application also provides a computer device, which may include a memory and a processor.
  • a computer program is stored in the memory.
  • the processor invokes the computer program in the memory, the steps provided in the above embodiments can be implemented.
  • the computer equipment may also include components such as various network interfaces and power supplies.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)

Abstract

The present application discloses a date extraction method and apparatus, a computer device, and a storage medium. The method includes: acquiring a document image containing the date to be extracted, and preprocessing the document image; performing OCR recognition on the document image, and obtaining the target text segment containing the date to be extracted in combination with the associated information of that date; labeling the target text segment with NER technology, and outputting the date text segment; performing classification prediction on the date text segment with a classification model, and correcting and post-processing the date text segment based on the classification prediction results; obtaining the target element of the date to be extracted from the correction and post-processing results, and extracting the date according to the target element. By locating the text segment containing the date to be extracted in combination with its associated information, and by recognizing and labeling the document image or text segments with OCR recognition and NER technology, the present application improves the precision and efficiency of date extraction.

Description

Date extraction method and apparatus, computer device and storage medium
This application is based on, and claims priority to, Chinese patent application No. 202111049925.1 filed on September 8, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a date extraction method and apparatus, a computer device, and a storage medium.
Background
In the review of various contracts, the materials to be processed manually often have two distinctive features: (1) the contract type and the elements it covers vary by industry, including but not limited to real estate, medical care, manufacturing, and procurement, which raises the threshold for manual review of the relevant materials and increases the difficulty of the review work; (2) there are many similar elements, mixed with handwriting, other seals, watermarks, and other interfering information, which makes precise element extraction harder. Methods for extracting the various dates in a contract generally fall into two types:
The first sorts out positioning rules for keywords or key sentences based on business logic, then matches the required date formats with regular expressions and similar techniques to obtain the final candidate dates; when there are multiple candidates, the final target element value is selected according to the relevant business rules.
The second, more widely used approach combines deep learning for date element extraction, i.e., the target value of the date is predicted by a deep learning model.
The drawback of the first existing method mentioned above is that, although the precision of date extraction can be guaranteed to some extent, the method has almost no robustness: with a different contract style, or a different contextual expression of the date, the extraction no longer meets expectations.
As for the second existing method mentioned above, contracts contain many date-type elements, such as the commencement date, completion date, signing date, and validity period, and some date-type elements frequently occur more than once; this makes it hard for the model to identify the true target element, resulting in poor extraction precision.
Summary
Embodiments of the present application provide a date extraction method and apparatus, a computer device, and a storage medium, aiming to improve the precision and efficiency of date extraction.
In a first aspect, an embodiment of the present application provides a date extraction method, including:
acquiring a document image containing the date to be extracted, and preprocessing the document image;
performing OCR recognition on the preprocessed document image, and obtaining the target text segment containing the date to be extracted in combination with the associated information of that date;
labeling the target text segment with NER technology, and outputting the date text segment;
performing classification prediction on the date text segment with a classification model, and correcting and post-processing the date text segment based on the classification prediction results;
obtaining the target element of the date to be extracted from the correction and post-processing results, and extracting the date according to the target element.
In a second aspect, an embodiment of the present application provides a date extraction apparatus, including:
a preprocessing unit, configured to acquire a document image containing the date to be extracted and preprocess the document image;
a first acquisition unit, configured to perform OCR recognition on the preprocessed document image and obtain the target text segment containing the date to be extracted in combination with the associated information of that date;
a labeling unit, configured to label the target text segment with NER technology and output the date text segment;
a post-processing unit, configured to perform classification prediction on the date text segment with a classification model, and to correct and post-process the date text segment based on the classification prediction results;
a date extraction unit, configured to obtain the target element of the date to be extracted from the correction and post-processing results and extract the date according to the target element.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the date extraction method described in the first aspect is implemented.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the date extraction method described in the first aspect is implemented.
An embodiment of the present application provides a date extraction method and apparatus, a computer device, and a storage medium. The method includes: acquiring a document image containing the date to be extracted, and preprocessing the document image; performing OCR recognition on the preprocessed document image, and obtaining the target text segment containing the date to be extracted in combination with the associated information of that date; labeling the target text segment with NER technology, and outputting the date text segment; performing classification prediction on the date text segment with a classification model, and correcting and post-processing the date text segment based on the classification prediction results; obtaining the target element of the date to be extracted from the correction and post-processing results, and extracting the date according to the target element. The embodiment of the present application locates the text segment containing the date to be extracted in combination with the associated information of that date, and uses OCR recognition and NER technology to recognize and label the document image or text segments, so that the target element of the date to be extracted can be obtained accurately; extracting the date from this element improves the precision and efficiency of date extraction.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a date extraction method provided by an embodiment of the present application;
Fig. 2 is a schematic sub-flowchart of a date extraction method provided by an embodiment of the present application;
Fig. 3 is a schematic block diagram of a date extraction apparatus provided by an embodiment of the present application;
Fig. 4 is a schematic sub-block diagram of a date extraction apparatus provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "including" and "comprising" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or combinations thereof.
It should also be understood that the terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit the present application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a date extraction method provided by an embodiment of the present application, which specifically includes steps S101 to S105.
S101: acquire a document image containing the date to be extracted, and preprocess the document image;
S102: perform OCR recognition on the preprocessed document image, and obtain the target text segment containing the date to be extracted in combination with the associated information of that date;
S103: label the target text segment with NER technology, and output the date text segment;
S104: perform classification prediction on the date text segment with a classification model, and correct and post-process the date text segment based on the classification prediction results;
S105: obtain the target element of the date to be extracted from the correction and post-processing results, and extract the date according to the target element.
In this embodiment, preprocessing the document image containing the date to be extracted removes interfering factors such as noise; OCR recognition then identifies the document data in the image, and the target text segment containing the date to be extracted is extracted from the recognized data in combination with the associated information of that date. NER (Named Entity Recognition) technology then labels the target text segment to further obtain the date text segment; on this basis, a classification model performs classification prediction on the date text segment, and the classification prediction results are corrected and post-processed, yielding the target element corresponding to the date to be extracted, from which the corresponding date can be extracted.
This embodiment locates the text segment containing the date to be extracted in combination with the associated information of that date, which reduces interference from the document's other date elements and improves accuracy; OCR recognition and NER technology recognize and label the document image or text segments, so that the target element of the date to be extracted can be obtained accurately, improving both the precision and the efficiency of date extraction. The date to be extracted in this embodiment may be the signing date in a contract document, or another date such as the commencement/completion date or effective date, depending on the actual scenario.
In an embodiment, step S101 includes:
performing orientation correction on the document image;
detecting seals or watermarks in the document image with Yolov5;
removing the detected seals and watermarks with a generative adversarial network.
In this embodiment, during the image-preprocessing stage, the orientation of the document image is corrected, and seals in the document image (mainly square seals, round seals, paging seals, stamp-tax seals, and other seals) and watermarks are detected; after a seal or watermark is detected, it is removed with a GAN (generative adversarial network), reducing the interference of such noise with the accuracy of element extraction. In addition, a detected seal can also serve as a feature for identifying the signature page. In a specific application scenario, about 500 real stamped contract samples were selected and a script generated 100,000+ seal images; seal removal based on the GAN achieved a correction rate of 34% (correction rate = text-recognition accuracy after seal removal minus text-recognition accuracy before seal removal).
In an embodiment, step S102 includes:
performing text recognition on the document image with printed-text OCR;
locating the associated information of the date to be extracted based on the text-recognition result, and using the located result as the target text segment, wherein the associated information is the page information corresponding to the date to be extracted or keyword information associated with that date.
In this embodiment, since preprocessing greatly improves the accuracy and recognizability of the document image, text recognition can be performed with printed-text OCR. During text recognition, the target text segment corresponding to the date to be extracted is obtained according to the associated information of that date (e.g., page information or keywords). Besides the date to be extracted, the document image may contain other interfering dates; for example, if the date to be extracted is the signing date, the interfering dates may be the commencement/completion date, the effective date, and so on. Meanwhile, the date to be extracted usually appears in a fixed position; the signing date, for instance, usually appears on three types of pages: the cover page, the first page, and the signature page. This embodiment therefore uses the page information of the date to be extracted as auxiliary information for locating, which improves the positioning accuracy of the target text segment.
In an embodiment, step S103 includes:
extracting text features from the target text segment with the Bert pre-training model;
extracting the target features required for entity recognition from the text features through a Bi-LSTM network;
decoding the target features with a conditional random field to obtain the corresponding label sequence, and outputting the label sequence as the date text segment.
In this embodiment, the target text segment is labeled based on NER technology to obtain the date text segment. Specifically, the Bert pre-training model first extracts text features from the target text segment and builds feature vectors; a Bi-LSTM network (bidirectional long short-term memory recurrent neural network) then extracts target features from these vectors, and decoding based on a conditional random field (CRF) yields the date text segment.
Of course, before prediction and labeling, a training sample set can be used to train and optimize the NER model to improve labeling efficiency and precision. For example, when labeling the signing date, 3,000 real contract samples are selected and the extracted text is augmented to about 300,000 samples; the whole NER training process is completed based on the Bert pre-training model + Bi-LSTM network + CRF, where the labels corresponding to the signing date are B_signdate (the initial character of the signing date) and I_signdate (the remaining characters of the signing date). The trained NER model then yields the label for each text token, and the text segments whose predicted labels are B_signdate and I_signdate are extracted and returned as candidate signing-date values.
In an embodiment, as shown in Fig. 2, step S104 includes steps S201 to S204.
S201: obtain the corresponding text boxes in the date text segment;
S202: perform binary classification on each text box with a support vector machine, to determine whether the text box is a handwritten image;
S203: if the text box is determined to be a handwritten image, recognize the handwritten image with handwritten OCR, and correct and post-process the recognition result;
S204: if the text box is determined not to be a handwritten image, continue to correct and post-process the date text segment.
In this embodiment, after the date text segment is recognized by printed-text OCR, the text boxes it contains are obtained, and a support vector machine (SVM) marks whether the image region corresponding to each text box is a handwritten image. If it is, the text data in the handwritten image is recognized with handwritten OCR. It should be understood that the recognition targets of the handwritten OCR described here and of the aforementioned printed-text OCR differ: handwritten OCR targets handwritten data, while printed-text OCR targets printed data.
In an embodiment, step S104 further includes:
performing date-format verification, text error correction, and format unification on the date text segment;
auditing the date text based on the scenario of the date to be extracted.
In this embodiment, in the post-processing and business-rule stage, post-processing includes date-format verification, text error correction (mainly for visually similar characters), format unification, and so on, while business rules are configured from customized requirements, e.g., when there are multiple signing dates, taking the latest one as the final target value. Customized requirements mainly come from audit requirements, and different audit scenarios impose different requirements on contracts: for example, among multiple signing dates, take the latest date as the target value of the final signing-date field, or take the date on the cover page as that target value, and so on.
In an embodiment, performing date-format verification, text error correction, and format unification on the date text segment includes:
calculating an error-correction score probability value of the date text with an N-gram model, and correcting the date text based on the error-correction score probability value.
In this embodiment, N-gram is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the content of the date text, byte by byte, forming a sequence of byte fragments of length N. Each fragment is called a gram; the frequency of every gram is counted and filtered against a preset threshold to form a list of key grams, which constitutes the vector feature space of the date text, each gram in the list being one feature-vector dimension. The occurrence of the N-th word depends only on the preceding N-1 words and on no other word, so the probability of the whole sentence is the product of the occurrence probabilities of its words; these probabilities can be obtained by directly counting co-occurrences of N words in the corpus.
Fig. 3 is a schematic block diagram of a date extraction apparatus 300 provided by an embodiment of the present application; the apparatus 300 includes:
a preprocessing unit 301, configured to acquire a document image containing the date to be extracted and preprocess the document image;
a first acquisition unit 302, configured to perform OCR recognition on the preprocessed document image and obtain the target text segment containing the date to be extracted in combination with the associated information of that date;
a labeling unit 303, configured to label the target text segment with NER technology and output the date text segment;
a post-processing unit 304, configured to perform classification prediction on the date text segment with a classification model, and to correct and post-process the date text segment based on the classification prediction results;
a date extraction unit 305, configured to obtain the target element of the date to be extracted from the correction and post-processing results and extract the date according to the target element.
In an embodiment, the preprocessing unit 301 includes:
a correction unit, configured to perform orientation correction on the document image;
a detection unit, configured to detect seals or watermarks in the document image with Yolov5;
a removal unit, configured to remove the detected seals and watermarks with a generative adversarial network.
In an embodiment, the first acquisition unit 302 includes:
a text recognition unit, configured to perform text recognition on the document image with printed-text OCR;
a locating unit, configured to locate the associated information of the date to be extracted based on the text-recognition result and use the located result as the target text segment, wherein the associated information is the page information corresponding to the date to be extracted or keyword information associated with that date.
In an embodiment, the labeling unit 303 includes:
a first extraction unit, configured to extract text features from the target text segment with the Bert pre-training model;
a second extraction unit, configured to extract the target features required for entity recognition from the text features through a Bi-LSTM network;
a decoding output unit, configured to decode the target features with a conditional random field to obtain the corresponding label sequence and output the label sequence as the date text segment.
In an embodiment, as shown in Fig. 4, the post-processing unit 304 includes:
a second acquisition unit 401, configured to obtain the corresponding text boxes in the date text segment;
a judging unit 402, configured to perform binary classification on each text box with a support vector machine, to determine whether the text box is a handwritten image;
a handwriting recognition unit 403, configured to recognize the handwritten image with handwritten OCR if the text box is determined to be a handwritten image, and to correct and post-process the recognition result;
a correction unit 404, configured to continue correcting and post-processing the date text segment if the text box is determined not to be a handwritten image.
In an embodiment, the post-processing unit 304 further includes:
a verification processing unit, configured to perform date-format verification, text error correction, and format unification on the date text segment;
an auditing unit, configured to audit the date text based on the scenario of the date to be extracted.
In an embodiment, the verification processing unit includes:
a probability calculation unit, configured to calculate an error-correction score probability value of the date text with an N-gram model and correct the date text based on the error-correction score probability value.
Since the embodiments of the apparatus part correspond to those of the method part, for the embodiments of the apparatus part please refer to the description of the embodiments of the method part, which is not repeated here.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed, the steps provided by the above embodiments can be implemented. The storage medium may include: a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
An embodiment of the present application further provides a computer device, which may include a memory and a processor; a computer program is stored in the memory, and when the processor invokes the computer program in the memory, the steps provided by the above embodiments can be implemented. Of course, the computer device may also include components such as various network interfaces and power supplies.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to each other. Since the system disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively simple, and the relevant parts can be found in the description of the method. It should be noted that, for those of ordinary skill in the art, several improvements and modifications can be made to the present application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present application.
It should also be noted that, in this specification, relational terms such as first and second are only used to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.

Claims (10)

  1. A date extraction method, characterized by comprising:
    acquiring a document image containing the date to be extracted, and preprocessing the document image;
    performing OCR recognition on the preprocessed document image, and obtaining the target text segment containing the date to be extracted in combination with the associated information of that date;
    labeling the target text segment with NER technology, and outputting the date text segment;
    performing classification prediction on the date text segment with a classification model, and correcting and post-processing the date text segment based on the classification prediction results;
    obtaining the target element of the date to be extracted from the correction and post-processing results, and extracting the date according to the target element.
  2. The date extraction method according to claim 1, wherein acquiring a document image containing the date to be extracted and preprocessing the document image comprises:
    performing orientation correction on the document image;
    detecting seals or watermarks in the document image with Yolov5;
    removing the detected seals and watermarks with a generative adversarial network.
  3. The date extraction method according to claim 1, wherein performing OCR recognition on the preprocessed document image and obtaining the target text segment containing the date to be extracted in combination with the associated information of that date comprises:
    performing text recognition on the document image with printed-text OCR;
    locating the associated information of the date to be extracted based on the text-recognition result, and using the located result as the target text segment, wherein the associated information is the page information corresponding to the date to be extracted or keyword information associated with that date.
  4. The date extraction method according to claim 1, wherein labeling the target text segment with NER technology and outputting the date text segment comprises:
    extracting text features from the target text segment with the Bert pre-training model;
    extracting the target features required for entity recognition from the text features through a Bi-LSTM network;
    decoding the target features with a conditional random field to obtain the corresponding label sequence, and outputting the label sequence as the date text segment.
  5. The date extraction method according to claim 1, wherein performing classification prediction on the date text segment with a classification model and correcting and post-processing the date text segment based on the classification prediction results comprises:
    obtaining the corresponding text boxes in the date text segment;
    performing binary classification on each text box with a support vector machine, to determine whether the text box is a handwritten image;
    if the text box is determined to be a handwritten image, recognizing the handwritten image with handwritten OCR, and correcting and post-processing the recognition result;
    if the text box is determined not to be a handwritten image, continuing to correct and post-process the date text segment.
  6. The date extraction method according to claim 1, wherein performing classification prediction on the date text segment with a classification model and correcting and post-processing the date text segment based on the classification prediction results further comprises:
    performing date-format verification, text error correction, and format unification on the date text segment;
    auditing the date text based on the scenario of the date to be extracted.
  7. The date extraction method according to claim 6, wherein performing date-format verification, text error correction, and format unification on the date text segment comprises:
    calculating an error-correction score probability value of the date text with an N-gram model, and correcting the date text based on the error-correction score probability value.
  8. A date extraction apparatus, characterized by comprising:
    a preprocessing unit, configured to acquire a document image containing the date to be extracted and preprocess the document image;
    a first acquisition unit, configured to perform OCR recognition on the preprocessed document image and obtain the target text segment containing the date to be extracted in combination with the associated information of that date;
    a labeling unit, configured to label the target text segment with NER technology and output the date text segment;
    a post-processing unit, configured to perform classification prediction on the date text segment with a classification model and to correct and post-process the date text segment based on the classification prediction results;
    a date extraction unit, configured to obtain the target element of the date to be extracted from the correction and post-processing results and extract the date according to the target element.
  9. A computer device, characterized by comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the date extraction method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the date extraction method according to any one of claims 1 to 7 is implemented.
PCT/CN2021/120040 2021-09-08 2021-09-24 Date extraction method and apparatus, computer device and storage medium WO2023035332A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111049925.1A CN113762160A (zh) 2021-09-08 2021-09-08 Date extraction method and apparatus, computer device and storage medium
CN202111049925.1 2021-09-08

Publications (1)

Publication Number Publication Date
WO2023035332A1 true WO2023035332A1 (zh) 2023-03-16

Family

ID=78793903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120040 WO2023035332A1 (zh) 2021-09-08 2021-09-24 Date extraction method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN113762160A (zh)
WO (1) WO2023035332A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444508A (zh) * 2022-01-29 2022-05-06 北京有竹居网络技术有限公司 Date recognition method and apparatus, readable medium and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902303A (zh) * 2019-03-01 2019-06-18 腾讯科技(深圳)有限公司 Entity recognition method and related device
CN110648211A (zh) * 2018-06-07 2020-01-03 埃森哲环球解决方案有限公司 Data validation
CN111160335A (zh) * 2020-01-02 2020-05-15 腾讯科技(深圳)有限公司 Artificial-intelligence-based image watermark processing method and apparatus, and electronic device
WO2021012570A1 (zh) * 2019-07-22 2021-01-28 深圳壹账通智能科技有限公司 Data entry method, apparatus, device and storage medium
CN112712085A (zh) * 2020-12-28 2021-04-27 哈尔滨工业大学 Method for extracting dates from multilingual PDF documents
US20210256160A1 (en) * 2020-02-19 2021-08-19 Harrison-Ai Pty Ltd Method and system for automated text anonymisation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10489502B2 (en) * 2017-06-30 2019-11-26 Accenture Global Solutions Limited Document processing
CN113032586B (zh) * 2021-03-19 2023-11-03 京东科技控股股份有限公司 Method and apparatus for extracting time information from text, and electronic device


Also Published As

Publication number Publication date
CN113762160A (zh) 2021-12-07

Similar Documents

Publication Publication Date Title
CN110209823B (zh) 一种多标签文本分类方法及系统
US20230129874A1 (en) Pre-trained contextual embedding models for named entity recognition and confidence prediction
CN112801010B (zh) 一种针对实际ocr场景下的视觉富文档信息抽取方法
Fischer et al. Lexicon-free handwritten word spotting using character HMMs
Rodríguez-Serrano et al. Handwritten word-spotting using hidden Markov models and universal vocabularies
CN111160343B (zh) 一种基于Self-Attention的离线数学公式符号识别方法
AU2019219746A1 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
RU2613846C2 (ru) Метод и система извлечения данных из изображений слабоструктурированных документов
US11914963B2 (en) Systems and methods for determining and using semantic relatedness to classify segments of text
CN113033183B (zh) 一种基于统计量与相似性的网络新词发现方法及系统
CN116432655B (zh) 基于语用知识学习的少样本命名实体识别方法和装置
CN112926345A (zh) 基于数据增强训练的多特征融合神经机器翻译检错方法
CN111783710B (zh) 医药影印件的信息提取方法和系统
Hasnat et al. A high performance domain specific OCR for Bangla script
CN112464845A (zh) 票据识别方法、设备及计算机存储介质
WO2023035332A1 (zh) 一种日期提取方法、装置、计算机设备及存储介质
CN114416991A (zh) 一种基于prompt的文本情感原因分析方法和系统
CN112989839A (zh) 一种基于关键词特征嵌入语言模型的意图识别方法及系统
US20230110931A1 (en) Method and Apparatus for Data Structuring of Text
CN115481637A (zh) 基于uc-flat的交通肇事案件法律文书命名实体识别方法
CN111400606B (zh) 一种基于全局和局部信息抽取的多标签分类方法
CN111461109B (zh) 一种基于环境多种类词库识别单据的方法
Kumari et al. Page level input for handwritten text recognition in document images
CN111797612A (zh) 一种自动化数据功能项抽取的方法
CN111078869A (zh) 基于神经网络对金融网站进行分类的方法及装置

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE