CN115617951A - Contract information extraction method, contract information extraction device, computer apparatus, contract information extraction medium, and program product - Google Patents

Contract information extraction method, contract information extraction device, computer apparatus, contract information extraction medium, and program product

Info

Publication number
CN115617951A
Authority
CN
China
Prior art keywords
text data
data
extraction
contract information
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211360785.4A
Other languages
Chinese (zh)
Inventor
胡诗雨
石明
王巍
李捷
厉超
涂洪健
徐柯文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Shanghai Pudong Development Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Pudong Development Bank Co Ltd
Priority to CN202211360785.4A
Publication of CN115617951A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • Technology Law (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Character Discrimination (AREA)

Abstract

The application relates to a contract information extraction method, apparatus, computer device, medium, and program product. Image data to be subjected to information extraction is acquired; text extraction is performed on the image data to obtain target text data; a trained extraction model is then obtained; and information extraction is performed on the target text data according to the extraction model to obtain contract information. Contract information can thereby be extracted automatically, which increases extraction speed while ensuring the accuracy of the extracted contract information.

Description

Contract information extraction method, contract information extraction device, computer apparatus, contract information extraction medium, and program product
Technical Field
The present application relates to the field of electronic image contract processing technologies, and in particular, to a contract information extraction method, apparatus, computer device, medium, and program product.
Background
As modernization and informatization accelerate and office requirements continue to rise, the concept of the paperless office has spread to every industry. Enterprises need to archive large numbers of contract images and to record the key information they contain, so that contracts can later be looked up and audited.
At present, contract information is generally extracted by dedicated staff: data-entry personnel must first understand what the key information is, then screen the contract for it and enter it, and additional personnel are assigned to check the extracted key information in order to avoid extraction errors.
However, when a contract is long, reading it and searching for information increases the manual entry workload, and prolonged reading may cause visual and mental fatigue, which leads to inaccurate results.
Disclosure of Invention
Based on this, it is necessary to provide a contract information extraction method, apparatus, computer device, medium, and program product capable of accurately extracting contract information in view of the above technical problems.
In a first aspect, the present application provides a contract information extraction method, including:
acquiring image data to be subjected to information extraction;
performing text extraction on the image data to obtain target text data;
obtaining a trained extraction model;
and performing information extraction on the target text data according to the extraction model to obtain contract information.
In one embodiment, performing text extraction on the image data to obtain the target text data includes:
identifying initial text data corresponding to the image data;
sorting the characters in the initial text data to obtain sorted text data;
judging whether the sorted text data meets a semantic requirement;
if the sorted text data meets the semantic requirement, using the sorted text data as the text data corresponding to the image data;
and if the sorted text data does not meet the semantic requirement, returning to the step of identifying the initial text data corresponding to the image data.
In one embodiment, the sorting the characters in the initial text data includes:
acquiring character coordinates corresponding to each character in initial text data;
and sorting the characters according to the character coordinates.
In one embodiment, before the obtaining the trained extraction model, the method further includes:
acquiring sample image data;
performing text extraction on the sample image data to obtain sample text data;
preprocessing the sample text data to obtain preprocessed text data, wherein the preprocessing comprises at least one of data cleaning, data conversion and semantic screening;
performing data labeling on the preprocessed text data to obtain the labeling category corresponding to each character in the preprocessed text data;
and performing model training according to the labeling categories and the corresponding preprocessed text data to obtain the extraction model.
In one embodiment, the performing model training according to the labeling category and the corresponding preprocessed text data to obtain an extraction model includes:
inputting the preprocessed text data into the extraction model to obtain prediction categories, wherein the prediction categories comprise data categories corresponding to all characters in the preprocessed text data;
acquiring a category error corresponding to the prediction category and the labeling category;
and performing parameter optimization on the extraction model according to the category error until the category error meets a preset requirement.
In one embodiment, performing information extraction on the target text data according to the extraction model to obtain the contract information includes:
inputting the target text data into the extraction model to obtain the target data category corresponding to each character in the target text data;
and acquiring the target text data whose target data category is the same as a preset category, to obtain the contract information.
In a second aspect, the present application also provides a contract information extraction apparatus, including:
an image acquisition module, configured to acquire image data to be subjected to information extraction;
a text acquisition module, configured to perform text extraction on the image data to obtain target text data;
a model acquisition module, configured to obtain a trained extraction model;
and an information extraction module, configured to perform information extraction on the target text data according to the extraction model to obtain contract information.
In a third aspect, the present application further provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any embodiment of the first aspect when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any embodiment of the first aspect.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any embodiment of the first aspect.
With the above contract information extraction method, apparatus, computer device, medium, and program product, image data to be subjected to information extraction is acquired, text extraction is performed on the image data to obtain target text data, a trained extraction model is obtained, and information extraction is performed on the target text data according to the extraction model to obtain contract information. Contract information can thus be extracted automatically, which increases extraction speed while ensuring the accuracy of the extracted contract information.
Drawings
FIG. 1 is a diagram of an application environment of a contract information extraction method in one embodiment;
FIG. 2 is a schematic flow chart diagram of a contract information extraction method in one embodiment;
FIG. 3 is a flowchart illustrating the step S202 in the embodiment shown in FIG. 2;
FIG. 4 is a schematic flow chart illustrating a contract information extraction method according to the embodiment shown in FIG. 2;
FIG. 5 is a flowchart illustrating the step S405 in the embodiment shown in FIG. 4;
FIG. 6 is a schematic flow chart diagram illustrating an intelligent contract information extraction method according to an embodiment;
FIG. 7 is a block diagram of the intelligent contract information extraction system in the embodiment shown in FIG. 6;
FIG. 8 is a block diagram showing the structure of a contract information extraction apparatus according to one embodiment;
FIG. 9 is a diagram of an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The contract information extraction method provided by the embodiments of the application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 over a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. The terminal 102 receives image data to be subjected to information extraction and sends it to the server 104, and the server 104 performs text extraction on the image data to obtain target text data. The data storage system stores the trained extraction model, and the server 104 further performs information extraction on the target text data according to the extraction model to obtain contract information. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, or similar device. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a contract information extraction method is provided, including the following steps:
S201: acquiring image data to be subjected to information extraction.
The image data is a contract image, that is, the full content of a contract represented in image form. Contract images can be captured by optical equipment such as a camera and recorded on paper, film, or other light-sensitive media. With the development of digital acquisition technology and signal processing theory, more and more images are stored digitally, so the term "image" here in practice refers to a digital image. In practical applications, contract image data may be obtained over a network, for example from a network platform or from electronic data sources such as an online library; within an enterprise, the enterprise's own contract image data may be used. However, not all contract images acquired over a network are clear, so poor-quality contract images, for example distorted or blurred ones, generally need to be removed manually; the contract images that remain can be used for information extraction.
S202: performing text extraction on the image data to obtain target text data.
Performing text extraction on the image data means converting the image data into text data. Specifically, text features in the image can be identified through image recognition technology. Image recognition is based on the main features of an image, and complex images can only be recognized through several levels of information processing within the recognition system. A familiar figure is recognized as a single unit by grasping its main features, without attention to its details; such an integral unit composed of otherwise isolated elements is called a block, and each block is perceived as a whole. In recognizing written material, not only can the strokes or components of a Chinese character be combined into a block, but frequently occurring words or phrases can also be recognized as block units. In practical applications, contract images are often stored as multi-page files in Portable Document Format (PDF) or Tagged Image File Format (TIFF), so a file needs to be split into single pages, each saved as a picture. Optical Character Recognition (OCR) is then used to recognize each picture and extract its text content, converting the image data into text data.
OCR refers to the process in which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer characters using a character recognition method. In other words, for printed characters, the text of a paper document is optically converted into a black-and-white dot-matrix image file, and recognition software converts the characters in the image into a text format that word processing software can further edit and process.
S203: obtaining a trained extraction model.
The extraction model is trained in advance. First, the characters in text data are labeled, each with one of several categories, and a portion of the text data is selected as sample data for training the extraction model. Specifically, the sample data is input into the extraction model, which outputs a prediction category for each character; the prediction category expresses the probability that the character belongs to one of the labeled categories. The prediction categories of the characters are compared with their labeled categories to obtain the current model error, and the parameters of the extraction model are adjusted according to this error until training is complete. In practical applications, a training stop condition can be set to decide when training is complete. For example, training may be considered complete when the error falls below a certain value; alternatively, an evaluation function can be used to compute a model score after each training round and compare it with the scores from previous rounds, and the model with the highest score is taken as the trained extraction model.
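The second stop condition described above, keeping the model with the highest evaluation score, can be sketched as follows. This is a minimal illustration only: `train_one_round`, `evaluate`, and the toy scores are hypothetical stand-ins, since the source does not name an evaluation function or a model type.

```python
from typing import Callable, Optional, Tuple


def train_and_select_best(
    train_one_round: Callable[[int], dict],
    evaluate: Callable[[dict], float],
    rounds: int,
) -> Tuple[Optional[dict], float]:
    """Train for a fixed number of rounds and keep the best-scoring model."""
    best_model, best_score = None, float("-inf")
    for round_index in range(rounds):
        model = train_one_round(round_index)  # hypothetical training step
        score = evaluate(model)               # hypothetical evaluation function
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score


# Toy stand-ins so the sketch runs: each "model" only records its round number,
# and the scores are made up for illustration.
scores_by_round = [0.61, 0.78, 0.74]
best, score = train_and_select_best(
    lambda i: {"round": i},
    lambda m: scores_by_round[m["round"]],
    rounds=3,
)
```

With these made-up scores the model from the second round is kept, because a later round scoring lower does not replace it.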
S204: performing information extraction on the target text data according to the extraction model to obtain contract information.
The target text data is used as the input of the extraction model, which outputs, for each character in the target text data, the probability that the character belongs to each preset category; the preset categories are all the categories used in data labeling. Once the category of text to be extracted is determined, the characters that satisfy the requirement can be taken from the model's output. For example, suppose the probability that the first character in the target text data belongs to the first category is eighty percent, and the probability that the second character belongs to the first category is sixty percent. If characters of the first category are to be extracted, a corresponding condition can be set; if a probability above seventy percent is taken as satisfying the condition, the extracted information is the first character.
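The threshold example above can be expressed in a few lines. This is a sketch under stated assumptions: the function name, the dictionary layout of the probabilities, and the placeholder characters "A" and "B" are illustrative and do not come from the source.

```python
def extract_category(chars, probabilities, category, threshold=0.7):
    """Return the characters whose probability for `category` exceeds `threshold`."""
    return [
        ch
        for ch, probs in zip(chars, probabilities)
        if probs.get(category, 0.0) > threshold
    ]


# Placeholder characters; the probabilities mirror the example in the text.
chars = ["A", "B"]
probabilities = [{"first": 0.8}, {"first": 0.6}]
extracted = extract_category(chars, probabilities, "first")
```

Only the first character passes the seventy percent condition, matching the example in the text.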
In the above contract information extraction method, image data to be subjected to information extraction is acquired, text extraction is performed on the image data to obtain target text data, a trained extraction model is obtained, and information extraction is performed on the target text data according to the extraction model to obtain contract information. Contract information can thus be extracted automatically, which increases extraction speed while ensuring the accuracy of the extracted contract information.
In one embodiment, as shown in FIG. 3, performing text extraction on the image data to obtain the target text data includes:
S301: identifying initial text data corresponding to the image data.
Specifically, the image data file is split into pages, each saved as a picture, and OCR is applied to the contract images page by page to obtain the text content of each page.
S302: sorting the characters in the initial text data to obtain sorted text data.
The recognized characters are not necessarily in human reading order and may carry no natural semantic information. The characters therefore need to be sorted by their coordinates to restore the reading order of the contract image, which facilitates subsequent information extraction and helps ensure its accuracy.
S303: judging whether the sorted text data meets the semantic requirement; if so, using the sorted text data as the text data corresponding to the image data; if not, returning to the step of identifying the initial text data corresponding to the image data.
The semantic requirement is that the text data carries natural semantic information that accurately represents the content of the contract image. Specifically, sorting the characters obtained through OCR can usually restore natural semantics, but for some contract images the recognition result contains a large number of errors, or natural semantics still cannot be restored from the sorted text because of problems such as excessive skew. Such text has lost the original semantic information of the data and can be regarded as text that does not match the actual content.
In this embodiment, the initial text data corresponding to the image data is identified, the characters in the initial text data are sorted to obtain sorted text data, and whether the sorted text data meets the semantic requirement is judged. This ensures that the recognized text data carries natural semantic information and thereby ensures the accuracy of contract information extraction.
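The control flow of steps S301 to S303 can be sketched as a retry loop. The three callables are hypothetical placeholders for the OCR engine, the coordinate sort, and the semantic check, none of which the source specifies. The `max_attempts` limit is also an assumption added here so that the loop always terminates; the source only says to return to the recognition step.

```python
def extract_text(image, recognize, sort_characters, meets_semantic_requirement,
                 max_attempts=3):
    """Recognize, sort, and check text; retry recognition if the check fails."""
    for _ in range(max_attempts):
        initial_text = recognize(image)               # S301: hypothetical OCR call
        sorted_text = sort_characters(initial_text)   # S302: restore reading order
        if meets_semantic_requirement(sorted_text):   # S303: semantic check
            return sorted_text
    return None  # assumption: give up after max_attempts failed recognitions


# Toy stand-ins: the first "recognition" is garbled, the second is usable.
attempts = iter(["dlrow olleh", "hello world"])
result = extract_text(
    image=b"",
    recognize=lambda img: next(attempts),
    sort_characters=lambda text: text,
    meets_semantic_requirement=lambda text: text.startswith("hello"),
)
```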
In one embodiment, sorting the characters in the initial text data includes: acquiring the character coordinates corresponding to each character in the initial text data; and sorting the characters according to the character coordinates.
The character coordinates are recognized at the same time as the text data when OCR is applied to the contract image: for each character in the contract image, its coordinate position in the image can be recognized, and the characters are then sorted by these coordinates. Specifically, the horizontal direction is taken as the x axis, increasing from left to right, and the vertical direction as the y axis, increasing from top to bottom. Content on the same line is grouped together by y coordinate, so that all text content is split into lines; for slightly skewed text, the skew can be compared to judge whether content belongs to the same line. The content of each line is then sorted by x coordinate in ascending order, which restores the natural semantic information of that line. Finally, all lines are sorted by y value in ascending order to obtain text content with natural semantic information.
In this embodiment, the character coordinates corresponding to each character in the initial text data are acquired and the characters are sorted according to these coordinates, which ensures that the recognized text data carries meaning and thereby ensures the accuracy of contract information extraction.
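The coordinate-based sorting described above can be sketched as follows. The input layout `(character, x, y)` and the `line_tolerance` parameter are assumptions for illustration; the tolerance is a simple stand-in for the skew comparison, which the source does not describe in detail.

```python
def sort_by_coordinates(chars, line_tolerance=5.0):
    """Group characters into lines by y, then order each line by x.

    chars: list of (character, x, y), with x growing rightward and y downward.
    line_tolerance: assumed maximum y difference for characters on one line.
    """
    lines = []  # each entry: {"y": y of the line's first character, "items": [(x, ch)]}
    for ch, x, y in sorted(chars, key=lambda c: c[2]):
        if lines and abs(y - lines[-1]["y"]) <= line_tolerance:
            lines[-1]["items"].append((x, ch))
        else:
            lines.append({"y": y, "items": [(x, ch)]})
    # Lines are already in ascending y order; sort each line by x.
    return ["".join(ch for _, ch in sorted(line["items"])) for line in lines]


# Made-up coordinates: two lines of two characters, given out of order.
recognized = [("b", 20, 11), ("c", 10, 40), ("a", 10, 10), ("d", 22, 42)]
ordered_lines = sort_by_coordinates(recognized)
```

Characters whose y values differ by no more than the tolerance are treated as one line, so the slightly offset "b" and "d" still join "a" and "c" respectively.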
In one embodiment, as shown in FIG. 4, before obtaining the trained extraction model, the method further includes:
S401: acquiring sample image data.
The sample image data is used to train the extraction model. It may be contract image data acquired over a network, or part of the image data to be subjected to information extraction, but the sample image data must not be identical to the image data to be subjected to information extraction.
S402: performing text extraction on the sample image data to obtain sample text data.
Text extraction on the sample image data follows the same steps as text extraction on the image data to be subjected to information extraction: the text data in the image data is recognized through OCR.
S403: preprocessing the sample text data to obtain preprocessed text data.
Text data obtained through OCR contains abnormal data, so the sample text data needs to be preprocessed. The preprocessing includes at least one of data cleaning, data conversion, and semantic screening. Data cleaning removes text in the sample text data that does not match the actual content (for example, characters that do not belong to natural language); data conversion unifies the letter case of the sample text data; and semantic screening removes text data whose semantics are obviously wrong.
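The three preprocessing operations can be sketched as below. The concrete rules are assumptions chosen for illustration, not the rules of the source: cleaning keeps word characters, whitespace, and basic punctuation; conversion lowercases; and the semantic screen is a crude heuristic that keeps text in which at least half of the tokens contain a letter or digit.

```python
import re


def clean(text):
    """Data cleaning: drop characters outside word characters, whitespace, and basic punctuation."""
    return re.sub(r"[^\w\s.,;:!?()%/-]", "", text)


def convert(text):
    """Data conversion: unify letter case."""
    return text.lower()


def passes_semantic_screen(text, min_ratio=0.5):
    """Semantic screening (crude assumption): enough tokens must contain a letter or digit."""
    tokens = text.split()
    if not tokens:
        return False
    meaningful = sum(1 for t in tokens if any(c.isalnum() for c in t))
    return meaningful / len(tokens) >= min_ratio


def preprocess(samples):
    result = []
    for text in samples:
        text = convert(clean(text))
        if passes_semantic_screen(text):
            result.append(text)
    return result


# Made-up OCR output: the first sample has stray block symbols, the second is noise.
samples = ["Party A: ACME Co.\u25a0\u25a0", "?? !! ;;"]
processed = preprocess(samples)
```

In Python 3, `\w` matches Unicode word characters, so Chinese characters survive the cleaning step while symbols such as the block character are removed.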
S404: performing data labeling on the preprocessed text data to obtain the labeling category corresponding to each character in the preprocessed text data.
In practical applications, to convert the text data into content that the extraction model can recognize, the preprocessed text data needs to be labeled according to the training data input format of a Named Entity Recognition (NER) model. Specifically, the first character of a piece of content to be extracted is labeled with a "B-" category, the remaining characters of that content are labeled with the "I-" category, and all other characters are labeled "O". Take the text data "Xiaoming goes to school at 9 am every day" as an example. The content to be extracted falls into two categories, "person name" and "place", where person name is denoted "P" and place is denoted "L"; these correspond to "Xiaoming" and "school" in the text data. The data is therefore labeled "B-P/I-P/O/B-L/I-L", where "B-P" marks the first character of content in the "person name" category, "I-P" marks the remaining characters of that content, "O" marks characters that belong to neither "person name" nor "place", "B-L" marks the first character of content in the "place" category, and "I-L" marks the remaining characters of that content.
Natural Language Processing (NLP) is a technology for interacting with machines using the natural language that humans use to communicate; by processing natural language, a computer can read and understand it. In brief, the basic task of natural language processing is to segment the corpus to be processed into words, based on an ontology dictionary, word frequency statistics, contextual semantic analysis, and similar means, forming semantically rich term units whose smallest unit is the part of speech. NER is a fundamental NLP task and a subtask of information extraction that aims to locate named entities in text and classify them into predefined categories such as persons, organizations, locations, time expressions, quantities, monetary values, and percentages.
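The "B-"/"I-"/"O" labeling scheme described above can be sketched at character level as follows. The function name and the entity list format are assumptions. The sketch labels every letter of the English example sentence as one character, whereas the source presumably labels the characters of the original-language sentence, so the label sequence here is longer than the abbreviated one quoted in the text.

```python
def bio_labels(text, entities):
    """Assign a B-/I-/O label to every character of `text`.

    entities: list of (substring, category); only the first occurrence
    of each substring is labeled in this sketch.
    """
    labels = ["O"] * len(text)
    for phrase, category in entities:
        start = text.find(phrase)
        if start == -1:
            continue  # phrase not present; leave labels unchanged
        labels[start] = f"B-{category}"
        for i in range(start + 1, start + len(phrase)):
            labels[i] = f"I-{category}"
    return labels


text = "Xiaoming goes to school at 9 am every day"
labels = bio_labels(text, [("Xiaoming", "P"), ("school", "L")])
```

Each position in `labels` lines up with the character at the same position in `text`, which is the per-character pairing the model is trained on.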
S405: performing model training according to the labeling categories and the corresponding preprocessed text data to obtain the extraction model.
The labeled preprocessed text data is input into the extraction model, which outputs a prediction category for each character. Taking the text data "Xiaoming goes to school at 9 am every day" as an example, with the labeling categories "person name" and "place": after the text data is input, the extraction model outputs the probability that each character is a "person name" or a "place". This probability serves as the character's prediction probability, and the extraction model is trained by combining it with the character's labeling category.
In this embodiment, sample image data is acquired and text extraction is performed on it to obtain sample text data; the sample text data is preprocessed to obtain preprocessed text data; data labeling is performed on the preprocessed text data to obtain the labeling category corresponding to each character; and finally model training is performed according to the labeling categories and the corresponding preprocessed text data to obtain the extraction model.
In one embodiment, as shown in FIG. 5, performing model training according to the labeling categories and the corresponding preprocessed text data to obtain the extraction model includes:
S501: inputting the preprocessed text data into the extraction model to obtain the prediction categories.
The prediction categories include the data category corresponding to each character in the preprocessed text data. Specifically, a word vector model converts the input text data into a machine-readable representation; a feature extraction module then learns contextual semantic information and produces, for each word vector, the probabilities of the different categories; finally, a classification layer classifies these results to obtain the prediction for each character. Taking the text data "Xiaoming goes to school at 9 am every day" as an example, the model uses the context to output the probabilities that "Xiaoming" belongs to the "person name" and "place" categories, and the prediction category of "Xiaoming" is "person name".
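The final classification step can be illustrated with a softmax followed by an argmax. This is an assumption about how the classification layer works; the source does not name the function it uses, and the raw scores below are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(scores, categories):
    """Return the most probable category and the full probability table."""
    probs = softmax(scores)
    best = max(range(len(categories)), key=lambda i: probs[i])
    return categories[best], dict(zip(categories, probs))


categories = ["person name", "place", "other"]
# Made-up scores for the token "Xiaoming" from a hypothetical feature extractor.
category, probs = predict_category([2.0, 0.5, 0.1], categories)
```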
S502: and acquiring a category error corresponding to the prediction category and the labeling category.
However, when the prediction type is obtained by the word vector model, there may be a slight error, for example, the type of the character "day" is predicted as "name", which indicates that the probability that the character "day" is "name" is high, but actually the probability that the character "day" is "name" should be low, and therefore, the type error corresponding to the prediction type and the label type can be obtained.
S503: and performing parameter optimization on the extraction model according to the category error until the category error reaches the preset requirement.
And adjusting parameters of the extraction model according to the category error, training the extraction model until the category error meets a preset requirement, for example, the category error is reduced to a certain value, indicating that the current extraction model reaches sufficient precision, and finishing the model training.
In the embodiment, the preprocessed text data is input into the extraction model to obtain the prediction category, the category error corresponding to the prediction category and the labeling category is obtained, the parameter optimization is performed on the extraction model according to the category error until the category error meets the preset requirement, the model precision of the extraction model can be ensured, and the accuracy of contract information extraction is ensured.
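The loop formed by S501 to S503 can be sketched as follows. This is a minimal illustration, not the BERT-based model of the embodiments: it assumes a toy classifier with one score per (character, category) pair, cross-entropy as the category error, and plain gradient descent. The names `train`, `predict`, `max_error`, `lr`, and `max_epochs` are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, categories, max_error=0.05, lr=0.5, max_epochs=1000):
    """Fit one score per (character, category) pair.

    samples: list of (character, labeling_category) pairs.
    Returns (weights, final_error, epochs_run).
    """
    chars = sorted({ch for ch, _ in samples})
    weights = {ch: [0.0] * len(categories) for ch in chars}
    error = float("inf")
    for epoch in range(1, max_epochs + 1):
        error = 0.0
        for ch, label in samples:
            probs = softmax(weights[ch])          # S501: predict categories
            target = categories.index(label)
            error -= math.log(probs[target])      # S502: category error
            for k in range(len(categories)):      # S503: adjust parameters
                grad = probs[k] - (1.0 if k == target else 0.0)
                weights[ch][k] -= lr * grad
        error /= len(samples)
        if error <= max_error:                    # preset requirement met
            break
    return weights, error, epoch


def predict(weights, categories, ch):
    """Return the most probable category for a character seen in training."""
    probs = softmax(weights[ch])
    return categories[probs.index(max(probs))]


categories = ["name", "place", "other"]
samples = [("Xiaoming", "name"), ("school", "place"), ("day", "other")]
weights, error, epochs = train(samples, categories)
print(predict(weights, categories, "Xiaoming"), epochs)
```

The stopping test on `max_error` plays the role of the preset requirement; `max_epochs` is an added safeguard so that training also ends when the labels are inconsistent and the error cannot fall below the threshold.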
In an embodiment, the extracting information of the target text data according to the extraction model to obtain the contract information includes: inputting the target text data into the extraction model to obtain target data categories corresponding to all characters in the target text data; and acquiring target text data corresponding to the target data type which is the same as the preset type to obtain contract information.
The target text data is input into the extraction model, which outputs the target data category (that is, the prediction category) corresponding to each character in the target text data. Contract information comprises various kinds of key information, such as the first party, the second party, the amount, and the signing date. When a certain kind of key information needs to be extracted, the corresponding preset category is determined, and the target text data whose target data category is the same as the preset category is then selected; the text selected in this way is the contract information to be extracted.
In this embodiment, the target text data is input into the extraction model to obtain the target data category corresponding to each character in the target text data, and the target text data whose target data category is the same as the preset category is acquired to obtain the contract information. The required contract information can thus be obtained accurately through the extraction model, which helps ensure the accuracy of contract information extraction.
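The selection step described above can be sketched as follows, assuming the extraction model has already produced one category per character. The function name `extract_by_category` and the category names `party_a`, `party_b`, and `other` are hypothetical and chosen only for illustration.

```python
def extract_by_category(characters, categories, preset_category):
    """Return the text spans whose predicted category equals preset_category.

    Consecutive characters with the matching category are joined into one span.
    """
    if len(characters) != len(categories):
        raise ValueError("each character needs exactly one category")
    spans, current = [], []
    for ch, cat in zip(characters, categories):
        if cat == preset_category:
            current.append(ch)
        elif current:
            spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


characters = list("A:Acme B:Bolt")
categories = ["other"] * 2 + ["party_a"] * 4 + ["other"] * 3 + ["party_b"] * 4
print(extract_by_category(characters, categories, "party_a"))  # ['Acme']
```

Joining consecutive matching characters into one span is a design choice of this sketch; the embodiment only requires that text with the preset category be selected.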
In one embodiment, as shown in fig. 6, there is provided an intelligent extraction method of contract information based on OCR and NLP, the method comprising:
(1) Step 101: constructing a contract image library: contract image data are collected from online sources, for example electronic data sources such as news sites, microblogs, and online libraries. Contract data held within the enterprise, if any, can be collected in the same way. Poor-quality contract images, such as distorted or blurred images, are then removed manually to obtain the final contract image library.
(2) Step 102: extracting text from the contract images: OCR is used to convert the optical characters in each image into an electronic document, as follows: multi-page PDF and TIFF files are split into single pages; the contract images are recognized by OCR page by page and the character content is extracted; the recognition results are sorted by character coordinates so that the recognized characters are restored to text in natural reading order. The optical characters in the contract images are thereby converted into an electronic document with natural semantics.
(3) Step 201: determining the extraction content: and analyzing the contract contents in the contract image library to determine the key content category to be extracted.
(4) Step 103: preprocessing the model training corpus: the NLP text corpus is preprocessed to obtain the training data of the model, which includes: cleaning out sorted texts that do not match the actual content, labeling the data according to the content to be extracted, handling stop words, and converting upper case to lower case. The text is obtained by OCR, and its natural semantics are restored by sorting the characters by their coordinates. For some contract images, however, the sorted text cannot recover the natural semantics, either because the recognition result contains a large number of errors or because of problems such as excessive skew. Such data have lost their original semantic information; they are the texts that do not match the actual content and need to be removed manually.
(5) Step 104: training the information extraction model: the contract training corpus obtained by the preprocessing in step 103 is input into the information extraction model for training. The main training process is as follows: the word vector model converts the input text into a machine-readable representation, namely word vectors; the information extraction module learns contextual semantics from the word vectors and predicts a category for each word vector, thereby obtaining a prediction result for each character. The prediction category of each character is compared with its labeling category, and the model keeps adjusting itself according to the resulting error until a satisfactory result is obtained.
Specifically, the extraction model is a named entity recognition (NER) model. Since the contracts processed are usually in Chinese, a Chinese pre-trained model based on Bidirectional Encoder Representations from Transformers (BERT) is used as the language representation model to generate word vectors from the textual context. Because contracts contain long texts, the model needs to take long-range context into account, so a bidirectional Long Short-Term Memory network (BiLSTM) is used to model the context information. BERT serves as the embedding layer, and the features it extracts are passed to the BiLSTM; the BiLSTM outputs, for each embedded character or word, a prediction score for each label, and these scores are the input to the classification layer. The final label prediction can be performed by a conditional random field (CRF) layer, because the CRF layer can add constraints to the predicted labels to ensure that they are valid, and it learns these constraints automatically during training.
(6) Step 105: formatting part of the output content (optional): if the archived information must follow a particular format, the content can be converted to a unified format, for example by unifying dates into the yyyy-mm-dd format, which keeps the information consistent and easy to consult.
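The date unification mentioned in step 105 can be sketched as follows. This is an illustrative sketch under assumptions: the accepted input forms (separators `-`, `/`, `.`, or the Chinese markers 年, 月, 日) and the function name `normalize_date` are not specified in the embodiment.

```python
import re
from datetime import date

# Year, month, day separated by -, /, . or the Chinese markers (assumed forms).
_DATE_PATTERN = re.compile(
    r"(\d{4})\s*[-/.年]\s*(\d{1,2})\s*[-/.月]\s*(\d{1,2})\s*日?"
)


def normalize_date(raw):
    """Return the first date in raw as yyyy-mm-dd, or None if none is valid."""
    match = _DATE_PATTERN.search(raw)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # e.g. month 13 or day 40
        return None


print(normalize_date("2022年11月2日"))  # 2022-11-02
print(normalize_date("2022/1/5"))       # 2022-01-05
```

Returning `None` for an unrecognized or impossible date leaves it to the caller to decide whether to keep the original string or flag the field for manual review.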
In this embodiment, an NLP-based key information extraction method replaces manual information extraction and entry, which significantly improves the efficiency of information extraction and entry, reduces the manpower and resources required, and helps speed up contract information auditing.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
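To illustrate why a constraint-aware label prediction layer, such as the CRF used in step 104, helps keep predicted labels valid, the sketch below compares per-character argmax with a constrained Viterbi search. It rests on several assumptions not stated in the embodiment: a BIO label scheme, a hand-written rule `is_allowed` in place of learned CRF transition weights, and made-up scores standing in for BiLSTM outputs.

```python
def is_allowed(prev_label, label):
    """BIO rule (assumed scheme): I-X may only follow B-X or I-X."""
    if label.startswith("I-"):
        entity = label[2:]
        return prev_label in ("B-" + entity, "I-" + entity)
    return True


def constrained_decode(scores, labels):
    """Viterbi search for the best-scoring label sequence obeying is_allowed.

    scores: one dict per character mapping label -> score.
    """
    if not scores:
        return []
    neg_inf = float("-inf")
    # A sequence cannot start with an I- label.
    best = {
        lab: (neg_inf if lab.startswith("I-") else scores[0][lab])
        for lab in labels
    }
    back = []
    for step in scores[1:]:
        new_best, pointers = {}, {}
        for lab in labels:
            candidates = [(best[p], p) for p in labels if is_allowed(p, lab)]
            top_score, top_prev = max(candidates)
            new_best[lab] = top_score + step[lab]
            pointers[lab] = top_prev
        best = new_best
        back.append(pointers)
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-AMT", "I-AMT"]
scores = [
    {"O": 2.0, "B-AMT": 1.0, "I-AMT": 0.0},
    {"O": 0.0, "B-AMT": 0.5, "I-AMT": 2.0},
    {"O": 0.1, "B-AMT": 0.0, "I-AMT": 1.5},
]
greedy = [max(step, key=step.get) for step in scores]
print(greedy)                               # ['O', 'I-AMT', 'I-AMT']
print(constrained_decode(scores, labels))   # ['B-AMT', 'I-AMT', 'I-AMT']
```

The greedy result starts an entity with an I- label directly after O, which is invalid under the assumed scheme; the constrained search returns the best valid sequence instead. A trained CRF would learn transition scores rather than apply a fixed rule.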
In one embodiment, as shown in fig. 7, there is provided an intelligent contract information extraction system based on OCR and NLP, comprising: the system comprises an acquisition module 710, an OCR recognition module 720, a corpus processing module 730, an information extraction module 740, and a post-processing module 750, wherein:
an obtaining module 710, configured to obtain contract images from various data sources;
the OCR recognition module 720 is used for splitting the TIFF and the PDF into single-page pictures; recognizing optical characters in the image page by page as electronic texts, and sequencing the characters into human reading word order by a coordinate sequencing technology;
the corpus processing module 730 is used for performing data preprocessing on the sequenced texts, including processing of stop words, capitalization-to-lowercase conversion and the like, and finally forming a normalized NLP text corpus;
the information extraction module 740 mainly includes three sub-modules: a word vector generation module 741, a similarity obtaining module 742, and a label prediction module 743, wherein:
a word vector generation module 741, configured to train a word vector model on the text corpus, where, if the selected model has a corresponding pre-trained model, the pre-trained model may be used directly;
a similarity obtaining module 742, configured to receive the features extracted by the word vector model, which serves as the embedding layer, and to output a prediction score of each preset category for each character or word embedding, where the prediction score directly indicates how likely the character is to belong to each category: the larger the score predicted for a category, the more likely the character belongs to that category;
and a label prediction module 743, configured to perform label prediction on the prediction scores output by the similarity obtaining module 742, so as to obtain the final category of each character.
The post-processing module 750 is used for performing unified post-processing on the extraction results according to the input requirements.
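The corpus processing performed by module 730 (stop-word handling and upper-case to lower-case conversion) can be sketched as follows. The sketch assumes the text has already been split into tokens; the stop-word list and the function name `preprocess` are hypothetical.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"the", "a", "an", "of"})


def preprocess(tokens, stop_words=STOP_WORDS):
    """Lowercase every token and drop stop words and empty tokens."""
    cleaned = []
    for token in tokens:
        token = token.strip().lower()
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned


print(preprocess(["The", "Buyer", " of ", "GOODS", ""]))  # ['buyer', 'goods']
```

For Chinese contracts the tokenization step and the stop-word list would differ; only the overall shape of the normalization is shown here.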
Based on the same inventive concept, an embodiment of the present application further provides a contract information extraction apparatus for implementing the contract information extraction method described above. The apparatus solves the problem in a manner similar to that described for the method, so for the specific limitations in the one or more embodiments of the contract information extraction apparatus provided below, reference may be made to the limitations on the contract information extraction method above; details are not repeated here.
In one embodiment, as shown in fig. 8, there is provided a contract information extraction apparatus including: the image acquisition module 10, the text acquisition module 20, the model acquisition module 30 and the information extraction module 40, wherein:
an image obtaining module 10, configured to obtain image data to be subjected to information extraction;
the text acquisition module 20 is configured to perform text extraction on the image data to obtain target text data;
a model obtaining module 30, configured to obtain an extraction model obtained through training;
and the information extraction module 40 is used for extracting information of the target text data according to the extraction model to obtain contract information.
In one embodiment, the text acquisition module includes a data acquisition unit, a character sorting unit, and a semantic judging unit, wherein:
the data acquisition unit is used for identifying initial text data corresponding to the image data;
the character sorting unit is used for sorting all characters in the initial text data to obtain sorted text data;
the semantic judging unit is used for judging whether the sequenced text data meet semantic requirements or not; if the sorted text data meet the semantic requirement, the sorted text data are used as the text data corresponding to the image data; and if the sorted text data do not meet the semantic requirement, returning to the step of identifying the initial text data corresponding to the image data.
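The cooperation of the three units above can be sketched as a recognize, sort, and check loop. The recognizer, the sorter, and the semantic check are passed in as callables because the embodiment does not specify them; the `max_attempts` limit is an added assumption so that the loop always terminates.

```python
def extract_text(image, recognize, sort_characters, meets_semantics,
                 max_attempts=3):
    """Recognize, sort, and check the text; recognize again if the check fails.

    Returns the sorted text, or None if no attempt passes the check.
    """
    for _ in range(max_attempts):
        initial = recognize(image)           # data acquisition unit
        ordered = sort_characters(initial)   # character sorting unit
        if meets_semantics(ordered):         # semantic judging unit
            return ordered
    return None


# Stub recognizer for illustration: the first attempt yields garbled text.
calls = []


def fake_recognize(image):
    calls.append(image)
    return "ba" if len(calls) == 1 else "ab"


result = extract_text("page-1", fake_recognize, lambda t: t,
                      lambda t: t == "ab")
print(result, len(calls))  # ab 2
```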
In one embodiment, the character sorting unit includes: a coordinate acquisition subunit and a character sorting subunit, wherein:
the coordinate acquisition subunit is used for acquiring character coordinates corresponding to each character in the initial text data;
and the character sorting subunit is used for sorting the characters according to the character coordinates.
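Sorting by character coordinates, as performed by the two subunits above, can be sketched as follows. The sketch assumes each OCR result is a `(character, x, y)` tuple with y increasing downward, and that characters whose y values lie within a hypothetical `line_tolerance` of the first character of a line belong to that line.

```python
def sort_by_coordinates(chars, line_tolerance=10.0):
    """Order OCR characters top to bottom, then left to right.

    chars: list of (character, x, y) tuples, with y increasing downward.
    Returns the text with one output line per detected text line.
    """
    lines = []
    for item in sorted(chars, key=lambda c: c[2]):
        # Compare with the first (topmost) character of the current line.
        if lines and abs(item[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return "\n".join(
        "".join(ch for ch, _, _ in sorted(line, key=lambda c: c[1]))
        for line in lines
    )


ocr_chars = [("b", 20, 101), ("a", 10, 100), ("d", 22, 130), ("c", 11, 131)]
print(sort_by_coordinates(ocr_chars))
```

Grouping by a fixed tolerance works for upright scans; heavily skewed pages, which the specification notes may fail to recover natural semantics, would need deskewing first.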
In one embodiment, the model obtaining module further includes a sample acquisition unit, a text extraction unit, a text processing unit, a data labeling unit, and a model training unit, wherein:
the sample acquisition unit is used for acquiring sample image data;
the text extraction unit is used for performing text extraction on the sample image data to obtain sample text data;
the text processing unit is used for preprocessing the sample text data to obtain preprocessed text data, and the preprocessing comprises at least one of data cleaning, data conversion and semantic screening;
the data labeling unit is used for performing data labeling processing on the preprocessed text data to obtain labeling categories corresponding to all characters in the preprocessed text data;
and the model training unit is used for carrying out model training according to the labeling type and the corresponding preprocessed text data to obtain an extraction model.
In one embodiment, the model training unit includes: a category acquisition subunit, an error acquisition subunit, and a parameter optimization subunit, wherein:
the category acquisition subunit is used for inputting the preprocessed text data into the extraction model to obtain a prediction category, wherein the prediction category comprises a data category corresponding to each character in the preprocessed text data;
the error acquisition subunit is used for acquiring the category errors corresponding to the prediction categories and the labeling categories;
and the parameter optimization subunit is used for performing parameter optimization on the extraction model according to the category error until the category error meets the preset requirement.
In one embodiment, the information extracting module includes: a target category acquisition unit and a contract information acquisition unit, wherein:
the target type obtaining unit is used for inputting the target text data into the extraction model to obtain target data types corresponding to all characters in the target text data;
and the contract information acquisition unit is used for acquiring target text data corresponding to the target data type which is the same as the preset type to obtain contract information.
The modules in the contract information extraction apparatus may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication can be implemented through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a contract information extraction method. The display unit of the computer device is used for forming a visible image and may be a display screen, a projection device, or a virtual reality imaging device. The display screen may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a key, trackball, or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of: acquiring image data to be subjected to information extraction; performing text extraction on the image data to obtain target text data; obtaining an extraction model obtained by training; and extracting information of the target text data according to the extraction model to obtain contract information.
In one embodiment, when the processor executes the computer program, performing text extraction on the image data to obtain the target text data comprises: identifying initial text data corresponding to the image data; sorting the characters in the initial text data to obtain sorted text data; judging whether the sorted text data meets a semantic requirement; if the sorted text data meets the semantic requirement, taking the sorted text data as the text data corresponding to the image data; and if the sorted text data does not meet the semantic requirement, returning to the step of identifying the initial text data corresponding to the image data.
In one embodiment, when the processor executes the computer program, sorting the characters in the initial text data comprises: acquiring the character coordinates corresponding to each character in the initial text data; and sorting the characters according to the character coordinates.
In one embodiment, when executing the computer program, the processor further performs the following steps before obtaining the trained extraction model: acquiring sample image data; performing text extraction on the sample image data to obtain sample text data; preprocessing the sample text data to obtain preprocessed text data, wherein the preprocessing comprises at least one of data cleaning, data conversion, and semantic screening; performing data labeling on the preprocessed text data to obtain the labeling category corresponding to each character in the preprocessed text data; and performing model training according to the labeling categories and the corresponding preprocessed text data to obtain an extraction model.
In one embodiment, when the processor executes the computer program, performing model training according to the labeling categories and the corresponding preprocessed text data to obtain the extraction model comprises: inputting the preprocessed text data into the extraction model to obtain prediction categories, wherein the prediction categories comprise the data category corresponding to each character in the preprocessed text data; acquiring the category error between the prediction categories and the labeling categories; and performing parameter optimization on the extraction model according to the category error until the category error meets a preset requirement.
In one embodiment, when the processor executes the computer program, performing information extraction on the target text data according to the extraction model to obtain the contract information comprises: inputting the target text data into the extraction model to obtain the target data category corresponding to each character in the target text data; and acquiring the target text data whose target data category is the same as a preset category to obtain the contract information.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring image data to be subjected to information extraction; performing text extraction on the image data to obtain target text data; obtaining an extraction model obtained by training; and extracting information of the target text data according to the extraction model to obtain contract information.
In one embodiment, the computer program, when executed by the processor, involves performing text extraction on the image data to obtain target text data, comprising: identifying initial text data corresponding to the image data; sequencing all characters in the initial text data to obtain sequenced text data; judging whether the sequenced text data meet semantic requirements or not; if the sorted text data meet the semantic requirement, the sorted text data are used as the text data corresponding to the image data; and if the sorted text data do not meet the semantic requirement, returning to the step of identifying the initial text data corresponding to the image data.
In one embodiment, the computer program, when executed by the processor, relates to sorting characters in the initial text data, comprising: acquiring character coordinates corresponding to each character in the initial text data; and sorting the characters according to the character coordinates.
In one embodiment, the computer program, when executed by the processor, further performs the following steps before obtaining the trained extraction model: acquiring sample image data; performing text extraction on the sample image data to obtain sample text data; preprocessing the sample text data to obtain preprocessed text data, wherein the preprocessing comprises at least one of data cleaning, data conversion, and semantic screening; performing data labeling on the preprocessed text data to obtain the labeling category corresponding to each character in the preprocessed text data; and performing model training according to the labeling categories and the corresponding preprocessed text data to obtain an extraction model.
In one embodiment, the computer program, when executed by a processor, is directed to model training based on the labeled categories and corresponding preprocessed textual data to obtain an extraction model, comprising: inputting the preprocessed text data into an extraction model to obtain prediction categories, wherein the prediction categories comprise data categories corresponding to all characters in the preprocessed text data; acquiring a category error corresponding to the prediction category and the labeling category; and performing parameter optimization on the extraction model according to the category error until the category error reaches the preset requirement.
In one embodiment, the computer program, when executed by the processor, involves information extraction of target text data according to an extraction model to obtain contract information, comprising: inputting the target text data into the extraction model to obtain target data categories corresponding to all characters in the target text data; and acquiring target text data corresponding to the target data type which is the same as the preset type to obtain contract information.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of: acquiring image data to be subjected to information extraction; performing text extraction on the image data to obtain target text data; obtaining an extraction model obtained by training; and extracting information of the target text data according to the extraction model to obtain contract information.
In one embodiment, the computer program, when executed by the processor, involves performing text extraction on the image data to obtain target text data, comprising: identifying initial text data corresponding to the image data; sequencing all characters in the initial text data to obtain sequenced text data; judging whether the sequenced text data meet semantic requirements or not; if the sorted text data meet the semantic requirement, the sorted text data are used as text data corresponding to the image data; and if the sorted text data do not meet the semantic requirement, returning to the step of identifying the initial text data corresponding to the image data.
In one embodiment, the computer program, when executed by the processor, relates to sorting characters in the initial text data, comprising: acquiring character coordinates corresponding to each character in initial text data; and sorting the characters according to the character coordinates.
In one embodiment, the computer program, when executed by the processor, further performs the following steps before obtaining the trained extraction model: acquiring sample image data; performing text extraction on the sample image data to obtain sample text data; preprocessing the sample text data to obtain preprocessed text data, wherein the preprocessing comprises at least one of data cleaning, data conversion, and semantic screening; performing data labeling on the preprocessed text data to obtain the labeling category corresponding to each character in the preprocessed text data; and performing model training according to the labeling categories and the corresponding preprocessed text data to obtain an extraction model.
In one embodiment, the computer program, when executed by a processor, is directed to model training based on the labeled categories and corresponding preprocessed textual data to obtain an extraction model, comprising: inputting the preprocessed text data into the extraction model to obtain prediction categories, wherein the prediction categories comprise data categories corresponding to all characters in the preprocessed text data; acquiring a category error corresponding to the prediction category and the labeling category; and performing parameter optimization on the extraction model according to the category errors until the category errors reach the preset requirements.
In one embodiment, the computer program, when executed by the processor, involves information extraction of target text data according to an extraction model to obtain contract information, comprising: inputting the target text data into the extraction model to obtain target data categories corresponding to all characters in the target text data; and acquiring target text data corresponding to the target data type which is the same as the preset type to obtain contract information.
It should be noted that, the contract information (including but not limited to user personal information, user business information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in this application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant countries and regions.
Those skilled in the art will understand that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, databases, or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered within the scope of this specification as long as the combined features do not contradict one another.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A contract information extraction method, characterized in that the method comprises:
acquiring image data to be subjected to information extraction;
performing text extraction on the image data to obtain target text data;
obtaining an extraction model obtained by training;
and extracting information of the target text data according to the extraction model to obtain contract information.
2. The method of claim 1, wherein the performing text extraction on the image data to obtain target text data comprises:
identifying initial text data corresponding to the image data;
sequencing all characters in the initial text data to obtain sequenced text data;
judging whether the sequenced text data meet semantic requirements or not;
if the sorted text data meet the semantic requirement, taking the sorted text data as text data corresponding to the image data;
and if the sorted text data do not meet the semantic requirement, returning to the step of identifying the initial text data corresponding to the image data.
3. The method of claim 2, wherein sorting the characters in the initial text data comprises:
acquiring character coordinates corresponding to each character in the initial text data;
and sorting the characters according to the character coordinates.
4. The method of claim 1, wherein, before obtaining the trained extraction model, the method further comprises:
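One possible way to order characters by their coordinates, offered only as a sketch, is to group characters into lines by vertical position and then order each line from left to right. The `OcrChar` type, the `sort_chars` function, and the `line_tolerance` value are hypothetical; the claim states only that characters are ordered according to their coordinates.

```python
from typing import List, NamedTuple


class OcrChar(NamedTuple):
    text: str
    x: float  # left edge of the character box
    y: float  # top edge of the character box


def sort_chars(chars: List[OcrChar], line_tolerance: float = 5.0) -> str:
    """Order characters top-to-bottom, then left-to-right within each line."""
    lines: List[List[OcrChar]] = []
    for ch in sorted(chars, key=lambda c: c.y):
        # Characters whose y is close to the first character of the current
        # line are treated as belonging to that line (assumed tolerance).
        if lines and abs(ch.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return "\n".join(
        "".join(c.text for c in sorted(line, key=lambda c: c.x))
        for line in lines
    )
```

The tolerance absorbs small vertical jitter in OCR boxes, so characters on the same printed line are not split apart by sub-pixel differences.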
acquiring sample image data;
performing text extraction on the sample image data to obtain sample text data;
preprocessing the sample text data to obtain preprocessed text data, wherein the preprocessing comprises at least one of data cleaning, data conversion and semantic screening;
performing data labeling on the preprocessed text data to obtain labeling categories corresponding to the characters in the preprocessed text data;
and performing model training according to the labeling categories and the corresponding preprocessed text data to obtain an extraction model.
5. The method of claim 4, wherein performing model training according to the labeling categories and the corresponding preprocessed text data to obtain an extraction model comprises:
inputting the preprocessed text data into an extraction model to obtain prediction categories, wherein the prediction categories comprise data categories corresponding to all characters in the preprocessed text data;
acquiring a category error between the prediction categories and the labeling categories;
and performing parameter optimization on the extraction model according to the category error until the category error reaches a preset requirement.
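The predict, measure error, and optimize loop of claim 5 can be illustrated with a deliberately small stand-in model. The sketch below uses a multiclass perceptron over hand-written character features, which is an assumption made purely for illustration; the claim does not name a model architecture, and the `features`, `predict`, and `train` names, the `max_error` threshold, and the `max_epochs` cap are all hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Weights = DefaultDict[Tuple[str, str], float]


def features(ch: str) -> List[str]:
    # Hypothetical per-character features; a real model would learn these.
    kind = "digit" if ch.isdigit() else "alpha" if ch.isalpha() else "other"
    return [kind, "bias"]


def predict(weights: Weights, ch: str, categories: List[str]) -> str:
    # Pick the category with the highest summed feature weight.
    return max(categories, key=lambda cat: sum(weights[(f, cat)] for f in features(ch)))


def train(
    samples: List[Tuple[str, List[str]]],  # (text, per-character labeling categories)
    categories: List[str],
    max_error: float = 0.0,  # assumed "preset requirement" on the category error
    max_epochs: int = 50,
) -> Tuple[Weights, float]:
    weights: Weights = defaultdict(float)
    error = 1.0
    for _ in range(max_epochs):
        wrong = total = 0
        for text, labels in samples:
            for ch, gold in zip(text, labels):
                guess = predict(weights, ch, categories)
                total += 1
                if guess != gold:
                    wrong += 1
                    # Perceptron update: reward the gold category, penalize the guess.
                    for f in features(ch):
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
        error = wrong / total if total else 0.0  # category error for this epoch
        if error <= max_error:  # stop once the preset requirement is met
            break
    return weights, error
```

Here the category error is the fraction of mislabeled characters in an epoch; any other error measure, such as a cross-entropy loss, could play the same role in the stopping condition.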
6. The method of claim 5, wherein performing information extraction on the target text data according to the extraction model to obtain contract information comprises:
inputting the target text data into the extraction model to obtain target data categories corresponding to the characters in the target text data;
and acquiring the target text data whose target data category is the same as a preset category, to obtain the contract information.
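A minimal sketch of the selection step in claim 6, assuming the extraction model has already produced one category per character: consecutive characters sharing a preset category are collected into one field value. The `collect_fields` name and the category labels used in the usage note (`"PARTY"`, `"AMOUNT"`, `"O"`) are hypothetical and do not come from the application.

```python
from itertools import groupby
from typing import Dict, List, Set


def collect_fields(text: str, categories: List[str], preset: Set[str]) -> Dict[str, List[str]]:
    """Keep runs of characters whose predicted category is a preset category."""
    if len(text) != len(categories):
        raise ValueError("expected exactly one category per character")
    fields: Dict[str, List[str]] = {}
    # groupby merges consecutive characters that share the same category.
    for category, run in groupby(zip(text, categories), key=lambda pair: pair[1]):
        if category in preset:
            fields.setdefault(category, []).append("".join(ch for ch, _ in run))
    return fields
```

For example, the text `A:X 50` with categories `["O", "O", "PARTY", "O", "AMOUNT", "AMOUNT"]` and preset categories `{"PARTY", "AMOUNT"}` yields one `PARTY` value and one `AMOUNT` value, while characters labeled `O` are discarded.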
7. A contract information extraction apparatus, characterized in that the apparatus comprises:
an image acquisition module configured to acquire image data to be subjected to information extraction;
a text acquisition module configured to perform text extraction on the image data to obtain target text data;
a model acquisition module configured to obtain a trained extraction model;
and an information extraction module configured to perform information extraction on the target text data according to the extraction model to obtain contract information.
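For illustration only, the four modules of claim 7 could be composed as interchangeable callables on a single object. The `ContractExtractor` class, its field names, and the `Model` type alias are assumptions made for this sketch and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface: text in, one category per character out.
Model = Callable[[str], List[str]]


@dataclass
class ContractExtractor:
    acquire_image: Callable[[str], bytes]            # image acquisition module
    extract_text: Callable[[bytes], str]             # text acquisition module
    load_model: Callable[[], Model]                  # model acquisition module
    extract_info: Callable[[str, Model], Dict[str, str]]  # information extraction module

    def run(self, source: str) -> Dict[str, str]:
        """Chain the four modules in the order recited in the claim."""
        image = self.acquire_image(source)
        text = self.extract_text(image)
        model = self.load_model()
        return self.extract_info(text, model)
```

Keeping each module behind a plain callable lets the OCR step or the trained model be swapped independently of the others.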
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
CN202211360785.4A 2022-11-02 2022-11-02 Contract information extraction method, contract information extraction device, computer apparatus, contract information extraction medium, and program product Pending CN115617951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211360785.4A CN115617951A (en) 2022-11-02 2022-11-02 Contract information extraction method, contract information extraction device, computer apparatus, contract information extraction medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211360785.4A CN115617951A (en) 2022-11-02 2022-11-02 Contract information extraction method, contract information extraction device, computer apparatus, contract information extraction medium, and program product

Publications (1)

Publication Number Publication Date
CN115617951A true CN115617951A (en) 2023-01-17

Family

ID=84876056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211360785.4A Pending CN115617951A (en) 2022-11-02 2022-11-02 Contract information extraction method, contract information extraction device, computer apparatus, contract information extraction medium, and program product

Country Status (1)

Country Link
CN (1) CN115617951A (en)

Similar Documents

Publication Publication Date Title
US10915788B2 (en) Optical character recognition using end-to-end deep learning
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN112434691A (en) HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
US11379690B2 (en) System to extract information from documents
CN114298035A (en) Text recognition desensitization method and system thereof
CN115862040A (en) Text error correction method and device, computer equipment and readable storage medium
CN115937887A (en) Method and device for extracting document structured information, electronic equipment and storage medium
CN112149680A (en) Wrong word detection and identification method and device, electronic equipment and storage medium
CN111008624A (en) Optical character recognition method and method for generating training sample for optical character recognition
CN112464927B (en) Information extraction method, device and system
CN113868419A (en) Text classification method, device, equipment and medium based on artificial intelligence
CN113642569A (en) Unstructured data document processing method and related equipment
CN112418813A (en) AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium
CN112307749A (en) Text error detection method and device, computer equipment and storage medium
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
CN116860747A (en) Training sample generation method and device, electronic equipment and storage medium
CN115984886A (en) Table information extraction method, device, equipment and storage medium
CN116030469A (en) Processing method, processing device, processing equipment and computer readable storage medium
CN115690816A (en) Text element extraction method, device, equipment and medium
CN115393870A (en) Text information processing method, device, equipment and storage medium
CN115880702A (en) Data processing method, device, equipment, program product and storage medium
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN115617951A (en) Contract information extraction method, contract information extraction device, computer apparatus, contract information extraction medium, and program product
CN117494688B (en) Form information extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination