CN109816118B - Method and terminal for creating structured document based on deep learning model - Google Patents


Info

Publication number: CN109816118B
Application number: CN201910074243.2A
Authority: CN (China)
Prior art keywords: document, deep learning model, information, picture
Legal status: Active (an assumption by Google Patents, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN109816118A
Inventors: 黄征, 陈凯, 周曲, 周异, 何建华
Current and original assignees: Xiamen Shangji Network Technology Co ltd; Shanghai Shenyao Intelligent Technology Co ltd
Events: application filed by Xiamen Shangji Network Technology Co ltd and Shanghai Shenyao Intelligent Technology Co ltd; priority to CN201910074243.2A; publication of CN109816118A; application granted; publication of CN109816118B

Classifications

  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a terminal for creating a structured document based on a deep learning model, and belongs to the field of data processing. The method comprises: presetting a training sample set, in which each sample comprises a document picture and a corresponding annotation document that records the position information and category information of each key field in the document picture; training a preset first deep learning model with the training sample set to obtain a second deep learning model; analyzing a first document picture with the second deep learning model to obtain the position information and category information of each key field in the first document picture; and creating a structured document corresponding to the first document picture from the position information and category information of those key fields. The accuracy of converting document pictures into structured documents is thereby improved.

Description

Method and terminal for creating structured document based on deep learning model
Technical Field
The invention relates to a method and a terminal for creating a structured document based on a deep learning model, and belongs to the field of data processing.
Background
Document structuring is the process of extracting key field information (such as the payer, payment date and payee on a receipt) from the large amount of text in a document and storing it according to a defined structure. Once a large number of documents have been structured, intelligent services such as efficient document retrieval and document analysis can be provided. The core of document structuring, and also its main technical difficulty, is extracting the key field information from a large body of text, which involves locating the required key field in the document and recognizing the text at that location.
For document structuring applications with high volume and high accuracy requirements, such as invoice reimbursement and bank settlement, many critical tasks of the document structuring system are still performed manually. The workflow of a manual document structuring system is shown in FIG. 1: fields are located manually, the field text is read manually, and the recognized text is typed into the corresponding fields of the archived structured document. Although manual field location and manual text recognition are highly accurate, a manual document structuring system has many shortcomings: recognition is slow, labor costs are high, performance is easily degraded by fatigue and other factors, extra time is needed for text entry, and text entry easily introduces additional errors. It is therefore ill-suited to building a large-scale, efficient and economical document structuring system.
With the rapid development of information processing technology, and of deep learning in particular, text localization and text recognition have improved greatly in recent years; in some domains, recognition accuracy approaches the manual level, which has helped many applications reach deployment. Deep learning has also been applied to document structuring systems. A current document structuring scheme using deep learning, whose workflow is shown in FIG. 2, comprises the following basic steps: determine the fixed positions of the different key fields by analyzing templates and statistics over a large number of documents; preprocess the document to be structured, converting it into a digital image if it is not already one; normalize and align the positions of the key field content; crop the image region corresponding to each key field from the document at its fixed position; recognize the text with deep learning OCR; and automatically store the recognized text in the corresponding fields of the structured document.
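The fixed-position pipeline described above can be sketched in a few lines; the field names, box coordinates and OCR stub below are illustrative assumptions, not taken from the patent.

```python
# Sketch of the prior-art pipeline: every key field is cropped from the
# same hard-coded box in every document image. Field names and box
# coordinates are hypothetical examples.
FIXED_BOXES = {
    "invoice_code": (2, 1, 8, 3),   # (x1, y1, x2, y2) in pixels, assumed
    "total_amount": (2, 4, 8, 6),
}

def crop(image, box):
    """Cut a rectangular region out of an image stored as rows of pixels."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def structure_document(image, ocr):
    """Crop each field at its fixed position, then run OCR on the crop."""
    return {field: ocr(crop(image, box)) for field, box in FIXED_BOXES.items()}
```

If the printed content shifts even slightly, the crop no longer covers the field, which is exactly the failure mode analyzed in the next paragraph.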
This existing deep learning scheme reduces the field localization task to cropping the field image from a fixed position in the image, recognizes the text with deep learning OCR, fully automates the key tasks, and greatly improves computational efficiency. However, the scheme only works when the position of the field to be cropped is fixed across all documents, which limits its range of use. In practice, if the invoice printing system prints the key field content at a different position, or the content length changes, the key field content shifts outside the configured region and errors result. In many bill recognition applications, large numbers of bills are digitized by scanning or by photographing with a mobile phone, which easily displaces the bill within the image; different bills may also use different layouts, so the same field does not necessarily appear at the same position in every image. For such application scenarios, where position offsets arise easily, this document structuring scheme converts images into structured documents with low accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: how to improve the accuracy of converting document pictures into structured documents.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention provides a method for creating a structured document based on a deep learning model, which comprises the following steps:
s1, presetting a training sample set; each sample in the training sample set comprises a document picture and an annotated document corresponding to the document picture; the annotation document records the position information and the category information of each key field in the document picture;
s2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model;
s3, analyzing a first document picture by the second deep learning model to obtain position information and category information of each key field in the first document picture;
and S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture.
Preferably, S4 specifically is:
s41, obtaining position information of a key field to obtain current position information;
s42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture;
s43, identifying characters in the key field picture to obtain text information;
s44, adding the category information of the key field and the text information to a preset structured document;
and S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed.
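Steps S41 to S45 above amount to a loop over the detected key fields. A minimal sketch, assuming detections arrive as (category, box) pairs and text recognition is an external function:

```python
def create_structured_document(image, detections, recognize):
    """S41-S45: for each key field, take its position information (S41),
    crop that region from the first document picture (S42), recognize the
    text in the crop (S43), and add category + text to the structured
    document (S44), until every key field has been traversed (S45)."""
    structured = {}
    for category, (x1, y1, x2, y2) in detections:
        field_picture = [row[x1:x2] for row in image[y1:y2]]  # S42: crop
        structured[category] = recognize(field_picture)       # S43-S44
    return structured
```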
Preferably, S2 is specifically:
s21, distributing a unique number for each category of information;
s22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
s23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
s24, comparing the information set with the current labeled document, and calculating to obtain an error value; the information set and the category information in the current markup document are represented by the number;
s25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value, and obtaining the second deep learning model.
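Steps S21 and S24 can be illustrated with a small sketch. The concrete error formula (a category mismatch penalty plus a scaled coordinate difference) is an assumed example; the patent only requires that categories be number-coded and that an error value be computed.

```python
def assign_numbers(categories):
    """S21: allocate a unique number to each category of information."""
    return {c: i for i, c in enumerate(sorted(categories))}

def error_value(predicted, annotated, number):
    """S24: compare the model's information set with the annotation
    document; each item is (category, box), and categories are compared
    via their numbers. The weighting here is illustrative."""
    err = 0.0
    for (p_cat, p_box), (a_cat, a_box) in zip(predicted, annotated):
        err += 0.0 if number[p_cat] == number[a_cat] else 1.0       # class
        err += sum(abs(p - a) for p, a in zip(p_box, a_box)) / 100  # position
    return err
```

S25 and S26 would then back-propagate this error and repeat until it falls below the preset threshold.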
Preferably, the first deep learning model is used for object detection.
The invention also provides a terminal for creating a structured document based on a deep learning model, comprising one or more processors and a memory, wherein the memory stores a program configured to be executed by the one or more processors to perform the following steps:
s1, presetting a training sample set; each sample in the training sample set comprises a document picture and an annotated document corresponding to the document picture; the annotation document records the position information and the category information of each key field in the document picture;
s2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model;
s3, analyzing a first document picture by the second deep learning model to obtain position information and category information of each key field in the first document picture;
and S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture.
Preferably, S4 specifically is:
s41, acquiring position information of a key field to obtain current position information;
s42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture;
s43, identifying characters in the key field picture to obtain text information;
s44, adding the category information of the key field and the text information to a preset structured document;
and S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed.
Preferably, S2 is specifically:
s21, distributing a unique number for each category of information;
s22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
s23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
s24, comparing the information set with the current labeled document, and calculating to obtain an error value; the information set and the category information in the current markup document are represented by the number;
s25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value, and obtaining the second deep learning model.
Preferably, the first deep learning model is used for object detection.
The invention has the following beneficial effects:
1. The invention provides a method and a terminal for creating a structured document based on a deep learning model, unlike the prior art, which reduces the field localization task to cropping the field image from a fixed position in the image. With the document structuring method of the invention, a key field may appear at any position in the document picture, so the category and text content of each key field can still be correctly recognized and matched in application scenarios where document pictures are digitized by scanning or photographing and key field positions shift easily within the picture; the accuracy of converting document pictures into structured documents is thereby improved. Moreover, for document pictures with different layouts but the same substantive content, the positions of all categories of key fields can be recognized with a single model; there is no need, as in the prior art, to match each layout against its own dedicated set of key field positions, which saves considerable resources and improves both the efficiency and the accuracy of converting document pictures into structured documents.
2. Furthermore, the text corresponding to a key field's category information is recognized from that key field's position information, and the category information and text information belonging to the same key field are associated and stored in the structured document, which supports efficient document retrieval, document analysis and other intelligent services.
3. Furthermore, because the output of the deep learning model is numeric, representing the category information by numeric numbers in the annotation document avoids errors in converting the model's output into the corresponding information category and improves the accuracy of comparing the model's recognition result with the reference result, thereby improving the category recognition accuracy of the second deep learning model obtained by training on the training sample set.
4. Furthermore, because the first deep learning model performs target detection, the second deep learning model obtained after training on the training sample set can recognize the key fields in a document picture wherever they are located, and thereby obtain their position information. This differs from the prior art, which analyzes and counts key field positions over a large number of templates and extracts key fields by framing fixed document positions with fixed boxes, a localization approach easily disturbed by document deformation, scanning distortion, over-long key field content or line wrapping. Applying the idea of deep learning target detection to locating the key fields of a document gives high accuracy, flexibility and a wider range of application.
Drawings
FIG. 1 is a flow diagram of a manual document structuring method;
FIG. 2 is a flow diagram of a prior art document structuring method;
FIG. 3 is a flowchart of a method for creating a structured document based on a deep learning model according to an embodiment of the present invention;
FIG. 4 is an example training sample;
FIG. 5 is an example character fragment picture of the total amount key field;
FIG. 6 is a block diagram of a specific embodiment of a terminal for creating a structured document based on a deep learning model according to the present invention;
description of reference numerals:
1. a processor; 2. a memory.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Referring to fig. 3 to fig. 6,
the first embodiment of the invention is as follows:
as shown in FIG. 3, the invention provides a method for creating a structured document based on a deep learning model, which comprises the following steps:
s1, presetting a training sample set; each sample in the training sample set comprises a document picture and an annotated document corresponding to the document picture; and the annotation document records the position information and the category information of each key field in the document picture.
For example, 1000 bill pictures are collected and processed as samples; some are used as training samples and some as test samples. Each bill contains a number of fields, among them the key fields of interest. Each sample comprises a document picture and a document in which the key fields are annotated. The annotation document records the position of each key field in the document picture and the category information of that field. Annotation may be done purely manually, or by deep learning pre-annotation followed by manual correction. FIG. 4 shows a sample of a general quota invoice with the positions and categories of four key fields (invoice type, invoice code, invoice number and total amount) annotated. The samples used for training and testing can be supplemented continuously.
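One such annotation document could be recorded as below. The patent does not specify a file format; this JSON layout, the file name and the coordinates are illustrative assumptions, while the category names follow Table 1 of this embodiment.

```python
import json

# Hypothetical annotation document for one training sample: the document
# picture plus position (box) and category of each key field.
annotation = {
    "picture": "invoice_0001.png",          # assumed file name
    "key_fields": [
        {"category": "BillTittle",  "box": [150, 10, 500, 38]},
        {"category": "InvoiceCode", "box": [120, 40, 360, 70]},
        {"category": "InvoiceNo",   "box": [420, 40, 560, 70]},
        {"category": "TotalAmount", "box": [120, 300, 300, 330]},
    ],
}
serialized = json.dumps(annotation)  # stored alongside the picture
```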
And S2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model. The method specifically comprises the following steps:
s21, distributing a unique number for each category of information;
s22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
preferably, the first deep learning model is used for object detection.
For example, there are well-established deep learning models for object detection, such as Faster R-CNN, SSD and YOLO, which can detect whether a given object, such as a cat, a dog or an airplane, appears in an image. This embodiment adopts an existing deep learning network for target detection as the first deep learning model to be trained, but uses it, innovatively, to detect the different key fields. Different key fields belong to different categories, and the content of the same key field may vary.
Because the first deep learning model performs target detection, the second deep learning model obtained after training on the training sample set can recognize the key fields in a document picture wherever they are located, and thereby obtain their position information. This differs from the prior art, which analyzes and counts key field positions over a large number of templates and extracts key fields by framing fixed document positions with fixed boxes, a localization approach easily disturbed by document deformation, scanning distortion, over-long key field content or line wrapping. Applying the idea of deep learning target detection to locating the key fields of a document gives high accuracy, flexibility and a wider range of application.
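The patent only requires that the first model be an object detector, so one plausible post-processing detail can be sketched: a detector may emit several candidate boxes per key field, and keeping the highest-scoring box per category (each field normally occurs once per document) is one reasonable filtering step. The IoU helper and the score threshold are assumptions, not part of the patent.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def best_per_category(detections, min_score=0.5):
    """Keep the most confident box for each key field category."""
    best = {}
    for category, box, score in detections:
        if score >= min_score and (category not in best or score > best[category][1]):
            best[category] = (box, score)
    return best
```

In a real system the detector itself would be a trained network such as Faster R-CNN; only the output filtering is sketched here.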
S23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
s24, comparing the information set with the current labeled document, and calculating to obtain an error value; the information set and the category information in the current markup document are represented by the number;
because the output of the deep learning model is digital, the class information is represented by using the digital number in the labeled document, errors in the process of converting the output result of the deep learning model into the corresponding information class are avoided, the accuracy of comparing the difference between the recognition result of the deep learning model and the standard result is improved, and the accuracy of recognizing the information class of the second deep learning model obtained by training the training sample set is improved.
S25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value, and obtaining the second deep learning model.
In this embodiment, the deep learning model structure adopts a convolutional neural network, a Long Short-Term Memory (LSTM) network and a CTC structure. The convolutional neural network has multiple stages, each containing a number of convolution modules (which extract image features), pooling layers (which reduce the feature map size), and so on.
For example, before the training samples are input into the first deep learning model, each key field of interest is assigned a unique number. The first deep learning model detects the key fields in the input training sample and outputs the position of each detected key field together with its number. During training, a training sample is input directly into the first deep learning model and can be represented in the computer as a 3-dimensional matrix, e.g. I_(w0, h0, c0), where w0 is the width of the document picture in pixels, h0 its height, and c0 its number of color channels: a color picture has three channels (red, green and blue), while a grayscale picture has only one. The key field positions and number-coded category information recorded in the sample's annotation document are then compared with the output of the first deep learning model, a weighted combined localization-and-classification error is calculated and fed back into the first deep learning model, the network parameters are adjusted, and learning continues. The model under training is tested on the test sample set, and training stops once its localization and classification errors have fallen sufficiently and it localizes and classifies well, yielding the trained second deep learning model.
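The I_(w0, h0, c0) representation above can be sketched with nested lists. The width-first axis order follows the paragraph's notation; real frameworks usually store (height, width, channels) or (channels, height, width) instead.

```python
def picture_matrix(w0, h0, c0, fill=0):
    """A document picture as a 3-D matrix I_(w0, h0, c0):
    w0 columns, h0 rows per column, c0 color values per pixel."""
    return [[[fill] * c0 for _ in range(h0)] for _ in range(w0)]

color = picture_matrix(640, 480, 3)  # red, green and blue channels
gray = picture_matrix(640, 480, 1)   # a single grayscale channel
```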
And S3, analyzing the first document picture by the second deep learning model to obtain the position information and the category information of each key field in the first document picture.
And S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture. The method comprises the following specific steps:
s41, obtaining the position information of a key field to obtain the current position information.
The current position information consists of the four vertex coordinates of the smallest rectangle that completely contains the key field.
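As a small sketch, the four vertex coordinates of the smallest axis-aligned rectangle enclosing a set of points (for example, the corners of the detected field region) can be computed as follows; taking a point set as input is an illustrative assumption.

```python
def enclosing_vertices(points):
    """Four vertex coordinates of the smallest axis-aligned rectangle
    that completely contains the given (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    # Vertices in clockwise order starting from the top-left corner.
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
```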
And S42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture.
Wherein, a key field corresponds to a key field picture.
And S43, identifying characters in the key field picture to obtain text information.
Before S43, a third deep learning model needs to be trained to recognize the characters in key field pictures; the third deep learning model then recognizes the characters in the key field picture to obtain the text information. Specifically:
collecting a certain number of character fragment pictures (for example, 100000 pictures), and processing the pictures to be used as samples for deep learning character recognition, wherein a part of the samples are used as training samples, and a part of the samples are used as test samples. Each picture corresponds to a key field. Each character fragment sample comprises a character fragment picture and an annotation document corresponding to the character fragment picture. And recording the character content of the character fragment pictures in the label document corresponding to the character fragment pictures. The marking of the character segment samples can adopt a purely manual method or a method of adopting deep learning pre-marking and then using manual correction. Fig. 5 shows a sample of a character fragment image of a total amount key field, and the character content recorded in the markup document corresponding to the character fragment is 4500.00. The training sample can be continuously supplemented. A third depth model for character recognition is trained using a training sample set.
Before the training samples are input into the deep learning model for training, the character labels are converted into numeric labels: each Chinese character, English letter, digit and punctuation mark of interest is mapped to a unique numeric number. The deep learning task is to detect each character in the input training picture and output the number corresponding to the detected character, i.e. to classify each detected character.
During training, character fragment pictures are input directly into the deep learning network and can be represented in the computer as 3-dimensional matrices. The numeric labels of a training sample are compared with the output of the deep learning model to compute the recognition error and adjust the network parameters. The convolution modules of the network extract the features of the training picture and output a feature map with a certain number of channels, e.g. F_(w1, h1, c1), where w1, h1 and c1 are the width, height and channel count of the feature map after the convolution module. After the multi-stage convolution modules and pooling layers, the feature map output by the convolutional network (denoted F_(wn, hn, cn)) is fed as input into the Long Short-Term Memory (LSTM) network. The feature information of each column in the width direction of the feature map (one pixel wide, spanning the height and channel dimensions) is input to the LSTM network column by column, and for each column the network outputs the probabilities of all possible characters plus one extra character representing "no character". The CTC module processes the LSTM output and emits the integer codes of the recognized valid characters, which are then mapped back into the valid characters recognized by the deep learning model.
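The CTC step described above, with per-column labels, an extra "no character" label, merging of repeats and mapping back to characters, can be sketched as a greedy decode; the label values and the character mapping below are illustrative.

```python
BLANK = 0  # the extra character meaning "no character" at this column

def ctc_decode(column_labels, to_char):
    """Greedy CTC decoding: take the most probable label of each
    feature-map column, merge consecutive repeats, drop blanks, and
    map the remaining integer codes back to characters."""
    chars, prev = [], None
    for label in column_labels:
        if label != prev and label != BLANK:
            chars.append(to_char[label])
        prev = label
    return "".join(chars)

# Columns emitting 4, 4, blank, 5, 5, blank, 5 decode to "455":
decoded = ctc_decode([1, 1, 0, 2, 2, 0, 2], {1: "4", 2: "5"})
```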
The valid characters recognized by the deep learning model are compared with the training sample's annotation document, the recognition error of the network is computed and fed back into the deep learning model, the model parameters are adjusted, and learning continues; training stops once the recognition error has fallen sufficiently and the network recognizes well, yielding the third deep learning model.
The trained third deep learning model is then used to recognize the characters in the key field picture and obtain the text information.
And S44, adding the category information and the text information of the key field to a preset structured document.
The structured document of this embodiment comprises a category field and a text content field; each record in the structured document stores the information of one key field in the document picture.
For example, converting the bill shown in FIG. 4 into a structured document gives Table 1:

TABLE 1

  Category     | Text content
  -------------|----------------------------------------------------------
  BillTittle   | Xiamen city XX fast moving limited company quota invoice
  InvoiceCode  | 1350214543xx
  InvoiceNo    | 00369040
  TotalAmount  | One-hundred-yuan whole
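In code, each record of Table 1 pairs a category with its text content; a minimal sketch of that storage, where the list-of-records layout is an illustrative assumption and the values are taken from Table 1:

```python
# Table 1 as category / text-content records.
structured_document = [
    {"category": "BillTittle",  "text": "Xiamen city XX fast moving limited company quota invoice"},
    {"category": "InvoiceCode", "text": "1350214543xx"},
    {"category": "InvoiceNo",   "text": "00369040"},
    {"category": "TotalAmount", "text": "One-hundred-yuan whole"},
]

def lookup(document, category):
    """Retrieve the text stored for one key field category."""
    return next(r["text"] for r in document if r["category"] == category)
```

A record layout like this is what makes the retrieval and analysis services mentioned earlier straightforward: each key field is queryable by its category.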
And S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed.
This embodiment provides a method and a terminal for creating a structured document based on a deep learning model, unlike the prior art, which reduces the field localization task to cropping the field image from a fixed position in the image. With the document structuring method of this embodiment, a key field may appear at any position in the document picture, so the category and text content of each key field can still be correctly recognized and matched in application scenarios where document pictures are digitized by scanning or photographing and key field positions shift easily within the picture; the accuracy of converting document pictures into structured documents is thereby improved. Moreover, for document pictures with different layouts but the same substantive content, the positions of all categories of key fields can be recognized with a single model; there is no need, as in the prior art, to match each layout against its own dedicated set of key field positions, which saves considerable resources and improves both the efficiency and the accuracy of the conversion. Compared with the existing manual scheme and fixed-position text recognition scheme, the invention greatly improves the speed and accuracy of creating structured documents, reduces the cost of a structured document creation system, and makes it easier to scale such a system up to support more users.
The second embodiment of the invention is as follows:
as shown in fig. 6, the present invention further provides a terminal for creating a structured document based on a deep learning model, which includes one or more processors 1 and a memory 2, wherein the memory 2 stores a program and is configured to be executed by the one or more processors 1 to perform the following steps:
s1, presetting a training sample set; each sample in the training sample set comprises a document picture and a labeled document corresponding to the document picture; and the annotation document records the position information and the category information of each key field in the document picture.
For example, 1000 bill pictures are collected and processed as samples; some are used as training samples and some as test samples. Each bill contains a number of fields, among them the key fields of interest. Each sample comprises a document picture and a document in which the key fields are annotated. The annotation document records the position of each key field in the document picture and the category information of that field. Annotation may be done purely manually, or by deep learning pre-annotation followed by manual correction. FIG. 4 shows a sample of a general quota invoice with the positions and categories of four key fields (invoice type, invoice code, invoice number and total amount) annotated. The samples used for training and testing can be supplemented continuously.
And S2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model. The method specifically comprises the following steps:
S21, assigning a unique number to each category of information;
S22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
Preferably, the first deep learning model is used for object detection.
For example, there are well-established deep learning models for object detection, such as Faster R-CNN, SSD and YOLO, which can detect whether a given object, such as a cat, a dog or an airplane, is present in an image. This embodiment adopts an existing deep learning network model for object detection as the first deep learning model to be trained, but innovatively uses it to detect the different key fields. Different key fields belong to different categories, and the content of the same key field may vary.
Because the first deep learning model is used for object detection, the second deep learning model obtained after training on the training sample set can identify a key field wherever it is located in the document picture, and thereby obtain its position information. This differs from the prior art, which analyzes and counts key-field positions over a large number of templates and extracts key fields by framing fixed positions of the document with fixed boxes; such positioning is easily affected by document deformation, scanning distortion, overly long key-field content, line crossing and other factors. Applying the idea of deep learning object detection to the positioning of document key fields yields high accuracy, high flexibility and a wider range of application.
S23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
S24, comparing the information set with the current annotation document, and calculating an error value; the category information in both the information set and the current annotation document is represented by the numbers assigned in S21;
The output of a deep learning model is numerical. Representing the category information in the annotation document by a number therefore avoids errors when converting the model's output into the corresponding information category, improves the accuracy of comparing the model's recognition result against the reference result, and improves the category-recognition accuracy of the second deep learning model obtained by training on the training sample set.
S25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value, and obtaining the second deep learning model.
In this embodiment, the deep learning model structure adopts a convolutional neural network, a long short-term memory (LSTM) network and a connectionist temporal classification (CTC) structure. The convolutional neural network has a plurality of stages, each of which contains a number of convolution modules (which extract image features), pooling layers (which reduce the feature map size), and so on.
For example, before the training samples are input into the first deep learning model, each key field of interest is assigned a unique number. The first deep learning model detects the key fields in the input training sample and outputs the position of each detected key field together with its corresponding number. During training, a training sample is input directly into the first deep learning model and can be represented in a computer as a 3-dimensional matrix, for example I_(w0, h0, c0), where w0 is the width (in pixels) of the document picture in the input training sample, h0 is its height, and c0 is its number of colour channels: a colour picture has three channels (red, green and blue), while a greyscale picture has only one. The position information and the numbered category information of the key fields in the sample's annotation document are then compared with the output of the first deep learning model, and a weighted combined positioning-and-classification error is calculated. This error is propagated back into the first deep learning model, the parameters of the deep learning network are adjusted, and learning continues. The trained first deep learning model is tested on the test sample set; training stops once its positioning and classification errors have dropped sufficiently and it has good positioning and classification capability, yielding the trained second deep learning model.
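The 3-dimensional representation I_(w0, h0, c0) described above can be illustrated with NumPy; note that NumPy conventionally orders the axes as (height, width, channels), and the arrays below are synthetic stand-ins rather than pictures loaded from files:

```python
import numpy as np

# A colour document picture carries three channels (red, green, blue);
# a greyscale picture carries one. Sizes here are arbitrary examples.
colour_picture = np.zeros((600, 800, 3), dtype=np.uint8)  # h0=600, w0=800, c0=3
grey_picture = np.zeros((600, 800, 1), dtype=np.uint8)    # single channel

h0, w0, c0 = colour_picture.shape
```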
And S3, analyzing the first document picture by the second deep learning model to obtain the position information and the category information of each key field in the first document picture.
And S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture. The method specifically comprises the following steps:
S41, obtaining the position information of a key field to obtain the current position information.
The current position information consists of the four vertex coordinates of the minimal rectangle that completely contains the key field.
And S42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture.
Each key field corresponds to one key field picture.
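A minimal sketch of S42, assuming the four vertices form an axis-aligned rectangle: the key-field picture is cut out of the document picture by array slicing (a rotated rectangle would first need an affine warp; the helper name is ours, not the patent's):

```python
import numpy as np

def crop_key_field(picture, vertices):
    """Cut the sub-image spanned by four (x, y) rectangle vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    # array rows index the y axis, columns the x axis
    return picture[min(ys):max(ys), min(xs):max(xs)]

document = np.zeros((100, 200), dtype=np.uint8)  # synthetic 100x200 picture
field_picture = crop_key_field(
    document, [(10, 20), (40, 20), (40, 35), (10, 35)])
```

The resulting key-field picture covers rows 20-35 and columns 10-40 of the document picture.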
And S43, identifying characters in the key field picture to obtain text information.
Before S43, a third deep learning model needs to be trained for recognizing the characters in a key field picture; it is this model that recognizes the characters in the key field picture to obtain the text information. The training specifically comprises the following steps:
A certain number of character-fragment pictures (for example, 100000) are collected and processed to serve as samples for deep learning character recognition; one part is used for training and the other part for testing. Each picture corresponds to one key field. Each character-fragment sample comprises a character-fragment picture and an annotation document corresponding to it; the annotation document records the character content of the picture. Labeling of the character-fragment samples can be done purely manually, or by deep learning pre-labeling followed by manual correction. Fig. 5 shows a sample character-fragment image of a total-amount key field; the character content recorded in its annotation document is 4500.00. The training samples can be supplemented continuously. The third deep learning model for character recognition is trained using this training sample set.
Before the training samples are input into the deep learning model for training, the character labels are converted into numerical labels: each Chinese character, English letter, digit and punctuation mark of interest is mapped to a unique number. The deep learning model detects each character in the input training picture and outputs the number corresponding to the detected character, i.e. it classifies the detected characters.
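The character-to-number mapping can be sketched as a simple lookup table. The vocabulary below is a tiny illustrative subset; a real system would enumerate every Chinese character, letter, digit and punctuation mark of interest:

```python
# Map each character of interest to a unique number, plus one reserved
# label for the CTC "no character" output (vocabulary is illustrative).
vocabulary = list("0123456789.") + ["<none>"]
char_to_id = {ch: i for i, ch in enumerate(vocabulary)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# The label "4500.00" from the Fig. 5 example, encoded and decoded back.
encoded = [char_to_id[c] for c in "4500.00"]
decoded = "".join(id_to_char[i] for i in encoded)
```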
During training, a character-fragment picture is input directly into the deep learning network and can be represented in a computer as a 3-dimensional matrix; the number labels of the training sample are compared with the output of the deep learning model to calculate the recognition error and adjust the network parameters. After passing through the convolution modules of the deep learning network, features of the training picture are extracted and a feature map with a certain number of channels is output, e.g. F_(w1, h1, c1), where w1, h1 and c1 are respectively the width, height and number of channels of the feature map after the convolution module. After the multi-stage convolution modules and pooling layers, the feature map output by the convolutional network (denoted F_(wn, hn, cn)) is fed as input into a long short-term memory (LSTM) network. The feature information of each column (one pixel wide) along the width of the feature map, covering the height and channel dimensions, is input into the LSTM network column by column, and for each column the network outputs the probabilities of all possible characters plus one additional label representing "no character". The output of the LSTM network is processed by the CTC module, which outputs the integer codes of the recognized valid characters; these are converted by the mapping into the valid characters recognized by the deep learning model.
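The CTC post-processing of the per-column LSTM outputs can be sketched as a greedy decode: take the most probable label for each column, merge consecutive repeats, and drop the "no character" label. The integer labels below are illustrative:

```python
NO_CHAR = -1  # the extra "no character" label emitted per column

def ctc_greedy_collapse(column_labels):
    """Merge consecutive duplicate labels, then drop the no-character label."""
    collapsed = []
    previous = None
    for label in column_labels:
        if label != previous and label != NO_CHAR:
            collapsed.append(label)
        previous = label
    return collapsed

# Eight feature-map columns collapsing to four character codes 4, 5, 0, 0:
codes = ctc_greedy_collapse([4, 4, NO_CHAR, 5, 5, 0, NO_CHAR, 0])
```

The "no character" label between the two 0 columns is what lets CTC keep genuinely repeated characters while still merging duplicates within one character.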
The valid characters recognized by the deep learning model are compared with the annotation document of the training sample, and the recognition error of the deep learning network is calculated. The error is propagated back into the deep learning model, the model parameters are adjusted, and learning continues. Training stops once the recognition error of the deep learning network has dropped sufficiently and the network has good recognition capability, yielding the third deep learning model.
Alternatively, a traditional recognition model can also be used to recognize the characters in the key field picture to obtain the text information.
And S44, adding the category information and the text information of the key field to a preset structured document.
The structured document of this embodiment comprises a category field and a text content field; each record in the structured document stores the information of one key field in the document picture.
For example, converting the ticket shown in FIG. 4 into a structured document is shown in Table 2:
TABLE 2

| Categories  | Text content                                             |
|-------------|----------------------------------------------------------|
| BillTittle  | Xiamen city XX fast moving limited company quota invoice |
| InvoiceCode | 1350214543xx                                             |
| InvoiceNo   | 00369040                                                 |
| TotalAmount | One-hundred-yuan whole                                   |
And S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed.
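The S41 to S45 loop above amounts to accumulating one (category, text content) record per key field. A minimal sketch using the Table 2 values (the record layout is ours, not prescribed by the patent):

```python
# Build the structured document for the Fig. 4 bill: one record per key
# field, each holding the field's category and its recognized text.
structured_document = []
for category, text in [
    ("BillTittle", "Xiamen city XX fast moving limited company quota invoice"),
    ("InvoiceCode", "1350214543xx"),
    ("InvoiceNo", "00369040"),
    ("TotalAmount", "One-hundred-yuan whole"),
]:
    structured_document.append({"category": category, "text": text})
```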
This embodiment provides a method and a terminal for creating a structured document based on a deep learning model. Unlike the prior art, the field positioning task is not simplified into cropping, from a fixed position in the image, the sub-image corresponding to a field. In the document structuring method provided by the invention, a key field may be located at any position on the document picture. The category and the text content of a key field can therefore still be correctly identified and matched in application scenarios where the document picture is stored in a computer by scanning or photographing and the position of the key field easily shifts within the picture, which improves the accuracy of converting the document picture into a structured document. Moreover, for document pictures with different layouts but substantially the same content, the positions of all categories of key fields can be recognized with the same model; there is no need, as in the prior art, to match each layout against its own set of dedicated key-field position information. This greatly saves resources and improves both the efficiency and the accuracy of converting document pictures into structured documents. Compared with existing manual schemes and fixed-position character recognition schemes, the method and the device can greatly increase the speed and accuracy of structured document creation, reduce the cost of a structured document creation system, and make the system easier to scale up to support more users.
The above description is only an embodiment of the present invention and is not intended to limit the scope of the invention; all equivalent structural or process modifications made on the basis of this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims (4)

1. A method for creating a structured document based on a deep learning model, characterized by comprising the following steps:
S1, presetting a training sample set; collecting 1000 bill pictures and processing them to serve as samples; each sample in the training sample set comprises a document picture and an annotation document corresponding to the document picture; the annotation document records the position information and the category information of each key field in the document picture;
S2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model;
S3, analyzing a first document picture by the second deep learning model to obtain position information and category information of each key field in the first document picture;
S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture;
the S4 specifically comprises the following steps:
S41, obtaining position information of a key field to obtain current position information;
S42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture;
S43, identifying characters in the key field picture to obtain text information;
S44, adding the category information of the key field and the text information to a preset structured document;
S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed;
the S2 specifically comprises the following steps:
S21, assigning a unique number to each category of information;
S22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
S23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
S24, comparing the information set with the current annotation document, and calculating to obtain an error value; the category information in both the information set and the current annotation document is represented by the numbers;
S25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value to obtain the second deep learning model, wherein the structure of the second deep learning model adopts a convolutional neural network, a long short-term memory network and a CTC structure.
2. The deep learning model-based method for creating a structured document according to claim 1, wherein the first deep learning model is used for object detection.
3. A terminal for creating a structured document based on a deep learning model, comprising one or more processors and a memory, the memory storing a program configured to be executed by the one or more processors to perform the following steps:
S1, presetting a training sample set; collecting 1000 bill pictures and processing them to serve as samples; each sample in the training sample set comprises a document picture and an annotation document corresponding to the document picture; the annotation document records the position information and the category information of each key field in the document picture;
S2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model;
S3, analyzing a first document picture by the second deep learning model to obtain position information and category information of each key field in the first document picture;
S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture;
the S4 specifically comprises the following steps:
S41, obtaining position information of a key field to obtain current position information;
S42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture;
S43, identifying characters in the key field picture to obtain text information;
S44, adding the category information of the key field and the text information to a preset structured document;
S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed;
the S2 specifically comprises the following steps:
S21, assigning a unique number to each category of information;
S22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
S23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
S24, comparing the information set with the current annotation document, and calculating to obtain an error value; the category information in both the information set and the current annotation document is represented by the numbers;
S25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value to obtain the second deep learning model, wherein the structure of the second deep learning model adopts a convolutional neural network, a long short-term memory network and a CTC structure.
4. The deep learning model-based terminal for creating structured documents according to claim 3, wherein the first deep learning model is used for target detection.
CN201910074243.2A 2019-01-25 2019-01-25 Method and terminal for creating structured document based on deep learning model Active CN109816118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910074243.2A CN109816118B (en) 2019-01-25 2019-01-25 Method and terminal for creating structured document based on deep learning model

Publications (2)

Publication Number Publication Date
CN109816118A CN109816118A (en) 2019-05-28
CN109816118B true CN109816118B (en) 2022-12-06

Family

ID=66604985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910074243.2A Active CN109816118B (en) 2019-01-25 2019-01-25 Method and terminal for creating structured document based on deep learning model

Country Status (1)

Country Link
CN (1) CN109816118B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11854246B2 (en) 2020-06-09 2023-12-26 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, device and storage medium for recognizing bill image

Families Citing this family (14)

Publication number Priority date Publication date Assignee Title
CN110516125B (en) * 2019-08-28 2020-05-08 拉扎斯网络科技(上海)有限公司 Method, device and equipment for identifying abnormal character string and readable storage medium
CN110888926B (en) * 2019-10-22 2022-10-28 北京百度网讯科技有限公司 Method and device for structuring medical text
CN112699906B (en) * 2019-10-22 2023-09-22 杭州海康威视数字技术股份有限公司 Method, device and storage medium for acquiring training data
CN110826488B (en) * 2019-11-06 2022-07-26 思必驰科技股份有限公司 Image identification method and device for electronic document and storage equipment
CN111539416A (en) * 2020-04-28 2020-08-14 深源恒际科技有限公司 End-to-end method for text detection target extraction relation based on deep neural network
US11443082B2 (en) 2020-05-27 2022-09-13 Accenture Global Solutions Limited Utilizing deep learning and natural language processing to convert a technical architecture diagram into an interactive technical architecture diagram
CN111652117B (en) * 2020-05-29 2023-07-04 上海深杳智能科技有限公司 Method and medium for segmenting multiple document images
CN112232336A (en) * 2020-09-02 2021-01-15 深圳前海微众银行股份有限公司 Certificate identification method, device, equipment and storage medium
CN112949574B (en) * 2021-03-29 2022-09-27 中国科学院合肥物质科学研究院 Deep learning-based cascading text key field detection method
CN112990091A (en) * 2021-04-09 2021-06-18 数库(上海)科技有限公司 Research and report analysis method, device, equipment and storage medium based on target detection
CN113127595B (en) * 2021-04-26 2022-08-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
CN113221792B (en) * 2021-05-21 2022-09-27 北京声智科技有限公司 Chapter detection model construction method, cataloguing method and related equipment
CN113743361A (en) * 2021-09-16 2021-12-03 上海深杳智能科技有限公司 Document cutting method based on image target detection
CN116886955B (en) * 2023-07-24 2024-04-16 北京泰策科技有限公司 Video analysis method and system based on ffmpeg and yolov5

Citations (6)

Publication number Priority date Publication date Assignee Title
CN103854019A (en) * 2012-11-29 2014-06-11 北京千橡网景科技发展有限公司 Method and device for extracting fields in image
CN107133621A (en) * 2017-05-12 2017-09-05 江苏鸿信系统集成有限公司 The classification of formatting fax based on OCR and information extracting method
CN108108387A (en) * 2016-11-23 2018-06-01 谷歌有限责任公司 Structured document classification and extraction based on masterplate
CN108133212A (en) * 2018-01-05 2018-06-08 东华大学 A kind of quota invoice amount identifying system based on deep learning
CN109034159A (en) * 2018-05-28 2018-12-18 北京捷通华声科技股份有限公司 image information extracting method and device
CN109086756A (en) * 2018-06-15 2018-12-25 众安信息技术服务有限公司 A kind of text detection analysis method, device and equipment based on deep neural network

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20070172130A1 (en) * 2006-01-25 2007-07-26 Konstantin Zuev Structural description of a document, a method of describing the structure of graphical objects and methods of object recognition.
RU2651144C2 (en) * 2014-03-31 2018-04-18 Общество с ограниченной ответственностью "Аби Девелопмент" Data input from images of the documents with fixed structure

Non-Patent Citations (2)

Title
A novel text structure feature extractor for Chinese scene text detection and recognition; Xiaohang Ren; 2017-04-24; full text *
Research on semantic information extraction methods in semi-structured documents; Li Yi; China Master's Theses Full-text Database (中国优秀硕士论文电子期刊网); 2005-07-15; full text *

Also Published As

Publication number Publication date
CN109816118A (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN109800761B (en) Method and terminal for creating paper document structured data based on deep learning model
CN109816118B (en) Method and terminal for creating structured document based on deep learning model
CN109902622B (en) Character detection and identification method for boarding check information verification
CN107633239B (en) Bill classification and bill field extraction method based on deep learning and OCR
EP3437019B1 (en) Optical character recognition in structured documents
CN113837151B (en) Table image processing method and device, computer equipment and readable storage medium
CN115424282A (en) Unstructured text table identification method and system
CN116052193B (en) RPA interface dynamic form picking and matching method and system
CN113591866A (en) Special job certificate detection method and system based on DB and CRNN
CN111027456A (en) Mechanical water meter reading identification method based on image identification
CN110796145B (en) Multi-certificate segmentation association method and related equipment based on intelligent decision
CN114694130A (en) Method and device for detecting telegraph poles and pole numbers along railway based on deep learning
CN110647824A (en) Value-added tax invoice layout extraction method based on computer vision technology
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium
CN111914706A (en) Method and device for detecting and controlling quality of character detection output result
US20230154217A1 (en) Method for Recognizing Text, Apparatus and Terminal Device
CN111008635A (en) OCR-based multi-bill automatic identification method and system
CN115713775A (en) Method, system and computer equipment for extracting form from document
CN113420116B (en) Medical document analysis method, device, equipment and medium
CN112257525A (en) Logistics vehicle card punching identification method, device, equipment and storage medium
CN113610043A (en) Industrial drawing table structured recognition method and system
CN116306576B (en) Book printing error detection system and method thereof
CN117372510B (en) Map annotation identification method, terminal and medium based on computer vision model
CN116994282B (en) Reinforcing steel bar quantity identification and collection method for bridge design drawing
Jiao et al. Research on automatic identification algorithm of invoice information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant