CN115116079A - Image-based official document element information extraction method and device - Google Patents

Image-based official document element information extraction method and device Download PDF

Info

Publication number
CN115116079A
Authority
CN
China
Prior art keywords
image
document
document element
official
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210657233.3A
Other languages
Chinese (zh)
Inventor
程世清
王思宇
陈仁平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
31511 Unit Of Chinese Pla
Original Assignee
31511 Unit Of Chinese Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 31511 Unit Of Chinese Pla filed Critical 31511 Unit Of Chinese Pla
Priority to CN202210657233.3A priority Critical patent/CN115116079A/en
Publication of CN115116079A publication Critical patent/CN115116079A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/15Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques


Abstract

The embodiment of the invention provides an image-based method and device for extracting official document element information. The method comprises: acquiring official documents generated by an organ, converting documents in different storage formats into images in a preset format, and preprocessing these images to obtain preprocessed document images; detecting each document element region in the preprocessed document image with a pre-trained detection model, cropping each region to obtain a corresponding document element image, and recognizing the text content in each document element image; and, for the recognized text content of each document element image, extracting the corresponding document element content and outputting all document element contents of the document in a preset format. Based on image processing technology, element information can be extracted from official documents in a variety of formats.

Description

Image-based official document element information extraction method and device
Technical Field
The invention relates to the field of file processing, in particular to a method and a device for extracting document element information based on an image.
Background
Party, government and military offices generate a large number of official documents in daily work, and this unstructured data has important value. Extracting key document element information is significant for the structured conversion of document data, automated management and intelligent office work. At present, element information of organ official documents is extracted mainly by relying on the relatively standard format of such documents and using regular expressions for matching, but this method has two problems:
firstly, being a text-processing method, it is limited to documents in text formats such as doc, docx and txt, and cannot extract information from image documents such as scans and photocopies;
secondly, although official documents strictly regulate format, elements, fonts and other aspects of presentation, typesetting varies greatly: some layouts use hidden table frames, some create blank space with paragraph carriage returns and small font sizes, and some insert straight lines. In the content itself, the keyword and content of the header and colophon are sometimes written as a whole and sometimes split across two table cells, so the positions of text before and after a piece of information are not uniform and the matching relationship is not fixed. Rule matching therefore generalizes poorly, has a high error probability, and its extraction results require a large amount of time to correct, making it difficult to use as a general means for large-scale information extraction and structured transformation of official document information.
Disclosure of Invention
The embodiment of the invention provides an image-based official document element information extraction method and device, which can extract element information of official documents in various formats based on an image processing technology.
To achieve the above object, in one aspect, an embodiment of the present invention provides an image-based method for extracting document element information, including:
acquiring documents generated by an organization, converting documents in different storage formats into images in a preset format, and preprocessing the images in the preset format to obtain preprocessed document images;
detecting each official document element area in the preprocessed official document image through a pre-trained detection model, cutting each official document element area to obtain a corresponding official document element image, and identifying text contents in each official document element image; wherein the document elements include at least one of: number of official documents, secret grade and confidentiality period, emergency degree, issuing organization identification, issuing character number, signing and issuing person, subject word, copying organization, contact person, telephone and mail box;
and for the recognized text content in each document element image, extracting the corresponding document element content, and outputting all document element contents in the document according to a preset format.
On the other hand, an embodiment of the present invention provides an image-based apparatus for extracting document element information, including:
the document preprocessing module is used for acquiring documents generated by an organization, converting the documents with different storage formats into images with preset formats and preprocessing the images with the preset formats to obtain preprocessed document images;
the official document element region segmentation module is used for detecting each official document element region in the preprocessed official document image through a pre-trained detection model, cutting each official document element region to obtain a corresponding official document element image, and identifying text contents in each official document element image; wherein the document elements include at least one of: number of official documents, secret grade and confidentiality period, emergency degree, issuing organization identification, issuing character number, signing and issuing person, subject word, copying organization, contact person, telephone and mail box;
and the document element content extraction module is used for extracting, from the recognized text content in each document element image, the corresponding document element content, and outputting all document element contents in the document according to a preset format.
The technical scheme has the following beneficial effects: based on the image processing technology, the element information of the official documents in various formats can be extracted.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for extracting document element information based on images according to an embodiment of the present invention;
FIG. 2 is a block diagram of an apparatus for extracting document element information based on images according to an embodiment of the present invention;
FIG. 3 is a general flow diagram of an embodiment of the present invention;
FIG. 4 is a flowchart illustrating document element detection and identification according to an embodiment of the present invention;
FIG. 5 is a graph of model training results for an embodiment of the present invention;
fig. 6 is a diagram of detecting and identifying effects of contact phone and mailbox element areas according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, in combination with the embodiment of the present invention, there is provided an image-based document element information extraction method, including:
s101: acquiring documents generated by an organization, converting documents in different storage formats into images in a preset format, and preprocessing the images in the preset format to obtain preprocessed document images;
s102: detecting each official document element area in the preprocessed official document image through a pre-trained detection model, cutting each official document element area to obtain a corresponding official document element image, and identifying text contents in each official document element image; wherein the document elements include at least one of: number of official documents, secret grade and confidentiality period, emergency degree, issuing organization identification, issuing character number, signing and issuing person, subject word, copying organization, contact person, telephone and mail box;
s103: and extracting corresponding document element contents from the text contents in each recognized document element image, and outputting all document element contents in the document according to a preset format.
Preferably, in step 101, the converting documents in different storage formats into image formats includes:
s1011: if the acquired official document is in a document type format, firstly converting the format of the official document into a PDF format, and then converting the PDF format into a preset format image with the image resolution within a preset range;
s1012: if the acquired official document is in the picture type format, unifying the format of the official document into a preset format image with the image resolution within a preset range.
Preferably, the pictures in the picture format include a shot picture, a scanned picture, and a copied picture;
in step 101, the preprocessing the image with the preset format further includes:
s1013: if the acquired official document is in a picture-type format, or is a document-type PDF that contains pictures and was not produced by converting a document (for example, a scanned PDF), then for each pixel of the corresponding preset-format image, adjusting the image contrast and brightness according to the default values of the image contrast parameter and the image brightness parameter; when the enhancement effect obtained with the default parameter values cannot meet the preset requirement, adjusting the contrast and brightness parameter values until the enhancement effect meets the requirement, obtaining an enhanced image;
s1014: removing noise points of the enhanced image by median filtering, obtaining a preprocessed document image in which character edges are preserved intact and no character information is lost.
Preferably, in step 102, detecting each document element area in the preprocessed document image through a pre-trained detection model, cutting each document element area to obtain a corresponding document element image, and identifying text content in each document element image, specifically including:
s1021: automatically detecting each document element region in the preprocessed document image and the corresponding document element category through a pre-trained YOLOV5 model, forming an output result for each document element region; the output result is expressed as: document element category, detection frame center point x, center point y, detection frame width w, detection frame height h, and confidence;
s1022: converting the detection frame center point x, center point y, width w and height h into the coordinate format (x1, y1, x2, y2) used to segment the corresponding document element region, where x1, y1 are the horizontal and vertical coordinates of the top-left vertex of the document element region frame, and x2, y2 are the horizontal and vertical coordinates of the bottom-right vertex: x1 = x - w/2; y1 = y - h/2; x2 = x + w/2; y2 = y + h/2;
S1023: calling the crop() function of the Image module of the Python image processing library PIL: the coordinates (x1, y1, x2, y2) of each document element region are passed to crop() to segment that region; after segmentation, the corresponding document element image is returned, and the document element category and confidence are marked in the image;
s1024: for each document element image, recognizing and extracting the text content in it with the pytesseract module of the Python language, and returning the extracted text content as a txt file.
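Steps S1022 to S1024 can be sketched as follows. This is an illustrative sketch rather than code from the patent: the function names, the lazy imports, and the `lang="chi_sim"` OCR setting are our assumptions; only the center-to-corner conversion and the use of PIL's `crop()` and `pytesseract.image_to_string()` follow the text.

```python
def box_to_corners(x, y, w, h):
    """Step S1022: convert a detection-frame center/size tuple to (x1, y1, x2, y2)."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def crop_and_ocr(image_path, detection, lang="chi_sim"):
    """Steps S1023-S1024: crop one detected element region and recognize its text.

    Requires Pillow and pytesseract (plus a Tesseract install); both are
    imported lazily so box_to_corners stays usable without them.
    """
    from PIL import Image  # PIL crop(), as in step S1023
    import pytesseract     # OCR module, as in step S1024

    cls, x, y, w, h, conf = detection
    region = Image.open(image_path).crop(box_to_corners(x, y, w, h))
    return cls, conf, pytesseract.image_to_string(region, lang=lang)
```

Whether the detector emits pixel or normalized coordinates depends on the YOLOv5 configuration, which the text does not specify; pixel coordinates are assumed here.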
Preferably, in step 103, for the text content in each identified document element image, extracting corresponding document element content, and outputting all the extracted document element contents in the document according to a preset format, specifically including:
s1031: for the txt file of each document element image, if the text content in the txt file does not contain a keyword, taking the text content directly as the element content; if it does contain a keyword, taking the text content with the keyword and redundant punctuation removed as the extracted element content; the element content is extracted from the txt file with the Python string replacement function replace();
s1032: constructing a key value pair aiming at each extracted text content, expressing the document element category through a key of the key value pair, and expressing the extracted element content through the value of the key value pair; and composing all key-value pairs in the document into a dictionary and outputting the dictionary.
As shown in fig. 2, in accordance with an embodiment of the present invention, there is provided an image-based official document element information extraction apparatus, including:
the document preprocessing module 21 is configured to acquire documents generated by an organization, convert documents in different storage formats into images in a preset format, and preprocess the images in the preset format to obtain preprocessed document images;
the official document element region segmentation module 22 is used for detecting each official document element region in the preprocessed official document image through a pre-trained detection model, cutting each official document element region to obtain a corresponding official document element image, and identifying text contents in each official document element image; wherein the document elements include at least one of: number of official documents, secret grade and confidentiality period, emergency degree, issuing organization identification, issuing character number, signing and issuing person, subject word, copying organization, contact person, telephone and mail box;
and the document element content extraction module 23 is configured to extract corresponding document element contents from the text contents in each identified document element image, and output all document element contents in the document in a preset format.
Preferably, the official document preprocessing module 21 includes a format conversion submodel 211, and the format conversion submodel 211 is specifically used for:
if the acquired official document is in a document type format, firstly converting the format of the official document into a PDF format, and then converting the PDF format into a preset format image with the image resolution within a preset range;
if the acquired official document is in the picture type format, unifying the format of the official document into a preset format image with the image resolution within a preset range.
Preferably, the pictures in the picture format include a shot picture, a scanned picture, and a copied picture;
the document preprocessing module 21 further includes:
an image enhancement submodel 212, configured to: if the acquired official document is in a picture-type format, or is a document-type PDF that contains pictures and was not produced by converting a document, adjust the image contrast and brightness for each pixel of the corresponding preset-format image according to the default values of the image contrast and brightness parameters; when the enhancement effect obtained with the default values cannot meet the preset requirement, adjust the contrast and brightness parameter values until the enhancement effect meets the requirement, obtaining an enhanced image;
and the image denoising submodel 213 is used for removing noise points of the enhanced image by adopting median filtering to obtain a preprocessed document image with complete character edge reservation and no character information loss.
Preferably, the document element region division module 22 includes:
the document element region recognition submodel 221 is used for automatically detecting each document element region in the preprocessed document image and the corresponding document element category through a pre-trained YOLOV5 model, forming an output result for each document element region; the output result is expressed as: document element category, detection frame center point x, center point y, detection frame width w, detection frame height h, and confidence;
the document element region segmentation submodel 222 is used for converting the detection frame center point x, center point y, width w and height h into the coordinate format (x1, y1, x2, y2) used to segment the corresponding document element region, where x1, y1 are the horizontal and vertical coordinates of the top-left vertex of the document element region frame, and x2, y2 are the horizontal and vertical coordinates of the bottom-right vertex: x1 = x - w/2; y1 = y - h/2; x2 = x + w/2; y2 = y + h/2; and for calling the crop() function of the Image module of the Python image processing library PIL: the coordinates (x1, y1, x2, y2) of each document element region are passed to crop() to segment that region; after segmentation, the corresponding document element image is returned, and the document element category and confidence are marked in the image;
and the document content extraction submodel 223 is used for recognizing and extracting the text content in each document element image with the pytesseract module of the Python language, returning the extracted text content as a txt file.
Preferably, the document element content extraction module 23 includes:
an extraction submodel 231, configured to: for the txt file of each document element image, if the text content in the txt file does not contain a keyword, use the text content directly as the element content; if it does contain a keyword, take the text content with the keyword and redundant punctuation removed as the extracted element content; the element content is extracted from the txt file with the Python string replacement function replace();
an extracted-content conversion and merging submodel 232, configured to construct a key-value pair for each extracted text content, where the key of the key-value pair represents the document element category and the value represents the extracted element content; all key-value pairs in the document are combined into a dictionary and output.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
The invention discloses a method and a system for extracting official document element information based on image processing technology. It relates to the fields of natural language processing and computer vision, and is mainly used for extracting element information from official documents of party, government, military and similar institutions. Given that the structure, collocation, font, size and keywords of document element information are relatively fixed in appearance, the invention moves beyond processing text information with natural language technology alone, avoids the errors that easily occur in regular expression matching and extraction, achieves high generalization ability and precision, and provides a solution for large-scale information extraction and structured transformation of official documents.
The general flow of the technical scheme of the invention is shown in figure 3, and the details are as follows:
Operating environment: Windows 10, the Python 3.9.7 programming language, and the Anaconda3 package and environment manager.
Document elements: key information in an official document, such as the number of copies, secret grade and confidentiality period, degree of urgency, issuing organ identification, issued document number, signer, subject terms, copied-to organs, contact person, telephone and mailbox.
step1. format conversion
Electronic official document formats fall into two main categories: document-type official documents and picture-type official documents. The document class mainly comprises formats such as DOC, DOCX, WPS and PDF; the picture class mainly comprises JPEG, BMP, PNG, TIFF and so on. Document-class format conversion proceeds in two steps: formats such as DOC, DOCX and WPS are first converted into PDF, and the PDF is then converted into JPEG images; picture-class format conversion means that documents in formats such as BMP, PNG and TIFF are uniformly converted into JPEG images. After conversion, both types of documents finally form images with a resolution of 794 × 1120.
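A minimal sketch of this routing logic, under the assumption that a tool chain such as docx2pdf / LibreOffice plus the pdf2image package handles the actual conversions; the patent names the formats but not the tools, so the library choices here are ours.

```python
from pathlib import Path

DOC_FORMATS = {".doc", ".docx", ".wps", ".pdf"}
PIC_FORMATS = {".jpeg", ".jpg", ".bmp", ".png", ".tiff"}

TARGET_SIZE = (794, 1120)  # final image resolution stated in the text

def conversion_route(path):
    """Return which conversion chain an official document file needs."""
    ext = Path(path).suffix.lower()
    if ext in DOC_FORMATS:
        # DOC/DOCX/WPS are first converted to PDF, then rasterized to JPEG
        return "pdf->jpeg" if ext == ".pdf" else "doc->pdf->jpeg"
    if ext in PIC_FORMATS:
        return "->jpeg"  # picture formats are converted to JPEG directly
    raise ValueError(f"unsupported official document format: {ext}")

def pdf_to_images(pdf_path):
    """Rasterize a PDF to 794x1120 pages (requires the pdf2image package)."""
    from pdf2image import convert_from_path  # lazy: optional dependency
    return [page.resize(TARGET_SIZE) for page in convert_from_path(pdf_path)]
```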
step2. image enhancement processing
Picture-type official documents are often obtained by scanning or copying, and to varying degrees suffer from unclear images and noise. To improve image quality, an enhancement operation is performed on the images produced in step1. The specific implementation processes each pixel point (i, j) of the image according to formula (1):
m(i,j)=af(i,j)+b (1)
wherein m (i, j) is the enhanced image, f (i, j) is the original image, the parameter a is used for adjusting the contrast of the image, when a >1 is the enhanced contrast, and when 0< a <1, the contrast is reduced. The parameter b is used for adjusting the brightness of the image, the default value of a is set to 20, the default value of b is set to 30, and the specific numerical value needs to be adjusted according to the brightness and the definition of the original document image.
After the brightness and contrast of the image are adjusted, noise points are removed by median filtering. Median filtering is chosen mainly because scanned and copied files contain considerable noise, and when removing noise from a document image this method better protects text edges and reduces the loss of text information. The specific implementation is:
m(i,j)=med{f(i-a,j-b),(a,b∈T)} (2)
wherein m (i, j) is the enhanced image, f (i, j) is the original image, and T is the two-dimensional template.
step3. element region detection
The document element information regions of the enhanced document image are detected with a YOLOV5 model. This method is chosen for two main reasons: first, the external structure and collocation of document element information is fixed, the fonts and sizes are standard, and the features are obvious, so a deep-learning object detection method can frame the boundary of each piece of element information and the model can automatically judge the element category; second, YOLOV5 is further upgraded and optimized on the basis of YOLOV4's high precision and low consumption, making it easier for users to configure the environment and train on a dataset; the model is simple and efficient, with a detection speed of up to 140 FPS, and is suitable for rapid deployment in actual business. The details are as follows:
firstly, a certain number of official document images are selected for marking, and a training sample is generated. Labeling is performed according to the object of each document element, the category name is composed of the name of the document element in English abbreviation, and the labeling range covers the area of the element content, for example, the element content of 'secret level and secret period' is 'secret ^ 2 years'The label area is a text area covering "secret ≧ 2 years", and the category is labeled "urg". If the element contains an element keyword, the element keyword and a separator need to be covered in the labeling area, for example, if the content of the element of the ' issuer ' is ' Zhang III ', and the element keyword is ' issuer ', the labeling area needs to cover ' issuer: zhang III ". During labeling, the area of the labeling frame is minimized as much as possible under the condition that the content is completely covered in the labeling frame, so that the accuracy of model identification is improved. Suppose the coordinate of the upper left corner of the element content is (x) 1 ,y 1 ) The coordinate of the lower right corner is (x) 2 ,y 2 ) If yes, label area a ═ min { (x) 2 -x 1 )(y 2 -y 1 )}。
In this manner, the document elements (number of copies, secret grade and confidentiality period, degree of urgency, issuing organ identification, issued document number, signer, subject terms, copied-to organs, contact person, telephone, mailbox) are annotated, with the categories respectively labeled: "num", "sec", "urg", "ide", "tnu", "sig", "key", "cop", "name", "tel", "email".
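The annotation above can be sketched as producing a YOLO-format label line: the (x1, y1, x2, y2) corners of the minimal covering box are converted to the normalized center/size form that YOLOv5 training labels use. The class-index ordering below follows the category list in the text, but the mapping itself is our assumption.

```python
CATEGORIES = ["num", "sec", "urg", "ide", "tnu", "sig",
              "key", "cop", "name", "tel", "email"]

def yolo_label(category, x1, y1, x2, y2, img_w=794, img_h=1120):
    """Return a YOLO label line 'class cx cy w h', normalized to [0, 1]."""
    cls = CATEGORIES.index(category)
    cx = (x1 + x2) / 2 / img_w   # normalized box center
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # normalized box size
    h = (y2 - y1) / img_h
    return f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```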
The YOLOV5 model is then trained with the annotated training samples. To make the model faster and meet the timeliness requirements of large-scale extraction, the lightweight version YOLOV5s is selected: the layer scaling factor of the BottleneckCSP module and the scaling factor of the convolution channels are set to 0.33 and 0.50 respectively, the numbers of conv convolution kernels are 32, 64, 128, 256 and 512 with a stride of 2, the pre-training weight file is yolov5x.pt, and the default input picture resolution is 640. Training samples are randomly divided into a training set and a validation set at a ratio of 8:2 (8 for training, 2 for validating the model) and input into the model. Before input, the sample image keeps its long side of 1120 unchanged, and the short side of 794 is padded with 6 pixels of pure gray, adjusting it to a multiple of 32 and yielding an image of size 800 × 1120, which serves as the input of the model's Focus layer. The model slices the sample image through 2× down-sampling into a 400 × 560 × 12 feature map, which a convolution with 32 kernels turns into a 400 × 560 × 32 feature map; this is processed by a 3-layer BCSP structure and handed to SPP, and the convolution and concatenation operations of the model's 14-layer Head structure finally hand it to Detect for output. These steps are repeated and iterated, continuously optimizing the parameters until model training is complete.
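The input-size adjustment described above (pad each side up to the nearest multiple of 32, as in YOLOv5-style letterboxing) can be sketched as a small helper; the even left/right split of the padding is our assumption, since the text only gives the 6-pixel total.

```python
def pad_to_multiple(size, multiple=32):
    """Return (padded size, leading pad, trailing pad) for one image dimension."""
    target = -(-size // multiple) * multiple  # ceil size to a multiple of 32
    pad = target - size
    return target, pad // 2, pad - pad // 2
```

With the document's 794 × 1120 images this yields 794 → 800 (3 pixels of gray on each side) while 1120, already a multiple of 32, is unchanged, matching the 800 × 1120 Focus-layer input stated in the text.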
For model performance evaluation, in addition to the conventional precision, recall and mean average precision (mAP), the model loss function is computed as the sum of three indexes: the bounding-box loss (box_loss), the confidence loss (obj_loss) and the classification loss (cls_loss), as shown in formula (3).
Loss=box_loss+obj_loss+cls_loss (3)
The confidence loss and classification loss are computed with binary cross-entropy loss (BCELoss), while the bounding-box loss is computed with the CIOU loss. The conventional bounding-box loss uses the intersection-over-union ratio (IOU), which does not account for the case where the target box and the prediction box do not overlap; when they do not overlap, the gradient of the IOU loss is 0 and no optimization can take place. The CIOU loss considers not only the non-overlap of the two boxes, but also the distance between their center points and their aspect ratios, so it describes the regression of the prediction box more fully. It is computed as formula (4).
CIOU_Loss = 1 - IOU + ρ²(b, b^gt)/c² + αv (4)
where ρ²(b, b^gt) is the squared distance between the center points of the prediction box b and the target box b^gt, c is the diagonal length of the smallest enclosing box covering both the prediction box and the target box, and v measures the consistency of the aspect ratios of the prediction box and the target box, computed as formula (5).
v = (4/π²) · (arctan(w^gt/h^gt) - arctan(w/h))² (5)
α is a balance parameter, computed as formula (6).
α = v / ((1 - IOU) + v) (6)
The IOU is the ratio of the intersection to the union of the areas of the prediction box and the target box, computed as formula (7).
IOU = |B ∩ B^gt| / |B ∪ B^gt| (7)
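Formulas (4) to (7) can be verified with a short pure-Python sketch. Boxes are assumed here to be in (x1, y1, x2, y2) corner form, and c is taken as the diagonal of the smallest enclosing box, per the standard CIOU definition; the function names are illustrative.

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form, formula (7)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ciou_loss(pred, tgt):
    """CIOU loss of formula (4): 1 - IOU + rho^2/c^2 + alpha*v."""
    u = iou(pred, tgt)
    # squared center distance rho^2(b, b_gt)
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_t, cy_t = (tgt[0] + tgt[2]) / 2, (tgt[1] + tgt[3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    # squared diagonal c^2 of the smallest box enclosing both boxes
    ex1, ey1 = min(pred[0], tgt[0]), min(pred[1], tgt[1])
    ex2, ey2 = max(pred[2], tgt[2]), max(pred[3], tgt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # aspect-ratio term v (formula (5)) and balance parameter alpha (formula (6))
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wt, ht = tgt[2] - tgt[0], tgt[3] - tgt[1]
    v = (4 / math.pi ** 2) * (math.atan(wt / ht) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - u) + v) if v > 0 else 0.0
    return 1 - u + rho2 / c2 + alpha * v
```

Identical boxes give a loss of 0, and, unlike a plain IOU loss, disjoint boxes still produce a nonzero gradient through the center-distance term.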
After model training is finished, the document image to be processed is input into the model, which predicts the element regions and the element category of each region.
Step4. Region segmentation and recognition
The model outputs each detected document element region in the format (category, detection-box center point x value, center point y value, detection-box width w, detection-box height h, confidence). The four middle values are taken and converted into the coordinate format (x1, y1, x2, y2) convenient for image segmentation, where x1, y1 are the horizontal and vertical coordinate values of the top-left vertex of the element region box, and x2, y2 are those of the bottom-right vertex. The conversion is: x1 = x - w/2; y1 = y - h/2; x2 = x + w/2; y2 = y + h/2. After conversion, the image is segmented with these coordinate values to generate a document element image, the element category of the image is recorded, and finally text recognition and extraction are performed on the document element image.
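The center-to-corner conversion above can be written as a small helper (the function name is hypothetical):

```python
def xywh_to_xyxy(x, y, w, h):
    """Convert a detection result (center x, center y, width, height)
    into (x1, y1, x2, y2) corner coordinates for cropping."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)
```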
Text recognition uses Tesseract, an open-source optical character recognition (OCR) engine that currently supports mainstream platforms such as Windows, Linux and macOS and recognizes the world's mainstream written languages well. Through Tesseract, the element contents are extracted from the JPEG-format document element images into txt-format text, laying the foundation for the subsequent matching of document element contents.
Step5. Element content matching
The element contents of the recognized txt-format text are sorted; they are extracted from the txt-format text as values to form key-value pairs, implemented through the string replace function. Specifically: for element contents without keywords, the entire recognized content is taken directly as the extracted content. If the element content contains a keyword, the keyword and redundant punctuation marks must be removed, and the remainder is the extracted content (the removed keyword itself is not needed, since when the key-value pair is formed, the image element category recognized in Step4 actually takes its place). For example, for elements such as the document copy number, urgency degree, issuing-organization identification, secret grade and confidentiality period, and issued-character number, the recognized content is the extracted content. Keywords and redundant punctuation are removed with the Python string replace function replace('parameter 1', 'parameter 2'), where parameter 1 is set to the keyword to be removed together with the colon after it, and parameter 2 is set to null (the empty string). replace('parameter 1', 'parameter 2') substitutes parameter 2 for parameter 1 and returns the processed result; with parameter 2 null, replacing parameter 1 removes the keyword and the colon behind it. The return value of replace is output as the element content, i.e. the 'element content 1', 'element content 2', 'element content 3' in the Step6 dictionary format.
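The replace-based keyword stripping can be sketched as follows; handling both the ASCII and full-width colon is an assumption added for illustration, and the function name is hypothetical.

```python
def strip_keyword(text, keyword):
    """Remove the keyword and the colon after it via str.replace
    (parameter 2 is the empty string), leaving only the element content."""
    for colon in (":", "："):  # ASCII and full-width colon variants
        text = text.replace(keyword + colon, "")
    return text.strip()

# e.g. strip_keyword("telephone:13800000000", "telephone") yields "13800000000"
```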
Step6. Extraction result output
The extracted element contents and element categories are used respectively as values and keys to form a dictionary format: {"element category 1": "element content 1", "element category 2": "element content 2", "element category 3": "element content 3", ...}.
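Forming the Step6 dictionary from the element categories (keys) and the extracted contents (values) then reduces to pairing the two lists; a minimal sketch with a hypothetical function name:

```python
def build_result(categories, contents):
    """Pair each element category with its extracted content,
    e.g. {"tel": "13800000000", "email": "office@example.com"}."""
    return dict(zip(categories, contents))
```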
The beneficial effects obtained by the invention are as follows:
The invention is based on image processing technology and can extract element information from documents in various formats such as DOC, DOCX, WPS, PDF, JPEG, BMP, PNG and TIFF, overcoming the limitation that methods based on natural language processing can only extract from text documents.
The method performs region detection and recognition through the image features of document elements, overcoming the errors that matching with traditional regular expressions is prone to when the internal layout of documents is not uniform, and offers strong generalization ability and accuracy.
The technical solution of the present invention is exemplified as follows:
In this embodiment, 273 document samples in different formats were produced; noise points were added to the image document samples, and different gray levels were used to simulate the text-distortion effects caused by scanning and photographing. The system is constructed from a format discrimination and conversion module, an image enhancement module, an element region detection module, a region segmentation and recognition module, and an element content matching module; it is explained below by flow sub-module, and the main functions are verified experimentally.
1. Format conversion module
Samples are read in sequentially with file-operation functions and the type of each document is judged from its file extension. Documents in DOC, DOCX and WPS formats are converted to Portable Document Format (PDF) using the ExportAsFixedFormat function of the win32com module's client package, with the ExportFormat parameter set to 17 and the Item parameter set to 7; the get_pixmap function of the fitz package then renders each A4 page of the PDF-format document as a JPEG image of 794 × 1120 pixels, named in the form 'file name + page number'. Image documents read in BMP, PNG, TIFF and other formats are stored directly as JPEG images using the Image package of the PIL library, with their sizes likewise adjusted uniformly to 794 × 1120 pixels.
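The extension-based type judgment can be sketched as a small dispatcher. The category names and extension sets are illustrative assumptions; the actual conversion calls (win32com's ExportAsFixedFormat, fitz's get_pixmap, PIL's Image) are omitted here.

```python
import os

DOC_LIKE = {".doc", ".docx", ".wps", ".pdf"}
IMAGE_LIKE = {".jpeg", ".jpg", ".bmp", ".png", ".tiff", ".tif"}

def classify_document(path):
    """Judge the document type from the file extension: 'doc' formats
    are routed through PDF rendering, 'image' formats are re-saved as JPEG."""
    ext = os.path.splitext(path)[1].lower()
    if ext in DOC_LIKE:
        return "doc"
    if ext in IMAGE_LIKE:
        return "image"
    return "unknown"
```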
2. Image enhancement module
For images converted from text-class documents, the brightness and contrast are kept unchanged and no denoising is needed: in formula (1), a is set to 1 and b to 0. For image-class documents, contrast improvement, brightness processing and denoising are performed: in formula (1), a is set to 20 and b to 30, and the parameters are adjusted according to the subsequent detection effect. In formula (2), T is chosen as a 3 × 3 region for the case where the document image noise points are small.
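Formula (1) is not reproduced in this excerpt; assuming it is the common linear point transform g = a·f + b clipped to the 8-bit range, the per-pixel adjustment can be sketched as follows. The function names are hypothetical and the exact form of formula (1) in the method may differ.

```python
def adjust(pixel, a=1.0, b=0):
    """Linear brightness/contrast point transform g = a * f + b,
    clipped to the valid 8-bit range [0, 255] (an assumed form of formula (1))."""
    return max(0, min(255, round(a * pixel + b)))

def enhance(image, a=1.0, b=0):
    """Apply the transform to every pixel of a grayscale image
    given as a list of rows; a=1, b=0 leaves the image unchanged."""
    return [[adjust(p, a, b) for p in row] for row in image]
```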
3. Element region detection module
Considering that only the method flow is being verified, this embodiment selects two elements, the contact telephone and the mailbox, for the information extraction test. The telephone and mailbox elements of the 273 document images are labeled with the labelImg tool, with category labels 'tel' and 'email' and the data output in YOLO format. The labeled sample data are then randomly divided 8:2 into a training set of 218 samples and a validation set of 55 samples. The YOLOv5 model is configured with the parameters of Step3 and, once configuration is finished, the samples are input for training. The training environment is CPU: Intel(R) Core(TM) i7-10875H, memory: 32 GB, graphics card: NVIDIA GeForce RTX 2070, CUDA 11.6.112, with training parameters batch-size: 16, epochs: 500, workers: 4. The model training results are shown in FIG. 5.
The model achieves its best effect at training generation 456, with an accuracy rate of 91.2%, a recall rate of 82.8% and an mAP_0.5 of 88.06%; detailed indexes are shown in Table 1, where train/box_loss, train/obj_loss and train/cls_loss are the bounding-box loss, confidence loss and classification loss on the training set, and val/box_loss, val/obj_loss and val/cls_loss are the corresponding losses on the validation set. Because the training sample data are small, the model's performance has considerable room for improvement, and recognition accuracy can be further improved later by increasing the number of training samples.
Table 1 evaluation table of model performance indexes
After the YOLOv5 model is trained, the image to be detected is input, with the input picture size (img-size) set to 640, the confidence threshold (conf-thres) set to 0.3 and the NMS IoU threshold set to 0.5. The model automatically detects the region information and category of each document element; the region detection and recognition effects are shown in FIG. 6, where the regions detected for the contact telephone and mailbox elements are marked with rectangular boxes, and the element category label and confidence are given above each box. Note that because the distance between the contact telephone and mailbox elements is small, the label box may cover the original element content, but this does not affect the actual operation of the model.
4. Region segmentation identification module
To segment the document element regions, the Image module of the Python image processing library PIL is used: its crop() function is called with the element region coordinates (x1, y1, x2, y2) as input, and the segmented image is returned and stored as a JPEG image.
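The segmentation call can be sketched with Pillow (the maintained PIL fork); the function name and file paths are illustrative:

```python
from PIL import Image

def crop_element(image_path, box, out_path):
    """Crop one document element region. box is the converted
    (x1, y1, x2, y2) tuple; the cropped region is saved as JPEG."""
    with Image.open(image_path) as im:
        region = im.crop(box)  # PIL crop() takes (left, upper, right, lower)
        region.convert("RGB").save(out_path, "JPEG")
        return region.size     # (width, height) of the element image
```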
5. Element content matching module
The contents of the contact telephone and mailbox elements are sorted; since these elements contain keywords and punctuation, the keyword and the colon after it must be removed. For example, a "telephone: 13800000000" element is recognized as "telephone: 13800000000", so "telephone" and ":" must be removed, leaving "13800000000" as the extraction. Finally, the element categories to be extracted and the extracted contents form the keys and values of the dictionary data respectively, producing the formatted dictionary data output.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media can include, but is not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium, and, thus, is included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., infrared, radio, and microwave. Such discs (disk) and disks (disc) include compact disks, laser disks, optical disks, DVDs, floppy disks and blu-ray disks, where magnetic discs generally reproduce data magnetically, while disks generally reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An image-based method for extracting document element information, comprising:
acquiring documents generated by an organization, converting documents in different storage formats into images in a preset format, and preprocessing the images in the preset format to obtain preprocessed document images;
detecting each official document element area in the preprocessed official document image through a pre-trained detection model, cutting each official document element area to obtain a corresponding official document element image, and identifying text contents in each official document element image; wherein the document elements include at least one of: document copy number, secret grade and confidentiality period, urgency degree, issuing organization identification, issued character number, issuer, subject term, copying organization, contact person, telephone and mailbox;
and extracting corresponding document element contents aiming at the text contents in each identified document element image, and outputting all document element contents in the document according to a preset format.
2. The method for extracting information of an image-based official document element according to claim 1, wherein the converting of official documents in different storage formats into image formats comprises:
if the acquired official document is in a document type format, firstly converting the format of the official document into a PDF format, and then converting the PDF format into a preset format image with the image resolution within a preset range;
if the acquired official document is in the picture type format, unifying the format of the official document into a preset format image with the image resolution within a preset range.
3. The method for extracting information of an image-based official document element according to claim 2, wherein the pictures in the picture-like format include a shot picture, a scanned picture, and a copied picture;
the preprocessing the preset format image further comprises:
if the acquired official document is in the picture-type format, or is in the document-type format and contains pictures (for example the format is PDF and the PDF was not converted from a text document), adjusting the image contrast and the image brightness for each pixel point of the corresponding preset format image according to the image contrast parameter default value and the image brightness parameter default value; when the enhancement effect of the preprocessed official document image obtained with the default contrast and brightness parameter values cannot meet the preset requirement, adjusting the image contrast parameter value and the image brightness parameter value until the enhancement effect meets the preset requirement, and obtaining an enhanced image;
and removing noise points of the enhanced image by adopting median filtering to obtain a preprocessed document image with complete character edge reservation and no character information loss.
4. The image-based document element information extraction method according to claim 1, wherein the detecting, by a pre-trained detection model, each document element region in the preprocessed document image, and cutting each document element region to obtain a corresponding document element image, and recognizing text content in each document element image specifically comprises:
automatically detecting each document element region in the preprocessed document image and the corresponding document element categories through a pre-trained YOLOV5 model to form an output result for each document element region; the document element region output result is expressed as: (document element category, detection box center point x, detection box center point y, detection box width w, detection box height h, confidence);
converting the detection box center point x, center point y, width w and height h into the coordinate format (x1, y1, x2, y2) for segmenting the corresponding document element region, wherein x1, y1 are respectively the horizontal and vertical coordinate values of the top-left vertex of the document element region box, x2, y2 are respectively the horizontal and vertical coordinate values of the bottom-right vertex of the document element region box, and x1 = x - w/2; y1 = y - h/2; x2 = x + w/2; y2 = y + h/2;
Calling the crop () function of the Image module of the Python Image processing library PIL to obtain the coordinate values (x) of each document element region 1 ,y 1 ,x 2 ,y 2 ) Inputting a crop () function to divide each document element region, returning a corresponding document element image after the division is finished, and marking document element types and confidence degrees in the document element images;
and aiming at each document element image, identifying and extracting the text content in the document element image by adopting a pytesseract module of Python language, and returning the extracted text content in a txt file.
5. The image-based official document element information extraction method according to claim 4, wherein the extracting of the corresponding official document element content from the text content in each identified official document element image and the outputting of all the extracted official document element contents in the official document according to a preset format specifically comprises:
aiming at the txt file of each document element image, if the text content in the txt file does not contain keywords, directly taking the text content as the element content; if the text content in the txt file contains keywords, taking the text content with the keywords and redundant punctuation removed as the extracted element content; and extracting the element content in the txt file through the Python string replace function;
constructing a key value pair aiming at each extracted text content, expressing the document element category through a key of the key value pair, and expressing the extracted element content through the value of the key value pair; and composing all key-value pairs in the document into a dictionary and outputting the dictionary.
6. An image-based document element information extraction device, comprising:
the document preprocessing module is used for acquiring documents generated by an organization, converting the documents with different storage formats into images with preset formats and preprocessing the images with the preset formats to obtain preprocessed document images;
the official document element region segmentation module is used for detecting each official document element region in the preprocessed official document image through a pre-trained detection model, cutting each official document element region to obtain a corresponding official document element image, and identifying text contents in each official document element image; wherein the document elements include at least one of: number of copies of official documents, secret grade and confidentiality period, emergency degree, identification of issuing organization, number of issued characters, signer, subject term, copying organization, contact person, telephone and mailbox;
and the document element content extraction module is used for extracting corresponding document element contents aiming at the text contents in each identified document element image, and outputting all document element contents in the document according to a preset format.
7. The image-based document element information extraction device according to claim 6, wherein the document preprocessing module comprises a format conversion submodel, and the format conversion submodel is specifically configured to:
if the acquired official document is in a document type format, firstly converting the format of the official document into a PDF format, and then converting the PDF format into a preset format image with the image resolution within a preset range;
if the acquired official document is in the picture type format, unifying the format of the official document into a preset format image with the image resolution within a preset range.
8. The apparatus for extracting information on an image-based document element according to claim 7, wherein the pictures in the picture-like format include a photographed picture, a scanned picture, and a copied picture;
the official document preprocessing module further comprises:
the image enhancement submodel is used for, if the acquired official document is in the picture-type format, or is in the document-type format and contains pictures (for example the format is PDF and the PDF was not converted from a text document), adjusting the image contrast of each pixel point of the corresponding preset format image according to the image contrast parameter default value and adjusting the image brightness according to the image brightness parameter default value; and when the enhancement effect of the preprocessed official document image obtained with the default contrast and brightness parameter values cannot meet the preset requirement, adjusting the image contrast parameter value and the image brightness parameter value until the enhancement effect meets the preset requirement, and obtaining an enhanced image;
and the image denoising submodel is used for removing noise points of the enhanced image by adopting median filtering to obtain a preprocessed document image with complete character edge reservation and no character information loss.
9. The image-based document element information extraction device according to claim 6, wherein the document element region division module includes:
the document element region identifier model is used for automatically detecting each document element region in the preprocessed document image and the corresponding document element categories through a pre-trained YOLOV5 model to form an output result for each document element region; the document element region output result is expressed as: (document element category, detection box center point x, detection box center point y, detection box width w, detection box height h, confidence);
the document element region division submodel is used for converting the detection box center point x value, center point y value, width w and height h into the coordinate format (x1, y1, x2, y2) for dividing the corresponding document element region, wherein x1, y1 are respectively the horizontal and vertical coordinate values of the top-left vertex of the document element region box, x2, y2 are respectively the horizontal and vertical coordinate values of the bottom-right vertex of the document element region box, and x1 = x - w/2; y1 = y - h/2; x2 = x + w/2; y2 = y + h/2;
Calling the crop () function of the Image module of the Python Image processing library PIL to obtain the coordinate values (x) of each document element region 1 ,y 1 ,x 2 ,y 2 ) Inputting a crop () function to segment each document element region, returning a corresponding document element image after the segmentation is finished, and marking document element types and confidence degrees in the document element images;
and the document content extraction sub-model is used for identifying and extracting the text content in each document element image by adopting a pytesseract module of Python language and returning the extracted text content in a txt file.
10. The image-based document element information extraction device according to claim 9, wherein the document element content extraction module includes:
the extraction submodel is used for, for the txt file of each document element image, directly taking the text content as the element content if the text content in the txt file does not contain keywords; if the text content in the txt file contains keywords, taking the text content with the keywords and redundant punctuation removed as the extracted element content; and extracting the element content in the txt file through the Python string replace function;
and a content conversion and merging submodel, used for constructing a key-value pair for each piece of extracted text content, the key of the key-value pair representing the document element category and the value representing the extracted element content; all key-value pairs of the document are then combined into a dictionary and output.
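The conversion-and-merging submodel amounts to building one dictionary per document; a minimal sketch, with illustrative element category names:

```python
def merge_elements(extracted):
    """Combine (category, content) pairs into one dictionary:
    key = document element category, value = extracted element content."""
    return {category: content for category, content in extracted}
```

For example, `merge_elements([("title", "Annual Work Report"), ("issuing_organ", "Unit XX")])` produces a single dictionary covering the whole document.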
CN202210657233.3A 2022-06-10 2022-06-10 Image-based official document element information extraction method and device Pending CN115116079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210657233.3A CN115116079A (en) 2022-06-10 2022-06-10 Image-based official document element information extraction method and device

Publications (1)

Publication Number Publication Date
CN115116079A true CN115116079A (en) 2022-09-27

Family

ID=83325803

Country Status (1)

Country Link
CN (1) CN115116079A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination