CN113554027A - Method for calibrating and extracting text information of reimbursement receipt image

Method for calibrating and extracting text information of reimbursement receipt image

Info

Publication number
CN113554027A
CN113554027A (application number CN202110910082.3A)
Authority
CN
China
Prior art keywords
image
reimbursement
connected domain
document
user information
Prior art date
Legal status
Pending
Application number
CN202110910082.3A
Other languages
Chinese (zh)
Inventor
胡为民
郑喜
Current Assignee
Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Original Assignee
Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Priority to CN202110910082.3A
Publication of CN113554027A
Current legal status: Pending

Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • G06T3/4007 Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06T5/30 Image enhancement or restoration: erosion or dilatation, e.g. thinning
    • G06T5/70 Image enhancement or restoration: denoising; smoothing
    • G06T7/13 Image analysis: edge detection
    • G06T7/136 Image analysis: segmentation; edge detection involving thresholding
    • G06T2207/20036 Morphological image processing
    • G06T2207/20081 Training; learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30176 Subject of image: document

Abstract

A method for calibrating and extracting text information from reimbursement document images comprises the following steps: 1) a method for filtering noise from the reimbursement receipt image: based on an image preprocessing algorithm combining OTSU threshold segmentation and EDT distance transformation, noise such as seals, ink dots and wrinkles in the reimbursement document image is filtered out, and the filtered image is used as the input of the image text information calibration module. 2) a method for calibrating reimbursement image text information: a maximum connected domain algorithm extracts the user information connected domains from the reimbursement document image; a correlation matrix is constructed from the correspondence between the user information and the document fields, labelling the relevance between the reimbursement document user information and the template field connected domains; data enhancement is performed through perturbations such as random rotation, scaling and Gaussian noise, and an SSD network is trained as the reimbursement image text information calibration model; Tesseract then recognizes the optical character information in the document connected domains, realizing calibration and extraction of the reimbursement document image text information.

Description

Method for calibrating and extracting text information of reimbursement receipt image
Technical Field
The invention relates to the technical fields of image recognition and machine learning, and in particular to a method for calibrating and extracting text information from reimbursement receipt images.
Background
In management activities such as financial reconciliation, the consistency of the electronic document data in a financial information management system with the paper document data must be compared to ensure the correctness and authenticity of financial activities. In practice, however, electronic document data and paper documents frequently fail to match, which seriously reduces the efficiency of financial management. At present, the comparison of electronic and paper documents is mainly performed through manual auditing by financial staff, which is inefficient and still error-prone.
Optical character recognition technology is one of the core technologies for the electronization of paper documents and provides a feasible technical path for comparing paper documents with electronic documents, but some technical difficulties remain: (1) images of paper reimbursement documents contain considerable noise, such as stains produced during paper manufacturing, unclear ink marks produced when the reimbursement document is printed, and seals applied during auditing, all of which severely affect the accuracy of optical character recognition; (2) the printing of a paper reimbursement receipt is affected by the placement of the printing paper, so misalignment and even dislocation often occur, which makes accurate recognition and extraction of the receipt user information technically difficult.
Therefore, aiming at these technical problems, the invention realizes a method for calibrating and extracting text information from images of paper reimbursement receipts based on image recognition and machine learning technology. It extracts and aligns the user information in the paper receipt image, compares it with the electronic receipt in the financial system, and automatically checks the consistency of the paper and electronic receipts, reducing financial management cost, improving financial management efficiency, and offering high practical value.
Disclosure of Invention
Aiming at the noise introduced by formats and seals during the electronization of paper reimbursement documents, and at the dislocation between user information and template fields caused by printing, the invention calibrates and extracts the user information in reimbursement document images through image processing and recognition technologies, generates electronic documents for automatic checking, and improves the bill-checking efficiency of financial staff. The invention comprises:
(1) A method for filtering noise from reimbursement receipt images
(2) A method for calibrating reimbursement image text information
The specific steps are as follows:
(1) A method for filtering noise from reimbursement receipt images
The method is based on an image preprocessing algorithm combining OTSU threshold segmentation and EDT distance transformation: noise such as seals, ink dots and wrinkles in the reimbursement document image is filtered out, and the filtered image serves as the input of the image text information calibration method.
Pixel matrices of the three RGB color channels of the original image are extracted and binarized with the OTSU threshold segmentation algorithm to generate a binary mask matrix. The original image is filtered with this mask, retaining the dark parts that are likely reimbursement document characters and removing noise interference from color information such as seals.
EDT distance transformation is applied to the filtered image to generate a Euclidean-distance gray-scale image whose targets are the reimbursement document character pixels retained in the binary image; each pixel value of the gray-scale image is the distance to the nearest target point. The OTSU threshold segmentation algorithm is then applied again with the threshold set to half the stroke width of the document print, yielding the extracted binary image and realizing character refinement.
A morphological opening operation is performed on the refined binary image to eliminate isolated, narrow ink dots and seal noise, to separate adhesions at fine connections in the reimbursement document, such as those between characters and between characters and form frame lines, and to eliminate larger convex protrusions such as stroke ends.
Characters in the binary image are extracted with a contour extraction algorithm, and the minimum enclosing rectangle of each character connected domain is obtained with a maximum connected domain algorithm. An aspect-ratio threshold for the character rectangles filters out the remaining non-character connected domains, yielding the key fields and user information of the reimbursement receipt after noise filtering.
(2) Method for calibrating reimbursement receipt image text information
In order to realize the calibration and extraction of reimbursement document information, the invention provides semantic correlation registration of optical character connected domains based on an SSD network, together with reimbursement document image text information recognition based on Tesseract optical character recognition. The method takes the user information in a document as the detection target and extracts the user information connected domains from the reimbursement document image with a maximum connected domain algorithm; a correlation matrix is constructed from the correspondence between the user information and the document fields, labelling the relevance between the reimbursement document user information and the template field connected domains. Based on the connected domain correlation matrix, data enhancement is performed through perturbations such as random rotation, scaling, Gaussian noise and cropping, and an SSD network is trained as the reimbursement image text information calibration model for aligning and calibrating the user information and template fields. Tesseract then recognizes the optical character information in the document text boxes, the recognized text is compared with the electronic document data, and the reimbursement document image text information is calibrated and extracted.
The SSD target detection network comprises a backbone network and a multi-scale feature extraction network. The backbone is the convolutional neural network ResNet50 with its fully connected classification layers removed; the multi-scale feature extraction network is a multi-layer down-sampling network in which target boxes are extracted at different scales by outputting them at each down-sampling layer. The correlation extractor is composed of VGG16 and a fully connected output head: its input is a pair of candidate connected domain images, i.e. a feature map with 2 channels; after passing through VGG16, the feature map is flattened into a 1-dimensional feature vector and fed to the fully connected layers, which consist of a feature mixing layer of 512 neurons and an output layer of 1 neuron. The final network output is the correlation value of the candidate connected domain pair.
The SSD network input comprises two images, the user information and template field connected domain images extracted from the noise-filtered reimbursement document image, both scaled to 256×256 with a bilinear interpolation algorithm and normalized by min-max normalization. The SSD network output comprises the coordinates of the n candidate connected domains to be detected (n denotes the number of connected domains) and the n×n correlation matrix between the candidate connected domains.
The SSD network architecture comprises two branches. One branch is the correlation extractor, which computes the correlation coefficients between the user information and template field connected domains: it takes as input the n×n pairwise-paired connected domain images and outputs an n×n correlation matrix M, whose entries are the matching probabilities of each pair of user information and template field connected domains. The other branch connects to the output layer through a fully connected layer and outputs the text box positions in center point, width and height (x, y, w, h) format.
From the output correlation matrix M, the n values with the maximum correlation probability in the n columns are selected; these give the matching alignment relations between the n groups of user information and the template field connected domains. Rectangular local images are cut out of the original reimbursement document at the n user information connected domain positions (x, y, w, h) and input into a Tesseract model to recognize the optical character information within them, and the recognized text information is calibrated against the corresponding information in the electronic document.
Drawings
FIG. 1 is a schematic overall flow chart of the present invention.
FIG. 2 is a diagram of a network architecture for a reimbursement image text information calibration method.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings. As shown in Fig. 1, the invention provides semantic correlation registration of optical character connected domains based on an SSD network and a method for calibrating and extracting reimbursement document image text information based on Tesseract optical character recognition; the method comprises a reimbursement document image noise filtering method and a reimbursement image text information calibration method. The noise filtering method denoises the original reimbursement document image, filtering out noise such as seals, ink dots and wrinkles, and the filtered image is input to the reimbursement image text information calibration method. The calibration method calibrates the format of the reimbursement document image: as shown in Fig. 2, based on the SSD target detection network and the correlation extractor, a registration model of the user information and template fields in the reimbursement document image is trained in a multitask manner using template documents containing label information and reimbursement document images, and calibration is performed against the data in the financial information management system on the basis of Tesseract optical character recognition. The specific steps are as follows:
Step one: reimbursement document image binarization
Pixel matrices of the three RGB color channels of the original image are extracted and binarized with the OTSU threshold segmentation algorithm to generate a binary mask matrix. The original image is filtered with this mask, retaining the dark parts that are likely reimbursement document characters and removing noise interference from color information such as seals. The threshold segmentation formula is as follows:
img(x, y) = 1 if I(x, y) ≤ thresh, and img(x, y) = 0 otherwise

wherein I(x, y) is the pixel value of the original image at coordinates (x, y), img(x, y) is the pixel value of the binary image at those coordinates, and thresh is the threshold (set empirically according to the actual reimbursement document image).
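A minimal sketch of this step, assuming OpenCV as the image library and a grayscale conversion before thresholding (neither is specified in the patent text):

```python
import cv2
import numpy as np

def binarize_reimbursement_image(path: str) -> np.ndarray:
    """OTSU binarization keeping dark (likely character) pixels as foreground."""
    bgr = cv2.imread(path)                        # pixel matrices of the three color channels
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)  # assumed: collapse the channels to intensity
    # OTSU selects the threshold automatically; THRESH_BINARY_INV maps dark pixels to 255
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return mask
```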
Step two: text image extraction and refinement
EDT distance transformation is applied to the filtered image to generate a Euclidean-distance gray-scale image whose targets are the reimbursement document character pixels retained in the binary image; each pixel value of the gray-scale image is the distance to the nearest target point, given by the formula:

gray(p) = min_{q∈T} ED(p, q), p = p(x, y)

wherein gray(p) is any pixel of the gray-scale image, T is the set of all pixels with value 1 in the binary image, and ED is the Euclidean distance. After the distance transformation, the OTSU threshold segmentation algorithm is applied again with the threshold set to half the stroke width of the document print, yielding the extracted binary image and realizing character refinement.
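A sketch of the refinement step under one plausible reading: the patent's formula measures distance to the nearest retained character pixel, while this sketch uses OpenCV's distance-to-background transform inside the strokes, thresholded at half an assumed stroke width, which yields the thinning effect described:

```python
import cv2
import numpy as np

def refine_characters(mask: np.ndarray, stroke_width_px: float = 4.0) -> np.ndarray:
    """Euclidean distance transform, then threshold at half the stroke width."""
    # Distance from each foreground pixel to the nearest background pixel (Euclidean)
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    # Keep only pixels at least half a stroke width inside a character
    return (dist >= stroke_width_px / 2.0).astype(np.uint8) * 255
```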
Step three: image noise filtering and text adhesion separation
A morphological opening operation is performed on the refined binary image to eliminate isolated, narrow ink dots and seal noise, to separate adhesions at fine connections such as those between characters and between characters and form frame lines, and to eliminate larger convex protrusions such as stroke ends.
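A sketch of the opening operation; the 3×3 rectangular structuring element is an assumption, as the patent does not specify the kernel:

```python
import cv2

def open_binary_image(refined, kernel_size: int = 3):
    """Morphological opening: erosion followed by dilation."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    # Removes ink dots / seal fragments narrower than the kernel and breaks the
    # thin bridges between characters and between characters and frame lines
    return cv2.morphologyEx(refined, cv2.MORPH_OPEN, kernel)
```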
Step four: character connected domain extraction of key information
Characters in the binary image are extracted with a contour extraction algorithm, and the minimum enclosing rectangle of each character connected domain is obtained with a maximum connected domain algorithm. An aspect-ratio threshold for the character rectangles filters out the remaining non-character connected domains, yielding the key fields and user information of the reimbursement receipt after noise filtering.
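A sketch of the connected-domain extraction; the aspect-ratio bounds are illustrative assumptions, as the patent only states that a threshold is set:

```python
import cv2

def extract_char_boxes(binary, min_ar: float = 0.2, max_ar: float = 15.0):
    """Contour extraction, minimum enclosing rectangles, aspect-ratio filtering."""
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)      # minimum upright enclosing rectangle
        if min_ar <= w / float(h) <= max_ar:  # drop non-character connected domains
            boxes.append((x, y, w, h))
    return boxes
```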
Step five: correlation mark of reimbursement document image user information and template field connected domain
A training data set for the reimbursement bill image calibration model is constructed from common reimbursement bill image data; each bill image contains template fields, user information and coordinate labels, such as bill ID, date, address, amount and the like. A maximum connected domain algorithm extracts the user information connected domains from the reimbursement receipt image, and a correlation matrix is constructed from the correspondence between the user information and the receipt fields, labelling the relevance between the reimbursement receipt user information and the template field connected domains: the correlation between user information and its corresponding template field connected domain is set to 1, and that between non-corresponding connected domains is set to 0. A large number of corresponding connected domain coordinates and correlation labels are generated as the training set of the text information calibration model.
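A sketch of the label construction; the `pairs` annotation format is an illustrative assumption:

```python
import numpy as np

def build_correlation_labels(n_user: int, n_field: int, pairs):
    """0/1 correlation matrix: M[i, j] = 1 iff user-info domain i corresponds to field j."""
    M = np.zeros((n_user, n_field), dtype=np.float32)
    for i, j in pairs:   # annotated correspondences, e.g. [(0, 2), (1, 0), ...]
        M[i, j] = 1.0    # corresponding connected domains labelled 1, the rest 0
    return M
```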
Step six: training data enhancement
The reimbursement receipt images in the training set are perturbed with random rotation, scaling, Gaussian noise, cropping and the like for data enhancement, so as to reduce overfitting and improve the accuracy of model calibration.
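A sketch of the perturbations, assuming a single-channel document image; all parameter ranges are assumptions, as the patent names only the perturbation types:

```python
import cv2
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random rotation, scaling, Gaussian noise and cropping for data enhancement."""
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-5, 5), rng.uniform(0.9, 1.1))
    out = cv2.warpAffine(img, M, (w, h), borderValue=255)                       # rotation + scaling
    out = np.clip(out + rng.normal(0, 5, out.shape), 0, 255).astype(np.uint8)   # Gaussian noise
    x0, y0 = int(rng.integers(0, w // 20 + 1)), int(rng.integers(0, h // 20 + 1))
    return out[y0:h - y0, x0:w - x0]                                            # light random crop
```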
Step seven: reimbursement document calibration model training based on SSD network
An SSD network is used to train the reimbursement image text information calibration model. The SSD network input comprises two images, the user information and template field connected domain images extracted from the noise-filtered reimbursement document image, both scaled to 256×256 with a bilinear interpolation algorithm and normalized by min-max normalization; the SSD network output comprises the coordinates of the n candidate connected domains to be detected (n denotes the number of connected domains) and the n×n correlation matrix between the candidate connected domains.
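A sketch of the input preprocessing; the epsilon guard against constant images is an added safeguard, not from the patent:

```python
import cv2
import numpy as np

def to_network_input(domain_img: np.ndarray) -> np.ndarray:
    """Bilinear resize to 256x256 followed by min-max normalization to [0, 1]."""
    resized = cv2.resize(domain_img, (256, 256), interpolation=cv2.INTER_LINEAR)
    resized = resized.astype(np.float32)
    lo, hi = float(resized.min()), float(resized.max())
    return (resized - lo) / (hi - lo + 1e-8)
```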
The SSD network extracts n feature maps from the n connected domains and performs alignment calibration of user information and template fields through an added correlation extractor. The SSD network architecture comprises two branches: one branch is the correlation extractor, which computes the correlation coefficients between the user information and template field connected domains, takes as input the n×n pairwise-paired connected domain images, and outputs an n×n correlation matrix M whose entries are the matching probabilities of each pair of user information and template field connected domains; the other branch connects to the output layer through a fully connected layer and outputs the text box positions in center point, width and height (x, y, w, h) format.
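A sketch of the correlation extractor branch in PyTorch; the 2-channel input, the 512-neuron mixing layer and the 1-neuron output follow the architecture described above, while the torchvision trunk, the replaced first convolution and the sigmoid output are assumptions of this sketch:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CorrelationExtractor(nn.Module):
    """VGG16 trunk over a paired 2-channel connected-domain image."""

    def __init__(self):
        super().__init__()
        trunk = vgg16(weights=None).features
        trunk[0] = nn.Conv2d(2, 64, kernel_size=3, padding=1)  # paired input has 2 channels
        self.trunk = trunk
        self.head = nn.Sequential(
            nn.Flatten(),                 # 512 x 8 x 8 feature map for 256x256 inputs
            nn.Linear(512 * 8 * 8, 512),  # feature mixing layer of 512 neurons
            nn.ReLU(inplace=True),
            nn.Linear(512, 1),            # single-neuron correlation output
            nn.Sigmoid(),                 # matching probability in [0, 1]
        )

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        # pair: (batch, 2, 256, 256) -> (batch, 1) correlation value
        return self.head(self.trunk(pair))
```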
From the output correlation matrix M, the n values with the maximum correlation probability in the n columns are selected; these give the matching alignment relations between the n groups of user information and the template field connected domains. The matching is computed as follows:

i*(j) = argmax_i M(i, j), j = 1, …, n

wherein i*(j) indexes the matched user information connected domain, and the argmax function gives the row index of the maximum value in column j.
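In code, the column-wise selection is a plain argmax (a sketch):

```python
import numpy as np

def match_columns(M: np.ndarray) -> dict:
    """For each template-field column j, the row index of the maximum probability."""
    return {j: int(np.argmax(M[:, j])) for j in range(M.shape[1])}
```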
Regression training of the candidate connected domains uses the Smooth L1 loss function; each column of the correlation matrix is passed through a Softmax operation, and classification training uses the cross-entropy loss function. The training loss formulas are as follows:

L_loc = Σ_i SmoothL1(ĝ_i − g_i)

p(i, j) = exp(c(i, j)) / Σ_k exp(c(k, j))

L_conf = − Σ_(i, j) y(i, j) · log p(i, j)

wherein: y(i, j) is an indicator function (0 or 1), equal to 1 if the pair refers to the same text box and 0 otherwise; ĝ denotes a predicted candidate box; g denotes a ground-truth candidate box; c denotes the predicted correlation; and (i, j) denote the row and column indices of the correlation matrix.
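A sketch of the combined loss; the equal weighting of the two terms is an assumption, as the patent does not state the weights:

```python
import torch
import torch.nn.functional as F

def calibration_loss(pred_boxes, gt_boxes, corr_logits, corr_labels):
    """Smooth L1 box regression plus column-wise softmax cross-entropy."""
    loc = F.smooth_l1_loss(pred_boxes, gt_boxes)                 # L_loc over candidate boxes
    log_p = F.log_softmax(corr_logits, dim=0)                    # Softmax over each column
    conf = -(corr_labels * log_p).sum() / corr_labels.shape[1]   # mean per-column cross entropy
    return loc + conf
```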
Step eight: extraction and comparison of reimbursement receipt image information
Rectangular local images are cut out of the original reimbursement document image at the n user information connected domain positions (x, y, w, h) and input into a Tesseract model to recognize the optical character information within them, and the recognized text information is calibrated against the corresponding data in the electronic document of the financial system.
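A sketch of this final step; the `electronic_doc` dictionary keyed by field name and the chi_sim language setting are illustrative assumptions:

```python
import pytesseract

def extract_and_compare(doc_img, matched_boxes: dict, electronic_doc: dict) -> dict:
    """Crop each matched box, OCR it with Tesseract, compare with the system record."""
    results = {}
    for field, (x, y, w, h) in matched_boxes.items():
        crop = doc_img[y:y + h, x:x + w]   # rectangular local image of the connected domain
        text = pytesseract.image_to_string(crop, lang="chi_sim").strip()
        results[field] = (text, text == str(electronic_doc.get(field, "")))
    return results
```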

Claims (3)

1. A method for calibrating and extracting text information of reimbursement document images is characterized by comprising the following steps:
1) a method for filtering noise from a reimbursement receipt image: based on an image preprocessing algorithm combining OTSU threshold segmentation and EDT distance transformation, noise such as seals, ink dots and wrinkles in the reimbursement document image is filtered out, and the filtered image is used as the input of the image text information calibration module;
2) a method for calibrating reimbursement image text information: taking the user information in a document as the detection target, a maximum connected domain algorithm extracts the user information connected domains from the reimbursement document image; a correlation matrix is constructed from the correspondence between the user information and the document fields, labelling the relevance between the reimbursement document user information and the template field connected domains; based on the connected domain correlation matrix, data enhancement is performed through perturbations such as random rotation, scaling, Gaussian noise and cropping, and an SSD network is trained as the reimbursement image text information calibration model for aligning and calibrating the user information and template fields; Tesseract then recognizes the optical character information in the document text boxes, the recognized text is compared with the electronic document data, and the reimbursement document image text information is thereby calibrated and extracted.
2. The method for calibrating and extracting text information of reimbursement document images according to claim 1, characterized in that the method for filtering noise from the reimbursement receipt image comprises the following steps:
1) extracting the pixel matrices of the three RGB color channels of the original image, performing binarization with the OTSU threshold segmentation algorithm to generate a binary mask matrix, filtering the original image, retaining the dark character parts of the image, and removing noise interference from color information such as seals;
2) applying EDT distance transformation to the filtered image, applying the OTSU threshold segmentation algorithm again with the threshold set to half the stroke width of the document print, obtaining the extracted binary image, realizing character refinement, and removing object adhesions;
3) extracting the characters in the binary image with a contour extraction algorithm, obtaining the minimum enclosing rectangle of each character connected domain with a maximum connected domain algorithm, and obtaining the key fields and user information of the reimbursement receipt after noise filtering.
3. The method for calibrating and extracting text information of reimbursement document images according to claim 1, characterized in that the method for calibrating reimbursement image text information comprises the following steps:
1) constructing a target connected domain label training set: document image data are generated from a variety of common templates, each document image containing several fields and their coordinate labels, corresponding to user information such as document ID, date, address and the like; a maximum connected domain algorithm extracts the user information connected domains from the reimbursement receipt image, and a correlation matrix is constructed from the correspondence between the user information and the receipt fields, labelling the relevance between the reimbursement receipt user information and the template field connected domains, with the correlation between user information and its corresponding template field connected domain set to 1 and that between non-corresponding connected domains set to 0; a large number of corresponding connected domain coordinates and correlation labels are generated as the training set of the text information calibration model;
2) performing perturbations such as random rotation, scaling, Gaussian noise and cropping on the images in the training set for data enhancement, so as to reduce overfitting and improve the accuracy of model calibration;
3) training the reimbursement image text information calibration model with an SSD network: the SSD network input comprises two images, the user information and template field connected domain images extracted from the noise-filtered reimbursement document image, both scaled to 256×256 with a bilinear interpolation algorithm and normalized by min-max normalization; the SSD network output comprises the coordinates of the n candidate connected domains to be detected (n denotes the number of connected domains) and the n×n correlation matrix between the candidate connected domains; regression training of the candidate connected domains uses the Smooth L1 loss function;
4) the SSD network extracts n feature maps from the n connected domains and performs alignment calibration of user information and template fields through an added correlation extractor; the SSD network architecture comprises two branches: one branch is the correlation extractor, which computes the correlation coefficients between the user information and template field connected domains, takes as input the n×n pairwise-paired connected domain images, and outputs an n×n correlation matrix M whose entries are the matching probabilities of each pair of user information and template field connected domains; the other branch connects to the output layer through a fully connected layer and outputs the text box positions in center point, width and height (x, y, w, h) format;
5) after the SSD network training converges, selecting from the output correlation matrix M the n values with the maximum correlation probability in the n columns, which give the matching alignment relations between the n groups of user information and the template field connected domains;
6) cutting rectangular local images out of the original reimbursement document at the n user information connected domain positions (x, y, w, h), inputting them into a Tesseract model to recognize the optical character information within them, and calibrating the recognized text information against the corresponding information in the electronic document.
CN202110910082.3A 2021-08-09 2021-08-09 Method for calibrating and extracting text information of reimbursement receipt image Pending CN113554027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110910082.3A CN113554027A (en) 2021-08-09 2021-08-09 Method for calibrating and extracting text information of reimbursement receipt image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110910082.3A CN113554027A (en) 2021-08-09 2021-08-09 Method for calibrating and extracting text information of reimbursement receipt image

Publications (1)

Publication Number Publication Date
CN113554027A 2021-10-26

Family

ID=78106075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110910082.3A Pending CN113554027A (en) 2021-08-09 2021-08-09 Method for calibrating and extracting text information of reimbursement receipt image

Country Status (1)

Country Link
CN (1) CN113554027A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850249A (en) * 2021-12-01 2021-12-28 深圳市迪博企业风险管理技术有限公司 Method for formatting and extracting chart information
CN114627730A (en) * 2022-03-31 2022-06-14 北京科技大学 Braille electronic book
CN117237440A (en) * 2023-10-10 2023-12-15 北京惠朗时代科技有限公司 Image calibration method for printing control instrument
CN117373030A (en) * 2023-06-19 2024-01-09 上海简答数据科技有限公司 OCR-based user material identification method, system, device and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006121410A1 (en) * 2005-05-11 2006-11-16 Agency For Science, Technology And Research Method, apparatus and computer software for segmenting the brain from mr data
CN102708356A (en) * 2012-03-09 2012-10-03 沈阳工业大学 Automatic license plate positioning and recognition method based on complex background
CN105184292A (en) * 2015-08-26 2015-12-23 北京云江科技有限公司 Method for analyzing and recognizing structure of handwritten mathematical formula in natural scene image
CN106156761A (en) * 2016-08-10 2016-11-23 北京交通大学 The image form detection of facing moving terminal shooting and recognition methods
US20180211380A1 (en) * 2017-01-25 2018-07-26 Athelas Inc. Classifying biological samples using automated image analysis
CN109376658A (en) * 2018-10-26 2019-02-22 信雅达系统工程股份有限公司 A kind of OCR method based on deep learning
CN110047083A (en) * 2019-04-01 2019-07-23 江西博微新技术有限公司 Image noise recognition methods, server and storage medium
CN110598566A (en) * 2019-08-16 2019-12-20 深圳中兴网信科技有限公司 Image processing method, device, terminal and computer readable storage medium
CN110619327A (en) * 2018-06-20 2019-12-27 湖南省瞬渺通信技术有限公司 Real-time license plate recognition method based on deep learning in complex scene
CN111340024A (en) * 2020-02-27 2020-06-26 深圳市赤狐软件技术有限公司 Electronic document management method and device, computer equipment and storage medium
CN111611933A (en) * 2020-05-22 2020-09-01 中国科学院自动化研究所 Information extraction method and system for document image

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006121410A1 (en) * 2005-05-11 2006-11-16 Agency For Science, Technology And Research Method, apparatus and computer software for segmenting the brain from mr data
CN102708356A (en) * 2012-03-09 2012-10-03 沈阳工业大学 Automatic license plate positioning and recognition method based on complex background
CN105184292A (en) * 2015-08-26 2015-12-23 北京云江科技有限公司 Method for analyzing and recognizing structure of handwritten mathematical formula in natural scene image
CN106156761A (en) * 2016-08-10 2016-11-23 北京交通大学 The image form detection of facing moving terminal shooting and recognition methods
US20180211380A1 (en) * 2017-01-25 2018-07-26 Athelas Inc. Classifying biological samples using automated image analysis
CN110619327A (en) * 2018-06-20 2019-12-27 湖南省瞬渺通信技术有限公司 Real-time license plate recognition method based on deep learning in complex scene
CN109376658A (en) * 2018-10-26 2019-02-22 信雅达系统工程股份有限公司 A kind of OCR method based on deep learning
CN110047083A (en) * 2019-04-01 2019-07-23 江西博微新技术有限公司 Image noise recognition methods, server and storage medium
CN110598566A (en) * 2019-08-16 2019-12-20 深圳中兴网信科技有限公司 Image processing method, device, terminal and computer readable storage medium
CN111340024A (en) * 2020-02-27 2020-06-26 深圳市赤狐软件技术有限公司 Electronic document management method and device, computer equipment and storage medium
CN111611933A (en) * 2020-05-22 2020-09-01 中国科学院自动化研究所 Information extraction method and system for document image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, Peiyuan: "Design and Implementation of a Taxi Invoice Information Recognition Algorithm", China Master's Theses Full-text Database, Information Science and Technology, no. 2021, 15 May 2021 (2021-05-15), pages 138-1684 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850249A (en) * 2021-12-01 2021-12-28 深圳市迪博企业风险管理技术有限公司 Method for formatting and extracting chart information
CN114627730A (en) * 2022-03-31 2022-06-14 北京科技大学 Braille electronic book
CN117373030A (en) * 2023-06-19 2024-01-09 上海简答数据科技有限公司 OCR-based user material identification method, system, device and medium
CN117237440A (en) * 2023-10-10 2023-12-15 北京惠朗时代科技有限公司 Image calibration method for printing control instrument
CN117237440B (en) * 2023-10-10 2024-03-15 北京惠朗时代科技有限公司 Image calibration method for printing control instrument

Similar Documents

Publication Publication Date Title
CN113554027A (en) Method for calibrating and extracting text information of reimbursement receipt image
Khan et al. Urdu optical character recognition systems: Present contributions and future directions
Sharma et al. Character recognition using neural network
Shelke et al. A multistage handwritten Marathi compound character recognition scheme using neural networks and wavelet features
Manigandan et al. Tamil character recognition from ancient epigraphical inscription using OCR and NLP
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
CN111523622B (en) Method for simulating handwriting by mechanical arm based on characteristic image self-learning
Lutf et al. Arabic font recognition based on diacritics features
CN114898472B (en) Signature identification method and system based on twin vision transducer network
CN115620312A (en) Cross-modal character handwriting verification method, system, equipment and storage medium
CN115880566A (en) Intelligent marking system based on visual analysis
CN113971805A (en) Intelligent marking and scoring method combining machine vision and semantic analysis
Mehran et al. A front-end OCR for omni-font Persian/Arabic cursive printed documents
Al Ghamdi A novel approach to printed Arabic optical character recognition
Aravinda et al. Template matching method for Kannada handwritten recognition based on correlation analysis
Patel et al. Multiresolution technique to handwritten English character recognition using learning rule and Euclidean distance metric
Bhuvneswari et al. Recognition of ancient stone inscription characters using histogram of oriented gradients
Kibria Bengali optical character recognition using self organizing map
CN111325270B (en) Dongba text recognition method based on template matching and BP neural network
Islam et al. Towards building a bangla text recognition solution with a multi-headed cnn architecture
Ajao et al. Yoruba handwriting word recognition quality evaluation of preprocessing attributes using information theory approach
CN111046874A (en) Single number identification method based on template matching
Sharma et al. CDRAMM: Character And Digit Recognition Aided by Mathematical Morphology
Sahu et al. Survey and analysis of devnagari character recognition techniques using neural networks
Fadl et al. Automatic fake document identification and localization using DE-Net and color-based features of foreign inks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination