CN114328845A - Automatic structuralization method and system for key information of document image - Google Patents


Info

Publication number
CN114328845A
Authority
CN
China
Prior art keywords
text
matrix
hidden layer
model
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210249964.4A
Other languages
Chinese (zh)
Other versions
CN114328845B (en)
Inventor
王燚
王伟
饶顶锋
陶坚坚
刘伟
Current Assignee
Beijing Yitu Zhixun Technology Co ltd
Original Assignee
Beijing Yitu Zhixun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yitu Zhixun Technology Co ltd filed Critical Beijing Yitu Zhixun Technology Co ltd
Priority to CN202210249964.4A
Publication of CN114328845A
Application granted
Publication of CN114328845B
Active legal status
Anticipated expiration legal status

Landscapes

  • Character Discrimination (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method and a system for automatically structuring key information of a document image, belonging to the technical field of character recognition. The method recognizes the characters in a file through optical character recognition and arranges them into text blocks; the text blocks are then segmented by a text segmentation model and its dictionary, classified by a text classification model and its dictionary, and finally predicted by a prediction model and its dictionary, and key-value pair data conforming to the rules are extracted according to the prediction results; the extracted structured data are displayed after preset format processing. The invention can recognize any file type and automatically output structured results. It is suitable for most common voucher reports of various styles, such as list type and form type, can adapt to the complex scenarios of various voucher reports, and completes automatic structured output uniformly without requiring method configuration or adjustment by the user.

Description

Automatic structuralization method and system for key information of document image
Technical Field
The invention relates to the technical field of character recognition, in particular to a method and a system for automatically structuring key information of a document image.
Background
Computer character recognition, commonly known as optical character recognition (OCR), is a technology that uses optical and computer techniques to extract the characters on a document image in text form and convert them into a format that people can understand. In the information age, a large amount of bill, form and certificate data is generated every day, and this data needs to be digitized, extracted and entered using optical character recognition technology.
With the development of industry and the maturing of the technology, optical character recognition is now applied in many industries, such as parcel sorting in the express logistics field, license plate recognition in the traffic field, and check and document recognition and entry in the financial field. The optical character recognition result is typically a semi-structured output arranged in rows.
Generally, the result of optical character recognition is simple line-by-line text, which makes business processing difficult. If a key-value structure could be recognized instead, processing would be much easier. For example, when recognizing a train ticket, the kinds of information on the ticket are fixed; if the recognized line-by-line text is processed directly, each field has to be cut out of the text, which is troublesome.
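To make the contrast concrete, the following toy Python sketch (all field names and values are invented for illustration) shows why row-by-row OCR output forces ad-hoc field cutting, while a key-value result can be consumed directly:

```python
# Hypothetical illustration: the same train-ticket OCR result as raw
# line-by-line text versus a structured key-value extraction.
raw_lines = [
    "Beijing South -> Shanghai Hongqiao",
    "G101  2022-03-15 06:44",
    "Seat 05A  Second class  553.0 CNY",
]

def naive_field_cut(lines):
    """Cutting fields out of raw lines needs ad-hoc parsing per layout."""
    train_no, date, time = lines[1].split()
    return {"train": train_no, "date": date, "time": time}

structured = {  # what a key-value recognizer would return directly
    "train": "G101", "date": "2022-03-15", "time": "06:44",
}

assert naive_field_cut(raw_lines) == structured
```

The naive parser breaks as soon as the ticket layout changes, which is exactly the fragility the structured output avoids.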
The prior art has the following defects:
1. the optical character recognition processing method can only realize structured output aiming at the text content of a fixed type, and cannot realize fully automatic structured output;
2. there is a limit to the type of input file, which needs to be a preset file type.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an automatic structuring method and system for key information of a document image. The method recognizes the characters in a file through optical character recognition and arranges them into text blocks; the text blocks are then segmented by a text segmentation model and its dictionary, classified by a text classification model and its dictionary, and predicted by a prediction model and its dictionary, and key-value pair data conforming to the rules are extracted according to the prediction results; the extracted structured data are displayed after preset format processing. The invention can recognize any file type and automatically output structured results. It is suitable for most common voucher reports of various styles, such as list type and form type, can adapt to the complex scenarios of various voucher reports, and completes automatic structured output uniformly without requiring method configuration or adjustment by the user.
The invention provides an automatic structuring method of key information of a document image, which comprises the following steps:
s100: acquiring sample image data of a document;
s300: carrying out direction correction and gradient correction preprocessing on the sample image;
s400: recognizing the characters in the sample image using optical character recognition, and arranging the characters into text form by row; this step also includes filtering out seals;
s500: preprocessing a text to obtain text data taking a text block as a unit;
s600: combining the file data in units of text blocks with the model dictionary of the text segmentation model, converting each text block into a number sequence, obtaining the mask sequence, segment sequence and label sequence corresponding to each number sequence, inputting the number sequences into a machine learning model for processing, restoring the model output according to the mask sequences to obtain the processing result of each text block, and segmenting the text blocks according to the label sequences;
s700: classifying the segmented texts according to a text classification model dictionary, and integrating the classified texts into a one-dimensional array;
s800: generating structured extraction input information according to the distance, the width and the height among the text blocks and the one-dimensional array output in S700, inputting the structured extraction input information into a structured extraction model, predicting the text blocks, and extracting key value pair data which accord with rules according to a prediction result;
s900: and displaying the extracted structured data after the extracted structured data is subjected to preset format processing.
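The overall control flow of steps S100 to S900 can be sketched as below. Every helper is a hypothetical toy stand-in (identity or canned-data implementations), not the patented models:

```python
def acquire_sample_images(path):            # S100: one toy "image" per page
    return [{"page": 1, "path": path}]

def correct_orientation_and_skew(img):      # S300: no-op placeholder
    return img

def ocr_lines(img):                         # S400: canned OCR lines, seals filtered
    return ["Invoice No: 12345"]

def lines_to_text_blocks(lines):            # S500: one block per line
    return [{"text": t} for t in lines]

def segment_blocks(blocks):                 # S600: toy split on ": "
    out = []
    for b in blocks:
        out += [{"text": p} for p in b["text"].split(": ")]
    return out

def classify_blocks(blocks):                # S700: crude key/value guess
    return ["value" if b["text"].isdigit() else "key" for b in blocks]

def extract_key_values(blocks, classes):    # S800: pair each key with next value
    pairs, key = {}, None
    for b, c in zip(blocks, classes):
        if c == "key":
            key = b["text"]
        elif key is not None:
            pairs[key] = b["text"]
            key = None
    return pairs

def structure_document(path):               # S100-S900 end to end
    result = []
    for img in acquire_sample_images(path):
        img = correct_orientation_and_skew(img)
        blocks = lines_to_text_blocks(ocr_lines(img))
        blocks = segment_blocks(blocks)
        result.append(extract_key_values(blocks, classify_blocks(blocks)))
    return result

result = structure_document("sample.pdf")
```

Calling `structure_document("sample.pdf")` on the canned data yields `[{"Invoice No": "12345"}]`, mirroring the key-value output the method produces from a real document.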
Preferably, step S100 includes the steps of:
s101: reading file data of files in various file formats;
s102: the file is divided into single pages by setting the ID of each page of file data in the file, and then each single page is converted into image data.
Preferably, between S100 and S300, S200 is further included, and the general text recognition model, the text segmentation model, the text classification model, the text structured extraction model and the configuration files thereof are loaded, and are respectively used for text recognition, text segmentation, text classification, and text structured extraction.
Preferably, step S300 specifically includes the following steps:
s301: judging whether the image is landscape or portrait through layout analysis, judging whether it is upright or inverted through optical character recognition, and making the directions of the images consistent through image rotation;
s302: calculating the image tilt angle using frame line information or text information; specifically, the general text recognition model checks whether the tilt of the characters and frame lines falls within a preset normal tilt range, the tilt angle is calculated by comparison with the preset normal arrangement of the characters, and the image is rotated by the tilt angle to eliminate the tilt.
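As a simplified illustration of the tilt-angle calculation in s302, the sketch below estimates a skew angle by least-squares fitting of text-baseline points; the fitting rule is an assumption for illustration, not the patent's exact method:

```python
import math

def estimate_skew_angle(points):
    """Least-squares slope of text-baseline points -> skew angle in degrees.
    A toy stand-in for s302's frame-line/text-based tilt estimation."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return math.degrees(math.atan2(num, den))

# A baseline rising 1 px every 10 px corresponds to ~5.71 degrees of skew.
pts = [(i * 10, i * 1.0) for i in range(10)]
angle = estimate_skew_angle(pts)
```

The image would then be rotated by `-angle` to eliminate the tilt.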
Preferably, step S500 specifically includes:
s501: performing text analysis on the text after the optical character recognition and the arrangement, and performing the following processing on the contents of all text blocks, including:
restoring the word order through the relative position;
removing part of illegal characters;
clearing an empty text block;
arranging the texts after optical character recognition in row units from left to right and from top to bottom according to the position information; if a table is present, arranging the texts in units of cells;
s502: further processing the row-sorted text data obtained in step S501, dividing it into blocks according to changes in the spacing between texts within a row, to obtain a group of text data in units of text blocks, where the number of text blocks is denoted S_N.
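The gap-based block division of s502 can be sketched as follows; the median-gap threshold rule is an assumption for illustration, not the patent's exact criterion:

```python
def split_line_into_blocks(tokens, gap_factor=2.0):
    """Split one OCR line into text blocks where the horizontal gap between
    neighbouring tokens jumps well above the median gap (a sketch of s502)."""
    if len(tokens) < 2:
        return [tokens]
    # gap = next token's left edge minus previous token's right edge
    gaps = [tokens[i + 1][0] - (tokens[i][0] + tokens[i][1])
            for i in range(len(tokens) - 1)]
    median = sorted(gaps)[len(gaps) // 2]
    blocks, current = [], [tokens[0]]
    for gap, tok in zip(gaps, tokens[1:]):
        if median > 0 and gap > gap_factor * median:
            blocks.append(current)
            current = []
        current.append(tok)
    blocks.append(current)
    return blocks

# tokens: (x, width, text); a wide gap separates "Name: Li" from "Date: 2022"
line = [(0, 40, "Name:"), (45, 20, "Li"), (200, 40, "Date:"), (245, 40, "2022")]
blocks = split_line_into_blocks(line)
```

On this toy line the 135 px gap exceeds twice the 5 px median, so the line splits into two text blocks.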
Preferably, step S600 specifically includes the following steps:
s601: loading the text segmentation model configuration, the text segmentation model dictionary and the segmentation pre-training word vectors, where the word vector dimension is denoted S_D and the text segmentation model dictionary contains S_K labels defining word parts of speech;
s602: converting the S_N text blocks in the text data obtained in S502 into corresponding dictionary indexes character by character through the text segmentation model dictionary, converting the per-character index array of each of the S_N text blocks into the S_N groups of first number sequences s_data1_in according to the longest sequence length S_L preset for the text segmentation model, and constructing the first mask sequence s_data1_mask, first segment sequence s_data1_segment and first label sequence s_data1_label of each of the S_N groups of first number sequences s_data1_in;
s603: sequentially selecting S_M groups of first number sequences from the S_N groups according to the batch length S_M preset for the text segmentation model, and combining each number sequence in the S_M groups with its corresponding first mask sequence s_data1_mask, first segment sequence s_data1_segment and first label sequence s_data1_label into a one-dimensional array of length S_M × S_L × 4, used as the first input s_input1_ids of a single run of the text segmentation model; when S_N > S_M, multiple single-run first inputs s_input1_ids are generated, and when S_N is not an integer multiple of S_M, or S_N < S_M, the first input s_input1_ids of the last run is generated with a data amount smaller than S_M;
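Steps s602 and s603 (dictionary indexing, padding, mask construction and batching) can be sketched as below; the vocabulary, labels, S_L and S_M values are invented for illustration:

```python
def encode_block(text, vocab, seq_len):
    """Char-to-index conversion with padding, plus the mask/segment/label
    companions of s602 (labels zeroed here; a simplified sketch)."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text][:seq_len]
    mask = [1] * len(ids) + [0] * (seq_len - len(ids))
    ids += [vocab["<pad>"]] * (seq_len - len(ids))
    segment = [0] * seq_len
    label = [0] * seq_len
    return ids, mask, segment, label

def batch(seqs, batch_size):
    """Group encoded sequences into runs of at most batch_size (s603);
    the final batch is smaller when the count is not a multiple."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]

vocab = {"<pad>": 0, "<unk>": 1, "a": 2, "b": 3}
encoded = [encode_block(t, vocab, 4) for t in ["ab", "ba", "aab"]]
batches = batch(encoded, 2)
```

With S_N = 3 and S_M = 2 this yields two batches, the last holding a single sequence, matching the "last run smaller than S_M" case in s603.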
s604: inputting each single-run first input s_input1_ids into the text segmentation model and performing the following processing: applying a linear operation to the first input s_input1_ids and an initial variable in the text segmentation model to obtain an S_L × S_M hidden layer matrix s_mat1, then operating on the hidden layer matrix s_mat1, the first input s_input1_ids and a random hidden layer state matrix s_mat5 together with the internal parameters of the text segmentation model to obtain S_L × S_M hidden layer matrices s_mat2, s_mat3 and s_mat4 respectively, providing the hidden layer matrix s_mat4 to the first input s_input1_ids of the next run, repeating this process, and replacing the hidden layer matrix s_mat5 with the hidden layer matrix s_mat4 during processing;
s605: applying text segmentation model word vector embedding and encoding/decoding to the hidden layer matrices s_mat1, s_mat2 and s_mat3 to obtain S_K abstracted feature vector matrices s_mats1, completing the preliminary extraction of vector features;
s606: multiplying the s_mats1 output in s605 by the S_K weight matrices s_w_mats in the text segmentation model to obtain S_K new hidden layer matrices s_mats2;
s607: connecting the hidden layer matrices s_mats2 from s606, then compressing and reducing dimensions to obtain an S_L × S_K compressed matrix s_squeeze1, further extracting vector features;
s608: processing the hidden layer matrices s_mats2 from s606 and the compressed matrix s_squeeze1 from s607 through a recurrent neural network to obtain an S_L × (S_M−1) × S_K tensor s_mat6 and an S_L × S_K matrix s_mat7;
s609: performing a maximum-value dimensionality-reduction operation on the matrix s_mat7 combined with the dimension value S_D to obtain a one-dimensional vector s_expand of length S_M;
s610: processing the tensor s_mat6 through a recurrent neural network, then compressing and reducing dimensions to obtain an S_L × (S_M−1) compressed matrix s_squeeze2;
s611: connecting the compressed matrix s_squeeze2 with the vector s_expand to obtain an S_L × S_M first result matrix s_mat_rst1;
s612: summing and dimension-reducing the hidden layer matrix s_mat2 from s604 to obtain a number sequence of length S_L, and inverting the first result matrix s_mat_rst1 obtained in step s611 according to this number sequence to obtain a one-dimensional array s_mat1_result of length S_L × S_M;
s613: following the process of s604 to s612, after batch processing is complete the text segmentation model integrates all first inputs s_input1_ids into a result array s_mat1_results of S_L × S_N, used as the output of the text segmentation model;
s614: dividing the result array s_mat1_results output by the text segmentation model into S_N one-dimensional result arrays of length S_L, and restoring the processing result s_result of the corresponding text block one by one according to the values in the mask sequence s_data1_mask corresponding to each one-dimensional result array;
s615: determining the segmentation points of each text block from the label values in its processing result s_result, completing the segmentation of each text block according to the segmentation point positions, and denoting the number of text blocks after processing by C_N.
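The label-driven segmentation of s615 can be illustrated with this minimal sketch, assuming (as an illustration only) that tag value 1 marks "split after this character":

```python
def split_by_labels(text, labels, split_tag=1):
    """Cut a text block at every position labelled as a segmentation point
    (a sketch of s615; the tag convention is an assumption)."""
    parts, start = [], 0
    for i, tag in enumerate(labels[:len(text)]):
        if tag == split_tag:
            parts.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        parts.append(text[start:])
    return parts

# The model predicts a split after the colon, cutting key from value.
parts = split_by_labels("Name:Li", [0, 0, 0, 0, 1, 0, 0])
```

Applying this to every block turns the S_N original blocks into the C_N segmented blocks used downstream.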
Preferably, step S605 specifically includes the following steps:
s6051: applying text segmentation model word vector embedding to the hidden layer matrix s_mat1 and the hidden layer matrix s_mat3 to generate an S_L × S_M × S_D word embedding tensor s_word_embedding1, and obtaining an S_L × S_M × S_D word embedding tensor s_embedding1 through one layer normalization;
s6053: encoding the hidden layer matrix s_mat1, the hidden layer matrix s_mat2 and the word embedding tensor s_embedding1 through the model encoder to obtain an S_L × S_M × S_D encoding result s_encode_mat1;
s6054: performing dimensionality reduction and abstraction on the encoded tensor s_encode_mat1 from s6053 through a pooling layer to obtain S_K feature vector matrices s_mats1 of S_L × 1.
Preferably, step S700 specifically includes the following steps:
s701: loading the text classification model configuration, the text classification model dictionary and the classification pre-training word vectors, where the word vector dimension is denoted C_D and the text classification model dictionary contains C_K labels defining word classes;
s702: converting the segmented text blocks into corresponding dictionary indexes character by character through the text classification model dictionary, converting the per-character index arrays of the C_N text blocks into C_N groups of second number sequences c_data2_in according to the preset longest sequence length C_L, and constructing the second segment sequence c_data2_segment and second label sequence c_data2_label of each of the C_N groups of second number sequences c_data2_in;
s703: selecting C_M groups of second number sequences from the C_N groups according to the batch length C_M preset for the text classification model, and combining each number sequence in the C_M groups with its corresponding second segment sequence c_data2_segment and second label sequence c_data2_label into a C_M × (2C_L + C_K) input matrix used as the second input c_input2_ids of a single run of the text classification model; when C_N > C_M, multiple second inputs c_input2_ids are generated, and when C_N is not an integer multiple of C_M, or C_N < C_M, the second input c_input2_ids of the last run is generated with a data amount smaller than C_M; the second inputs c_input2_ids are fed into the text classification model in batches for classification;
s704: inputting each single-run second input c_input2_ids into the text classification model and performing the following processing: applying a linear operation to the second input c_input2_ids and an initial variable in the text classification model to obtain a C_L × C_M hidden layer matrix c_mat1, then operating on the hidden layer matrix c_mat1, c_input2_ids and a random hidden layer matrix c_mat5 together with the parameters in the text classification model to obtain C_L × C_M hidden layer matrices c_mat2, c_mat3 and c_mat4 respectively, providing the hidden layer matrix c_mat4 to the second input c_input2_ids of the next run, repeating this process, and replacing the hidden layer matrix c_mat5 with the hidden layer matrix c_mat4 during processing;
s705: encoding and decoding the hidden layer matrices c_mat1, c_mat2 and c_mat3 to obtain C_K abstracted feature vector matrices c_mats1, completing the preliminary extraction of vector features;
s706: multiplying the feature vector matrices c_mats1 from s705 by the C_K weight matrices c_w_mats in the text classification model to obtain C_K hidden layer matrices c_mats2;
s707: connecting the hidden layer matrices c_mats2 from s706, then compressing and reducing dimensions to obtain a two-dimensional C_M × C_K second result matrix c_mat_rst2;
s708: compressing and splicing the second result matrix c_mat_rst2 into a one-dimensional second result array c_result of length C_M × C_K;
s709: following the process of s704 to s708, after processing the text classification model integrates all second inputs c_input2_ids into a one-dimensional second result array c_results of C_N × C_K, where each element is the probability that a text block corresponds to one of the C_K classification labels.
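The per-block class probabilities that s709 flattens into c_results can be illustrated as follows; the logit values and C_K = 3 are invented for the example:

```python
import math

def softmax(logits):
    """Convert one block's class logits into probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def flatten_probs(per_block_logits):
    """Flatten per-block class probabilities into the one-dimensional
    C_N x C_K result array of s709."""
    out = []
    for logits in per_block_logits:
        out.extend(softmax(logits))
    return out

# Two text blocks (C_N = 2), three labels (C_K = 3) -> array of length 6.
c_results = flatten_probs([[2.0, 0.1, 0.1], [0.1, 3.0, 0.1]])
```

Each consecutive run of C_K entries sums to 1, so downstream steps such as s802 can splice a block's probabilities directly out of the flat array.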
Preferably, step S705 specifically includes the following steps:
s7051: applying text classification model word vector embedding to the hidden layer matrix c_mat1 and the hidden layer matrix c_mat3 to generate a C_L × C_M × C_D first word embedding tensor c_word_embedding1, and obtaining a C_L × C_M × C_D second word embedding tensor c_embedding1 through one layer normalization;
s7052: encoding the hidden layer matrix c_mat1, the hidden layer matrix c_mat2 and the second word embedding tensor c_embedding1 through the text classification model encoder to obtain a C_L × C_M × C_D encoding result matrix c_encode_mat1;
s7053: performing dimensionality reduction and abstraction on the encoding result matrix c_encode_mat1 from step s7052 through a global maximum pooling layer to obtain C_K feature vector matrices c_mats1 of C_L × 1.
Preferably, step S800 specifically includes the following steps:
s801: traversing the S_N text blocks before segmentation, recording the coordinates (x1, y1) and (x2, y2) of the two main-diagonal vertices and (x3, y3) of the main-diagonal midpoint of each text block, and recording the width and height of each text block;
s802: traversing the segmented C_N text blocks; for each text block and every other text block, calculating the x-axis and y-axis distances dx1, dy1, dx3 and dy3 between the corresponding main-diagonal vertices and midpoints, the height ratio and width ratio of the two text blocks, and the aspect ratio of the text block; these seven values form a one-dimensional array e_r; the C_K probability values corresponding to the text block in the second result array c_results output by s709 are spliced to the front of e_r, and the C_K probability values corresponding to the other text block in the calculation are spliced to the back, forming a one-dimensional array of length C_K × 2 + 7; after all text blocks are added, a set e_relations of C_N arrays of length C_K × 2 + 7 is formed;
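A hedged sketch of the pair-feature construction in s801 and s802 follows; where the text is ambiguous, the choice and order of the seven geometric values are assumptions, and C_K = 2 is invented for the example:

```python
def block_geometry(b):
    """Main-diagonal vertices, midpoint, width and height of a text block
    given as (x1, y1, x2, y2) (a sketch of s801)."""
    x1, y1, x2, y2 = b
    return (x1, y1), (x2, y2), ((x1 + x2) / 2, (y1 + y2) / 2), x2 - x1, y2 - y1

def pair_features(a, b, probs_a, probs_b):
    """The 2*C_K + 7 feature vector of s802 for one ordered block pair:
    both blocks' class probabilities wrapped around seven geometric values
    (the exact seven values are an assumption here)."""
    (ax1, ay1), _, (amx, amy), aw, ah = block_geometry(a)
    (bx1, by1), _, (bmx, bmy), bw, bh = block_geometry(b)
    geom = [bx1 - ax1, by1 - ay1,   # dx1, dy1: first-vertex offsets
            bmx - amx, bmy - amy,   # dx3, dy3: midpoint offsets
            bh / ah, bw / aw,       # height and width ratios
            bh / bw]                # aspect ratio of the other block
    return probs_a + geom + probs_b

feat = pair_features((0, 0, 10, 5), (20, 0, 40, 10), [0.9, 0.1], [0.2, 0.8])
```

With C_K = 2 the vector has length 2 × 2 + 7 = 11, matching the C_K × 2 + 7 layout described in s802.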
s803: loading the prediction model configuration, the prediction model dictionary and pre-training word vectors, where the word vector dimension is denoted E_D; the prediction model defines two classes for a text block, key or value, as the final result of model processing;
s804: converting the segmented C_N text blocks into corresponding dictionary indexes character by character through the prediction model dictionary, converting the per-character index arrays of the C_N text blocks into C_N groups of third number sequences e_data3_in according to the preset longest sequence length E_L, taking the C_N groups of third number sequences, the set e_relations from s802 and the text block character arrays e_data_texts together as the prediction model input, and feeding the C_N groups of text block data into the prediction model in batches according to the batch length E_M preset for the text structured extraction model;
s805: mapping the original text block character arrays into word vector matrices using the prediction model word vectors and prediction model dictionary, and obtaining a group of E_L × E_D hidden layer matrices through an LSTM unit;
s806: segmenting and dimension-reducing the third number sequence e_data3_in into E_L scalars e_scalars, and applying linear operations to each together with initial variables in the prediction model to obtain E_L new scalars e_n_scalars and intermediate variables e_matcs;
s807: deforming the intermediate variables e_matcs obtained in step s806 into an E_L × E_D matrix;
s808: initializing a kernel scalar, combining it with parameters in the prediction model into an E_D × E_M parameter matrix, multiplying the matrix obtained in s807 by this parameter matrix, and adding a bias to obtain an E_L × E_M matrix e_mat1;
s809: activating the matrix e_mat1 obtained in step s808 with a rectified linear unit, then arranging to obtain a group of E_L × E_M matrices e_mats1;
s810: processing the information of the set e_relations following the same steps as s807 to s808, then arranging to obtain an E_L × E_M matrix e_mat2;
s811: activating the matrix e_mat2 with a rectified linear unit, then arranging to obtain a group of E_L × E_M matrices e_mats2;
s812: further deforming the matrices e_mats1 obtained in s809 and e_mats2 obtained in s811 into an E_L × E_D matrix, processing it following the same steps as s808 to s809, and arranging to obtain an E_L × E_M matrix e_mat3;
s813: activating the matrix e_mat3 with a rectified linear unit, then arranging to obtain an E_M × 2 result matrix e_mat_result;
s814: taking the maximum value of the result matrix e_mat_result from step s813 to obtain the prediction results of the prediction model for the E_M text blocks; after the prediction model has processed all batches, a result array e_results of length C_N is output, where each element marks the predicted key/value type of the corresponding text block;
s815: combining the result array e_results with the original text block data, and extracting key-value pair data conforming to the rules through logical position judgment.
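The positional key-value pairing of s815 might look like the following toy rule (nearest value block to the right on roughly the same line); the patent's actual logic judgment is richer, and the coordinates and tolerance here are invented:

```python
def pair_keys_values(blocks):
    """Pair each 'key' block with the nearest 'value' block to its right on
    roughly the same line (a toy positional rule sketching s815)."""
    values = [b for b in blocks if b["type"] == "value"]
    pairs = {}
    for k in (b for b in blocks if b["type"] == "key"):
        candidates = [v for v in values
                      if v["x"] > k["x"] and abs(v["y"] - k["y"]) < 5]
        if candidates:
            best = min(candidates, key=lambda v: v["x"] - k["x"])
            pairs[k["text"]] = best["text"]
    return pairs

blocks = [
    {"text": "Name", "type": "key", "x": 0, "y": 0},
    {"text": "Li Hua", "type": "value", "x": 50, "y": 1},
    {"text": "Date", "type": "key", "x": 0, "y": 20},
    {"text": "2022-03-15", "type": "value", "x": 50, "y": 21},
]
pairs = pair_keys_values(blocks)
```

Each key picks up the value on its own line, yielding the structured key-value pairs that step S900 formats for display.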
Preferably, S806 specifically includes the following steps:
s8061: taking one scalar e_scalar from the E_L scalars e_scalars, applying a linear operation with the initial variables of the prediction model to form a new scalar e_n_scalar, and expanding the dimensions of the hidden layer matrix in s805 to obtain an E_L × 1 × E_D tensor e_tensor1;
s8062: copy-expanding the tensor e_tensor1 according to the corresponding e_n_scalar into an E_L × E_L × E_D tensor e_tensor2;
s8063: expanding the dimensions of the hidden layer matrix in step s805 to obtain a 1 × E_L × E_D tensor e_tensor3, and expanding again following the processing method of step s8062 to obtain an E_L × E_L × E_D tensor e_tensor4;
s8064: connecting the two tensors e_tensor2 and e_tensor4 obtained from s8062 and s8063 into an E_L × E_L × 2E_D tensor e_tensor5;
s8065: applying an exponentiation operation to the scalar e_scalar and a variable in the prediction model, and combining with the E_L × E_L × 2E_D tensor e_tensor5 obtained from s8064, deforming to obtain an E_L × E_L matrix e_matc;
s8066: processing the remaining scalars of the E_L scalars following steps s8061 to s8065 to obtain the group of intermediate variables e_matcs.
The invention further provides an automatic structuring system for key information of a document image, which applies the above method to automatically structure and output the key information of document images.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention organically integrates an optical character recognition technology and a natural language processing technology, adopts the optical character recognition technology to recognize texts, and adopts the natural language processing technology to complete automatic structurization, thereby realizing the automatic flow from the input of various types of samples to the output of expected formatted texts;
(2) the invention can carry out full-page document identification and form identification and output on any type of sample;
(3) the invention can automatically filter the seal, and reduce the interference of the seal on the identification;
(4) the invention provides a method that segments line text at the word level in an automatic structured manner, classifies it accurately, and then synthesizes the relative positions of all words according to the classification results to realize automatic structured output, adapting to wider and more complex business scenarios.
Most products currently on the market still require the user to specify the business type of a sample and then upload a sample of the corresponding type for recognition. This system removes the sample-type selection step: the user simply uploads a sample and clicks recognize, and the system automatically outputs the data in the sample in structured form, so the user can quickly and intuitively find the desired content in the recognition result. This greatly simplifies the operation flow and improves the user experience.
Drawings
FIG. 1 is a flow diagram of one embodiment of a method for automatically structuring key information of a document image according to the present invention;
FIG. 2 is a diagram of the display effect of a multi-file input interface of an embodiment of the system for automatically structuring key information of document images of the present invention;
FIG. 3 is a diagram of the display effect of the interface of the beginning recognition of one embodiment of the system for automatically structuring key information of document images of the present invention;
FIG. 4 is a diagram of the display effect of the recognition result interface of one embodiment of the system for automatically structuring key information of document images of the present invention;
FIG. 5 is a flow diagram of one embodiment of a method for automatically structuring key information of a document image in accordance with the present invention;
FIG. 6 is a flowchart of text segmentation of an embodiment of a method for automatically structuring key information of a document image according to the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The invention provides an automatic structuring method of key information of a document image, which comprises the following steps:
s100: acquiring sample image data of a document;
s300: carrying out direction correction and gradient correction preprocessing on the sample image;
s400: recognizing the characters in the sample image using optical character recognition, and arranging the characters into text form by row; this step also includes filtering out seals. Whether the content is a sales contract, a business form or a certificate photo, and whether the file format is an image, a PDF or a Word document, the document can be uploaded, recognized and its result output;
s500: preprocessing a text to obtain text data taking a text block as a unit;
s600: combining the file data in units of text blocks with the model dictionary of the text segmentation model, converting each text block into a number sequence, obtaining the mask sequence, segment sequence and label sequence corresponding to each number sequence, inputting the number sequences into a machine learning model for processing, restoring the model output according to the mask sequences to obtain the processing result of each text block, and segmenting the text blocks according to the label sequences;
s700: classifying the segmented texts according to a text classification model dictionary, and integrating the classified texts into a one-dimensional array;
s800: generating structured extraction input information according to the distance, the width and the height among the text blocks and the one-dimensional array output in S700, inputting the structured extraction input information into a structured extraction model, predicting the text blocks, and extracting key value pair data which accord with rules according to a prediction result;
s900: and displaying the extracted structured data after the extracted structured data is subjected to preset format processing.
According to a specific embodiment of the present invention, step S100 comprises the steps of:
s101: reading file data of files in various file formats;
s102: the file is divided into single pages by setting the ID of each page of file data in the file, and then each single page is converted into image data.
According to a specific embodiment of the present invention, between S100 and S300, S200 is further included, where the general text recognition model, the text segmentation model, the text classification model, and the text structured extraction model and their configuration files are loaded for text recognition, text segmentation, text classification, and text structured extraction, respectively.
According to an embodiment of the present invention, step S300 specifically includes the following steps:
s301: judging whether the image is landscape or portrait through layout analysis, judging whether it is upright or inverted through optical character recognition, and making the image directions consistent through rotation;
s302: calculating the image inclination angle from frame-line or text information. Specifically, the general text recognition model checks whether the inclination of the characters and frame lines falls within a preset normal range, the inclination angle is calculated by comparison against the preset normal arrangement of the characters, and the image is rotated by that angle to eliminate the skew.
According to an embodiment of the present invention, step S500 specifically includes:
s501: performing text analysis on the text after the optical character recognition and the arrangement, and performing the following processing on the contents of all text blocks, including:
restoring the word order through the relative position;
removing part of illegal characters;
clearing an empty text block;
arranging the optical-character-recognized text line by line, from left to right and top to bottom, according to the position information; if a form appears, arranging the text by cell instead. For example, if two lines of text appear in one cell, they are merged into a single line, and the results are then sorted from left to right and top to bottom;
s502: the line-sorted text data obtained in step S501 is further processed: it is split into blocks according to changes in the spacing between texts within a line, giving a group of text data in units of text blocks, where the number of text blocks is denoted S_N.
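The gap-based blocking of S502 can be sketched as follows; the word tuples and the median-gap threshold are assumptions, since the patent does not specify the exact rule:

```python
def split_line_into_blocks(words, gap_factor=2.0):
    """Split one OCR line (S501 output) into text blocks (S502).

    `words` is a list of (text, x_left, x_right) sorted by x_left; a new
    block starts wherever the gap to the previous word jumps well above
    the line's typical inter-word gap."""
    blocks, current = [], [words[0][0]]
    gaps = [words[i][1] - words[i - 1][2] for i in range(1, len(words))]
    typical = sorted(gaps)[(len(gaps) - 1) // 2] if gaps else 0  # lower median
    for i in range(1, len(words)):
        gap = words[i][1] - words[i - 1][2]
        if typical > 0 and gap > gap_factor * typical:
            blocks.append("".join(current))
            current = []
        current.append(words[i][0])
    blocks.append("".join(current))
    return blocks

line = [("发票", 0, 20), ("号码", 22, 42), ("NO.12345", 120, 200)]
blocks = split_line_into_blocks(line)
# the large gap before "NO.12345" starts a new block
```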
According to an embodiment of the present invention, step S600 specifically includes the following steps:
s601: loading the text segmentation model configuration, the text segmentation model dictionary and the segmentation pre-training word vectors, where the word vector dimension is denoted S_D and the text segmentation model dictionary contains S_K labels defining word parts of speech;
s602: converting the S_N text blocks in the block-level text data obtained in S502, character by character, into the corresponding dictionary indexes through the text segmentation model dictionary; converting the character index array of each of the S_N text blocks into the S_N groups of first digital sequences s_data1_in according to the maximum sequence length S_L preset by the text segmentation model; and constructing the first mask sequence s_data1_mask, first segment sequence s_data1_segment and first label sequence s_data1_label of each of the S_N groups of first digital sequences s_data1_in;
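The sequence construction of S602 might look like this in outline; the padding and unknown-character ids and the toy dictionary are illustrative, since the real indexes and label scheme come from the trained segmentation model:

```python
def encode_block(text, vocab, s_l, pad_id=0, unk_id=1):
    """Build the four aligned sequences of S602 for one text block:
    index sequence, mask, segment and label placeholders, each
    padded/truncated to the model's maximum sequence length S_L."""
    ids = [vocab.get(ch, unk_id) for ch in text][:s_l]
    mask = [1] * len(ids)
    pad = s_l - len(ids)
    return (ids + [pad_id] * pad,   # s_data1_in
            mask + [0] * pad,       # s_data1_mask: 1 = real char, 0 = padding
            [0] * s_l,              # s_data1_segment: single segment here
            [pad_id] * s_l)         # s_data1_label: filled in by the model

vocab = {"金": 2, "额": 3, "壹": 4}
ids, mask, seg, lab = encode_block("金额:壹佰", vocab, s_l=8)
```

The mask is what later lets S614 strip the padding back off and restore a per-block result.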
s603: sequentially selecting S_M groups of first digital sequences from the S_N groups according to the batch length S_M preset by the text segmentation model; combining each digital sequence in the S_M groups with its corresponding first mask sequence s_data1_mask, first segment sequence s_data1_segment and first label sequence s_data1_label into a one-dimensional array of length S_M × S_L × 4, used as the first input s_input1_ids of a single run of the text segmentation model; when S_N > S_M, a plurality of single-run first inputs s_input1_ids are generated, and when S_N is not an integer multiple of S_M or S_N < S_M, the first input s_input1_ids of the last run is generated with a data amount smaller than S_M;
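The batching rule of S603, including the smaller final batch when S_N is not a multiple of S_M, can be sketched as:

```python
def make_batches(sequences, s_m):
    """Group S_N sequences into runs of at most S_M each (S603);
    the last batch is smaller when S_N is not a multiple of S_M."""
    return [sequences[i:i + s_m] for i in range(0, len(sequences), s_m)]

batches = make_batches(list(range(10)), s_m=4)  # batch sizes 4, 4, 2
```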
s604: feeding each single-run first input s_input1_ids into the text segmentation model and performing the following processing: performing a linear operation on the first input s_input1_ids and an initial variable in the text segmentation model to obtain a hidden layer matrix S_mat1 of size S_L × S_M; then operating on the hidden layer matrix S_mat1, the first input s_input1_ids and a random hidden-layer state matrix S_mat5 with internal parameters of the model, to obtain hidden layer matrices S_mat2, S_mat3 and S_mat4, each of size S_L × S_M; providing the hidden layer matrix S_mat4 to the next run's first input s_input1_ids and repeating the process, with S_mat4 replacing S_mat5 in each repetition. Initial variables in the model are state or coefficient variables; a linear operation of the form ax + b between them and a matrix yields a new matrix with the same shape as the original;
s605: applying text segmentation word-vector embedding and encoding-decoding to the hidden layer matrices S_mat1, S_mat2 and S_mat3 to obtain S_K abstracted feature vector matrices S_mats1, completing the preliminary extraction of vector features. In this invention, "abstraction" and "dimensionality reduction" are used synonymously: a word vector containing S_D feature values is made more concentrated and its features more salient through dimensionality reduction, and the purpose of the abstracting transformation is a first round of feature extraction on the matrix.
S606: multiplying the S _ mats1 output in the S605 by S _ K weight matrixes S _ w _ mats in the text segmentation model to obtain S _ K new hidden layer matrixes S _ mats 2;
s607: connecting the hidden layer matrixes S _ mats2 in the S606, compressing and reducing dimensions to obtain a compressed matrix S _ squeeze1 of L x K, and further extracting vector features; the compression dimensionality reduction can adopt the following method, firstly, the matrixes are connected into a large matrix, the shape of the large matrix is L x1 x K, and then, the large matrix is subjected to dimensionality reduction compression once again to become the matrix of L x K; the purpose of dimension reduction here is still to highlight the features of the corresponding vectors.
S608: processing the hidden layer matrix S _ mats2 and the compressed matrix S _ squeeze1 of S607 in step S606 by a recurrent neural network to obtain tensors S _ mat6 of S _ L (S _ M-1) S _ K and a matrix S _ mat7 of S _ L S _ K;
s609: performing maximum value dimensionality reduction operation on the matrix S _ mat7 in combination with the dimension value S _ D to obtain a one-dimensional vector S _ expand with the length of S _ M;
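The maximum-value dimensionality reduction of S609 is, in outline, a column-wise max; this is a generic sketch, as the exact axis and the combination with S_D are internal to the model:

```python
def reduce_max(matrix):
    """Collapse a 2-D matrix to a vector by taking the maximum of each
    column, as in the 'maximum value dimensionality reduction' of S609."""
    return [max(col) for col in zip(*matrix)]

s_mat7 = [[0.1, 0.9, 0.2],
          [0.4, 0.3, 0.8],
          [0.7, 0.2, 0.5]]
s_expand = reduce_max(s_mat7)  # one value per column survives
```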
s610: processing the tensor S_mat6 through a recurrent neural network and compressing it to obtain a compressed matrix S_squeeze2 of size S_L × (S_M-1);
s611: connecting the compressed matrix S_squeeze2 with the vector S_expand to obtain a first result matrix S_mat_rst1 of size S_L × S_M;
s612: summing and reducing the hidden layer matrix S_mat2 of S604 to obtain a digital sequence of length S_L, and inverting the first result matrix S_mat_rst1 of step S611 according to this sequence to obtain a one-dimensional array S_mat1_result of length S_L × S_M;
s613: following the process of S604 to S612, once all batches are processed the text segmentation model integrates all first inputs s_input1_ids into a result array S_mat1_results of length S_L × S_N, which is the output of the text segmentation model;
s614: dividing the result array S_mat1_results into S_N one-dimensional result arrays S_mat1_result of length S_L, and restoring the processing result s_result of each corresponding text block one by one according to the values in the mask sequence s_data1_mask of each one-dimensional result array;
s615: judging the segmentation points of each text block from the label values in its processing result s_result, and completing the segmentation of each text block at those positions; the number of resulting text blocks is denoted C_N.
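The split-point logic of S615 can be sketched with an illustrative BI-style label scheme; the patent does not name its actual label values:

```python
def split_at_labels(text, labels, split_label="B"):
    """S615 sketch: cut a text block wherever its per-character label
    marks the start of a new unit. The 'B'/'I' scheme is an assumption."""
    pieces, start = [], 0
    for i in range(1, len(text)):
        if labels[i] == split_label:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces

pieces = split_at_labels("购买方名称", list("BIIBI"))
# the 'B' at position 3 splits the block into two smaller units
```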
According to an embodiment of the present invention, step S605 specifically includes the following steps:
s6051: applying text segmentation word-vector embedding to the hidden layer matrices S_mat1 and S_mat3 to generate a word embedding tensor S_word_embedding1 of size S_L × S_M × S_D, and obtaining a word embedding tensor S_embedding1 of size S_L × S_M × S_D through one round of layer normalization;
s6053: encoding the hidden layer matrices S_mat1 and S_mat2 and the word embedding tensor S_embedding1 through the model encoder to obtain an encoding result S_encode_mat1 of size S_L × S_M × S_D;
s6054: performing dimensionality reduction and abstraction on the encoded tensor S_encode_mat1 of S6053 through a pooling layer to obtain S_K feature vector matrices S_mats1 of size S_L × 1.
According to an embodiment of the present invention, step S700 specifically includes the following steps:
s701: loading the text classification model configuration, the text classification model dictionary and the classification pre-training word vectors, where the word vector dimension is denoted C_D and the text classification model dictionary contains C_K labels defining word classes;
s702: converting the segmented text blocks, character by character, into the corresponding dictionary indexes through the text classification model dictionary; converting the character index arrays of the C_N text blocks into C_N groups of second digital sequences c_data2_in according to the preset maximum sequence length C_L, and constructing the second segment sequence c_data2_segment and second label sequence c_data2_label of each of the C_N groups of second digital sequences c_data2_in;
s703: selecting C_M groups of second digital sequences from the C_N groups according to the batch length C_M preset by the text classification model; combining each digital sequence in the C_M groups with its corresponding second segment sequence c_data2_segment and second label sequence c_data2_label into an input matrix of size C_M × (2C_L + C_K), used as the second input c_input2_ids of a single run of the text classification model; when C_N > C_M, a plurality of second inputs c_input2_ids are generated, and when C_N is not an integer multiple of C_M or C_N < C_M, the second input c_input2_ids of the last run is generated with a data amount smaller than C_M; the second inputs are then fed into the text classification model in batches for classification;
s704: feeding each single-run second input c_input2_ids into the text classification model and performing the following processing: performing a linear operation on the second input c_input2_ids and an initial variable in the text classification model to obtain a hidden layer matrix C_mat1 of size C_L × C_M; then operating on the hidden layer matrix C_mat1, c_input2_ids and a random hidden layer matrix C_mat5 with parameters of the model, to obtain hidden layer matrices C_mat2, C_mat3 and C_mat4, each of size C_L × C_M; providing the hidden layer matrix C_mat4 to the next run's second input c_input2_ids and repeating the process, with C_mat4 replacing C_mat5 in each repetition;
s705: encoding and decoding the hidden layer matrices C_mat1, C_mat2 and C_mat3 to obtain C_K abstracted feature vector matrices C_mats1, completing the preliminary extraction of vector features;
s706: multiplying the feature vector matrices C_mats1 of S705 by the C_K weight matrices C_w_mats in the text classification model to obtain C_K hidden layer matrices C_mats2;
s707: connecting the hidden layer matrices C_mats2 of S706 and compressing them to obtain a two-dimensional second result matrix C_mat_rst2 of size C_M × C_K;
s708: compressing and splicing the second result matrix C_mat_rst2 into a one-dimensional second result array c_result of length C_M × C_K;
s709: following the process of S704 to S708, once all batches are processed the text classification model integrates all second inputs c_input2_ids into a one-dimensional second result array c_results of length C_N × C_K, in which each element is the probability that a text block corresponds to one of the C_K classification labels.
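Turning the flat C_N × C_K probability array of S709 into one label per block might look like this; the tag names are illustrative:

```python
def pick_labels(c_results, c_k, tags):
    """Unflatten the C_N*C_K probability array of S709 into C_K-sized
    chunks and take the argmax label for each text block."""
    out = []
    for i in range(0, len(c_results), c_k):
        probs = c_results[i:i + c_k]
        out.append(tags[probs.index(max(probs))])
    return out

tags = ["key", "value", "other"]             # illustrative C_K = 3 classes
flat = [0.8, 0.1, 0.1,   0.2, 0.7, 0.1]      # two blocks, flattened
labels = pick_labels(flat, 3, tags)
```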
According to an embodiment of the present invention, step S705 specifically includes the following steps:
s7051: applying text classification word-vector embedding to the hidden layer matrices C_mat1 and C_mat3 to generate a first word embedding tensor C_word_embedding1 of size C_L × C_M × C_D, and obtaining a second word embedding tensor C_embedding1 of size C_L × C_M × C_D through one round of layer normalization;
s7052: encoding the hidden layer matrices C_mat1 and C_mat2 and the second word embedding tensor C_embedding1 through the text classification model encoder to obtain an encoding result matrix C_encode_mat1 of size C_L × C_M × C_D;
s7053: performing dimensionality reduction and abstraction on the encoding result matrix C_encode_mat1 of step S7052 through a global max pooling layer to obtain C_K feature vector matrices C_mats1 of size C_L × 1.
According to an embodiment of the present invention, step S800 specifically includes the following steps:
s801: traversing the S_N pre-segmentation text blocks, recording the coordinates (x1, y1) and (x2, y2) of the two main-diagonal vertices and (x3, y3) of the diagonal midpoint of each text block, together with the width and height of each text block;
s802: traversing the C_N segmented text blocks and, for each pair of text blocks, calculating the x- and y-axis distances dx1, dy1, dx3 and dy3 between the corresponding main-diagonal vertices and midpoints, the height ratio and width ratio of the two text blocks, and the width-to-height ratio of the text block; these seven values form a one-dimensional array e_r. The C_K probability values of the text block in the second result array c_results output by S709 are spliced onto the front of e_r, and the C_K probability values of the other text block currently participating in the calculation onto its back, giving a one-dimensional array of length C_K × 2 + 7; once every text block has been added, these form a set e_relations of C_N arrays of length C_K × 2 + 7;
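The pair-feature construction of S802 can be sketched as follows; the field names and the seventh feature (taken here as the block's width-to-height ratio) are assumptions:

```python
def pair_features(block_a, block_b):
    """Build the 7-element geometric array e_r of S802 for one pair of
    text blocks. Each block is a dict with main-diagonal vertex p1,
    diagonal midpoint p3, and width w / height h (names illustrative)."""
    dx1 = block_b["p1"][0] - block_a["p1"][0]   # vertex x-distance
    dy1 = block_b["p1"][1] - block_a["p1"][1]   # vertex y-distance
    dx3 = block_b["p3"][0] - block_a["p3"][0]   # midpoint x-distance
    dy3 = block_b["p3"][1] - block_a["p3"][1]   # midpoint y-distance
    h_ratio = block_a["h"] / block_b["h"]
    w_ratio = block_a["w"] / block_b["w"]
    aspect = block_a["w"] / block_a["h"]        # assumed seventh feature
    return [dx1, dy1, dx3, dy3, h_ratio, w_ratio, aspect]

a = {"p1": (0, 0), "p3": (10, 5), "w": 20, "h": 10}
b = {"p1": (30, 0), "p3": (40, 5), "w": 20, "h": 10}
probs_a, probs_b = [0.9, 0.1], [0.2, 0.8]        # C_K = 2 class probabilities
row = probs_a + pair_features(a, b) + probs_b    # length C_K*2 + 7
```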
s803: loading the prediction model configuration, the prediction model dictionary and pre-training word vectors, where the word vector dimension is E_D; the prediction model defines two classes for a text block, key or value, as the final result of model processing;
s804: converting the C_N segmented text blocks, character by character, into the corresponding dictionary indexes through the prediction model dictionary; converting the character index arrays of the C_N text blocks into C_N groups of third digital sequences e_data3_in according to the preset maximum sequence length E_L; taking the C_N groups of third digital sequences, the set e_relations of S802 and the text block character arrays e_data_texts together as the prediction model input, and feeding the C_N groups of text block data into the prediction model in batches according to the batch length E_M preset by the text structured extraction model;
s805: mapping the original text block character arrays to word vector matrices with the prediction model word vectors and dictionary, and obtaining a group of hidden layer matrices of size E_L × E_D through an LSTM unit;
s806: segmenting and reducing the third digital sequence e_data3_in into E_L scalars e_scalars, and performing a linear operation on each with initial variables of the prediction model to obtain E_L new scalars e_n_scalars and the intermediate variables e_matcs;
s807: reshaping the intermediate variables e_matcs obtained in step S806 into a matrix of size E_L × E_D;
s808: initializing a kernel scalar, combining it with parameters of the prediction model into a parameter matrix of size E_D × E_M, operating on this parameter matrix with the matrix obtained in S807, and adding a bias to obtain a matrix E_mat1 of size E_L × E_M;
s809: activating the matrix E_mat1 obtained in step S808 through a linear rectification function, and then reorganizing it to obtain a group of matrices E_mats1 of size E_L × E_M;
s810: processing the information of the set e_relations according to the same steps as S807 to S808, then reorganizing it to obtain a matrix E_mat2 of size E_L × E_M;
s811: activating the matrix E_mat2 through a linear rectification function, and then reorganizing it to obtain a group of matrices E_mats2 of size E_L × E_M;
s812: further reshaping the group of matrices E_mats1 obtained in S809 and the group of matrices E_mats2 obtained in S811 into a matrix of size E_L × E_D, processing it according to the same steps as S808 to S809, and reorganizing it to obtain a matrix E_mat3 of size E_L × E_M;
s813: activating the matrix E_mat3 through a linear rectification function, and then reorganizing it to obtain a result matrix E_mat_result of size E_M × 2;
s814: taking the maximum of the result matrix E_mat_result of step S813 to obtain the prediction results E_mat_results of the prediction model for the E_M text blocks; after the prediction model has processed all batches, it outputs a result array e_results of length C_N, in which each element marks the predicted key/value type of the corresponding text block;
s815: combining the result array e_results with the original text block data, and extracting the key-value pair data that satisfy the rules through logical position judgment.
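A minimal sketch of the positional key-value pairing in S815, assuming each block carries its predicted type and coordinates; the pairing rule shown (nearest value block to the right on roughly the same line) is one plausible reading, as the patent does not spell out its full rule set:

```python
def extract_pairs(blocks):
    """Pair each predicted 'key' block with the nearest 'value' block to
    its right whose vertical offset is within one block height."""
    result = {}
    for k in (b for b in blocks if b["type"] == "key"):
        candidates = [v for v in blocks
                      if v["type"] == "value"
                      and v["x"] > k["x"]
                      and abs(v["y"] - k["y"]) < k["h"]]
        if candidates:
            best = min(candidates, key=lambda v: v["x"] - k["x"])
            result[k["text"]] = best["text"]
    return result

blocks = [
    {"text": "金额", "type": "key", "x": 0, "y": 10, "h": 12},
    {"text": "100元", "type": "value", "x": 50, "y": 11, "h": 12},
    {"text": "日期", "type": "key", "x": 0, "y": 40, "h": 12},
    {"text": "2022-03-14", "type": "value", "x": 50, "y": 41, "h": 12},
]
pairs = extract_pairs(blocks)
```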
According to an embodiment of the present invention, S806 specifically includes the following steps:
s8061: taking one scalar e_scalar from the E_L scalars e_scalars and performing a linear operation with the initial variables of the prediction model to form a new scalar e_n_scalar; expanding the dimensions of the hidden layer matrix of S805 to obtain a tensor E_tensor1 of size E_L × 1 × E_D;
s8062: expanding the tensor E_tensor1, by copying according to the corresponding e_n_scalar, into a tensor E_tensor2 of size E_L × E_L × E_D;
s8063: expanding the dimensions of the hidden layer matrix of step S805 to obtain a tensor E_tensor3 of size 1 × E_L × E_D, and expanding it again, following the method of step S8062, into a tensor E_tensor4 of size E_L × E_L × E_D;
s8064: concatenating the two tensors E_tensor2 and E_tensor4 obtained in S8062 and S8063 into a tensor E_tensor5 of size E_L × E_L × 2E_D;
s8065: performing an exponentiation on the scalar e_scalar and a variable in the prediction model and, combined with the tensor E_tensor5 of size E_L × E_L × 2E_D obtained in S8064, reshaping to obtain a matrix e_matc of size E_L × E_L;
s8066: processing the remaining scalars of the E_L scalars according to steps S8061 to S8065 to obtain the group of intermediate variables e_matcs.
The pre-training word vectors of the segmentation model, the classification model and the structured extraction model differ, while the three models may share the same dictionary. To ensure that the variable scopes do not affect each other, each dictionary is loaded together with its model and word vectors.
Results after processing by each model:
Segmentation model: the S_N text blocks are further divided, by part of speech such as verbs and nouns, into smaller units, yielding C_N text blocks.
Classification model: all C_N text blocks are further classified, each text block corresponding after classification to one sub-class of key or value.
Structured extraction model: the C_N text blocks are predicted in combination with the classification result, estimating for each text block the probability of being a key or value relative to every other text block; all plausible structured key-value pairs are then collated from the predictions as the final output.
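The division of labour between the three models can be sketched end to end; every callable here is a stand-in for the corresponding model, not the patent's implementation:

```python
def structure_document(image, ocr, segment, classify, extract):
    """End-to-end flow summarized above, with the models as callables."""
    blocks = ocr(image)                   # S400-S500: text blocks from OCR
    small_blocks = segment(blocks)        # S600: part-of-speech segmentation
    labels = classify(small_blocks)       # S700: key/value sub-class per block
    return extract(small_blocks, labels)  # S800: structured key-value pairs

# Toy stand-ins demonstrating the data flow only:
pairs = structure_document(
    "invoice.png",
    ocr=lambda img: ["金额100元"],
    segment=lambda bs: ["金额", "100元"],
    classify=lambda bs: ["key", "value"],
    extract=lambda bs, ls: {bs[i]: bs[i + 1]
                            for i, l in enumerate(ls) if l == "key"},
)
```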
The invention provides an automatic structuralization system for key information of a document image, which is used for automatically structuralizing and outputting the key information of the document image.
Example 1
According to an embodiment of the invention, the document image key information automatic structuring method of the invention is described in detail with reference to the accompanying drawings.
The invention provides an automatic structuring method of key information of a document image, which comprises the following steps:
s100: acquiring sample image data of a document;
s300: carrying out direction correction and gradient correction preprocessing on the sample image;
s400: recognizing characters in the sample image by optical character recognition, and arranging the characters into text form line by line;
s500: preprocessing a text to obtain text data taking a text block as a unit;
s600: combining the file data in units of text blocks with the model dictionary of the text segmentation model, converting each text block into a digital sequence, obtaining the mask sequence, segment sequence and label sequence corresponding to each digital sequence, inputting the digital sequences into the machine learning model for processing, restoring the model output according to the mask sequences to obtain the processing result of each text block, and segmenting the text blocks according to the label sequences;
s700: classifying the segmented texts according to a text classification model dictionary, and integrating the classified texts into a one-dimensional array;
s800: generating structured extraction input information according to the distance, the width and the height among the text blocks and the one-dimensional array output in S700, inputting the structured extraction input information into a structured extraction model, predicting the text blocks, and extracting key value pair data which accord with rules according to a prediction result;
s900: and displaying the extracted structured data after the extracted structured data is subjected to preset format processing.
Example 2
According to an embodiment of the invention, the document image key information automatic structuring method of the invention is described in detail with reference to the accompanying drawings.
The invention provides an automatic structuring method of key information of a document image, which comprises the following steps:
s100: acquiring sample image data of a document; step S100 includes the steps of:
s101: reading file data of files in various file formats;
s102: the file is divided into single pages by setting the ID of each page of file data in the file, and then each single page is converted into image data.
S200, loading a general text recognition model, a text segmentation model, a text classification model, a text structured extraction model and configuration files thereof, wherein the general text recognition model, the text segmentation model, the text classification model and the text structured extraction model are respectively used for text recognition, text segmentation, text classification and text structured extraction;
s300: carrying out direction correction and gradient correction preprocessing on the sample image;
s400: recognizing characters in the sample image by optical character recognition, and arranging the characters into text form line by line, which also includes filtering out seals and stamps;
s500: preprocessing a text to obtain text data taking a text block as a unit;
s600: combining the file data in units of text blocks with the model dictionary of the text segmentation model, converting each text block into a digital sequence, obtaining the mask sequence, segment sequence and label sequence corresponding to each digital sequence, inputting the digital sequences into the machine learning model for processing, restoring the model output according to the mask sequences to obtain the processing result of each text block, and segmenting the text blocks according to the label sequences;
s700: classifying the segmented texts according to a text classification model dictionary, and integrating the classified texts into a one-dimensional array;
s800: generating structured extraction input information according to the distance, the width and the height among the text blocks and the one-dimensional array output in S700, inputting the structured extraction input information into a structured extraction model, predicting the text blocks, and extracting key value pair data which accord with rules according to a prediction result;
s900: and displaying the extracted structured data after the extracted structured data is subjected to preset format processing.
Example 3
According to an embodiment of the invention, the document image key information automatic structuring method of the invention is described in detail with reference to the accompanying drawings.
The invention provides an automatic structuring method of key information of a document image, which comprises the following steps:
s100: acquiring sample image data of a document; step S100 includes the steps of:
s101: reading file data of files in various file formats;
s102: the file is divided into single pages by setting the ID of each page of file data in the file, and then each single page is converted into image data.
S200, loading a general text recognition model, a text segmentation model, a text classification model, a text structured extraction model and configuration files thereof, wherein the general text recognition model, the text segmentation model, the text classification model and the text structured extraction model are respectively used for text recognition, text segmentation, text classification and text structured extraction;
s300: carrying out direction correction and gradient correction preprocessing on the sample image; the method specifically comprises the following steps:
s301: judging whether the image is landscape or portrait through layout analysis, judging whether it is upright or inverted through optical character recognition, and making the image directions consistent through rotation;
s302: calculating the image inclination angle from frame-line or text information. Specifically, the general text recognition model checks whether the inclination of the characters and frame lines falls within a preset normal range, the inclination angle is calculated by comparison against the preset normal arrangement of the characters, and the image is rotated by that angle to eliminate the skew.
S400: identifying characters in the sample image by adopting optical character identification, and arranging the characters into a text form according to lines; wherein, also include filtering the seal;
s500: preprocessing a text to obtain text data taking a text block as a unit; the method specifically comprises the following steps:
s501: performing text analysis on the text after the optical character recognition and the arrangement, and performing the following processing on the contents of all text blocks, including:
restoring the word order through the relative position;
removing part of illegal characters;
clearing an empty text block;
arranging the optical-character-recognized text line by line, from left to right and top to bottom, according to the position information; if a form appears, arranging the text by cell instead;
s502: the line-sorted text data obtained in step S501 is further processed: it is split into blocks according to changes in the spacing between texts within a line, giving a group of text data in units of text blocks, where the number of text blocks is denoted S_N.
S600: combining file data with text blocks as units with a model dictionary of a text segmentation model, converting each text block into a digital sequence, obtaining a mask sequence, a segment sequence and a label sequence corresponding to each digital sequence, inputting the digital sequences into a machine learning model for processing, reducing the machine learning model output according to the mask sequence to obtain a processing result of each text block, and segmenting the text blocks according to the label sequence; the method specifically comprises the following steps:
s601: loading the text segmentation model configuration, the text segmentation model dictionary and the segmentation pre-training word vectors, where the word vector dimension is denoted S_D, and the text segmentation model dictionary comprises S_K labels defining word parts of speech;
s602: converting the S_N text blocks in the text data in units of text blocks obtained in S502 into corresponding dictionary indexes character by character through the text segmentation model dictionary, converting the character index array of each of the S_N text blocks into S_N groups of first number sequences S_data1_in according to the maximum sequence length S_L preset by the text segmentation model, and constructing the first mask sequence S_data1_mask, the first segment sequence S_data1_segment and the first label sequence S_data1_label of each of the S_N groups of first number sequences S_data1_in;
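The dictionary-index conversion of s602 can be sketched as below; the padding index 0, out-of-vocabulary index 1 and single-segment assumption are illustrative choices not fixed by the patent.

```python
def encode_blocks(blocks, vocab, seq_len):
    """Turn each text block into a fixed-length index sequence plus the
    mask and segment sequences described in s602. Index 0 is assumed to
    be padding and index 1 the out-of-vocabulary slot; both are
    illustrative choices, not taken from the patent.
    """
    data_in, data_mask, data_segment = [], [], []
    for block in blocks:
        ids = [vocab.get(ch, 1) for ch in block[:seq_len]]
        mask = [1] * len(ids)                 # 1 marks a real character
        pad = seq_len - len(ids)
        data_in.append(ids + [0] * pad)       # pad indices with 0
        data_mask.append(mask + [0] * pad)    # 0 marks padding
        data_segment.append([0] * seq_len)    # single segment per block
    return data_in, data_mask, data_segment
```

The mask sequence built here is what s614 later uses to restore the per-character results of each block from the padded model output.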
s603: sequentially selecting S_M groups of first number sequences from the S_N groups of first number sequences according to the batch length S_M preset by the text segmentation model, and combining each group of number sequences in the S_M groups with its corresponding first mask sequence S_data1_mask, first segment sequence S_data1_segment and first label sequence S_data1_label into a one-dimensional array of length S_M × S_L × 4, used as the first input S_input1_ids of a single run of the text segmentation model; a plurality of first inputs S_input1_ids are generated when S_N > S_M, and a first input S_input1_ids of the last run with a data amount less than S_M is generated when S_N is not an integer multiple of S_M or S_N < S_M;
s604: inputting each first input s_input1_ids of a single run into the text segmentation model, and performing the following processing: performing a linear operation on the first input S_input1_ids and the initial variables in the text segmentation model to obtain a hidden layer matrix S_mat1 of S_L × S_M; then operating the hidden layer matrix S_mat1, the first input S_input1_ids and a random hidden layer state matrix S_mat5 with the internal parameters of the text segmentation model to obtain hidden layer matrices S_mat2, S_mat3 and S_mat4 of S_L × S_M respectively; providing the hidden layer matrix S_mat4 for the first input S_input1_ids of the next run, and repeating this process with the hidden layer matrix S_mat4 replacing the hidden layer matrix S_mat5;
s605: performing text segmentation model word vector embedding and encoding/decoding on the hidden layer matrix S_mat1, the hidden layer matrix S_mat2 and the hidden layer matrix S_mat3 to obtain S_K abstracted feature vector matrices S_mats1, completing the preliminary extraction of vector features; step S605 specifically comprises the following steps:
s6051: performing text segmentation model word vector embedding on the hidden layer matrix S_mat1 and the hidden layer matrix S_mat3 to generate a word embedding tensor S_word_embedding1 of S_L × S_M × S_D, and obtaining a word embedding tensor S_embedding1 of S_L × S_M × S_D through primary layer normalization;
s6052: encoding the hidden layer matrix S_mat1, the hidden layer matrix S_mat2 and the word embedding tensor S_embedding1 through the model encoder to obtain an encoding result S_encode_mat1 of S_L × S_M × S_D;
s6053: performing dimensionality reduction and abstraction on the tensor S_encode_mat1 encoded in S6052 through a pooling layer to obtain S_K feature vector matrices S_mats1 of S_L × 1.
S606: multiplying the S_mats1 output in S605 by the S_K weight matrices S_w_mats in the text segmentation model to obtain S_K new hidden layer matrices S_mats2;
s607: connecting the hidden layer matrices S_mats2 of S606, then compressing and reducing dimensions to obtain a compressed matrix S_squeeze1 of S_L × S_K, further extracting vector features;
s608: processing the hidden layer matrices S_mats2 of step S606 and the compressed matrix S_squeeze1 of S607 through a recurrent neural network to obtain a tensor S_mat6 of S_L × (S_M-1) × S_K and a matrix S_mat7 of S_L × S_K;
s609: performing a maximum value dimensionality reduction operation on the matrix S_mat7 in combination with the dimension value S_D to obtain a one-dimensional vector S_expand of length S_M;
s610: processing the tensor S_mat6 through a recurrent neural network, then compressing and reducing dimensions to obtain a compressed matrix S_squeeze2 of S_L × (S_M-1);
s611: connecting the compressed matrix S_squeeze2 with the vector S_expand to obtain a first result matrix S_mat_rst1 of S_L × S_M;
s612: summing and reducing dimensions of the hidden layer matrix S_mat2 of S604 to obtain a number sequence of length S_L, and performing sequence inversion on the first result matrix S_mat_rst1 obtained in step S611 according to this number sequence to obtain a one-dimensional array S_mat1_result of length S_L × S_M;
s613: according to the process of S604-S612, after batch processing is completed, the text segmentation model integrates all first inputs S_input1_ids into a result array S_mat1_results of S_L × S_N, which is used as the output result of the text segmentation model;
s614: dividing the result array S_mat1_results output by the text segmentation model into S_N one-dimensional result arrays of length S_L, and restoring the processing result s_result of the corresponding text block one by one according to the values in the mask sequence S_data1_mask corresponding to each one-dimensional result array;
s615: judging the segmentation points of each text block according to the label values associated with the processing result s_result of each text block, completing the segmentation of each text block according to the positions of the segmentation points, and denoting the number of processed text blocks by C_N.
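The segmentation-point logic of s614-s615 can be sketched as follows; which label value marks a split point is model-specific, so `split_label` here is a stand-in.

```python
def split_block(text, labels, split_label=1):
    """Split `text` at positions whose predicted label equals
    `split_label`. The label value is a stand-in; the patent only says
    segmentation points are judged from the label values in s_result.
    """
    pieces, start = [], 0
    for i, lab in enumerate(labels):
        if lab == split_label:
            if i > start:
                pieces.append(text[start:i])
            start = i  # the split character begins the next piece
    if start < len(text):
        pieces.append(text[start:])
    return pieces
```

Running every restored per-block result through such a function yields the C_N segmented blocks consumed by the classification stage.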
S700: classifying the segmented texts according to a text classification model dictionary, and integrating the classified texts into a one-dimensional array; the method specifically comprises the following steps:
s701: loading the text classification model configuration, the text classification model dictionary and the classification pre-training word vectors, where the word vector dimension is denoted C_D, and the text classification model dictionary comprises C_K labels defining word classifications;
s702: converting the segmented text blocks into corresponding dictionary indexes character by character through the text classification model dictionary, converting the respective character index arrays of the C_N text blocks into C_N groups of second number sequences C_data2_in according to the preset maximum sequence length C_L, and constructing the respective second segment sequences C_data2_segment and second label sequences C_data2_label of the C_N groups of second number sequences C_data2_in;
s703: selecting C_M groups of second number sequences from the C_N groups of second number sequences according to the batch length C_M preset by the text classification model, combining each group of number sequences in the C_M groups with its corresponding second segment sequence C_data2_segment and second label sequence C_data2_label into an input matrix of C_M × (2 × C_L + C_K), used as the second input C_input2_ids of a single run of the text classification model; a plurality of second inputs C_input2_ids are generated when C_N > C_M, a second input C_input2_ids of the last run with a data amount less than C_M is generated when C_N is not an integer multiple of C_M or C_N < C_M, and the second inputs C_input2_ids are put into the text classification model in batches for classification;
s704: inputting each second input c_input2_ids of a single run into the text classification model, and performing the following processing: performing a linear operation on the second input C_input2_ids and the initial variables in the text classification model to obtain a hidden layer matrix C_mat1 of C_L × C_M; then operating the hidden layer matrix C_mat1, C_input2_ids and a random hidden layer matrix C_mat5 with the parameters in the text classification model to obtain hidden layer matrices C_mat2, C_mat3 and C_mat4 of C_L × C_M respectively; providing the hidden layer matrix C_mat4 for the second input C_input2_ids of the next run, and repeating this process with the hidden layer matrix C_mat4 replacing the hidden layer matrix C_mat5;
s705: c _ K abstracted eigenvector matrixes C _ mats1 are obtained by encoding and decoding the hidden layer matrix C _ mat1, the hidden layer matrix C _ mat2 and the hidden layer matrix C _ mat3, and preliminary extraction of vector characteristics is completed; step S705 specifically includes the following steps:
s7051: embedding the hidden layer matrix C _ mat1 and the hidden layer matrix C _ mat3 by text classification model word vectors to generate a first word embedding tensor C _ word _ embedding1 of C _ L _ C _ M _ C _ D, and obtaining a second word embedding tensor C _ embedding1 of C _ L _ C _ M _ C _ D through primary layer standardization;
s7052: encoding the hidden layer matrix C _ mat1, the hidden layer matrix C _ mat2 and the second word embedding tensor C _ embedding1 through a text classification model encoder to obtain an encoding result matrix C _ encode _ mat1 of C _ L _ C _ M _ C _ D;
s7053: and performing dimensionality reduction and abstraction processing on the encoding result matrix C _ encode _ mat1 in the step S7052 through a global maximum pooling layer to obtain C _ K feature vector matrices C _ mats1 of C _ L1.
S706: multiplying the feature vector matrices C_mats1 of S705 by the C_K weight matrices C_w_mats in the text classification model to obtain C_K hidden layer matrices C_mats2;
s707: connecting the hidden layer matrices C_mats2 of S706, then compressing and reducing dimensions to obtain a two-dimensional second result matrix C_mat_rst2 of C_M × C_K;
s708: compressing and splicing the second result matrix C_mat_rst2 into a one-dimensional second result array C_result of length C_M × C_K;
s709: according to the process of S704-S708, after processing, the text classification model integrates all second inputs C_input2_ids into a one-dimensional second result array C_results of C_N × C_K, where each element is the probability that a text block corresponds to one of the C_K classification labels.
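Reading the flat C_N × C_K probability array of s709 back into one predicted label per text block can be sketched as:

```python
def best_labels(c_results, c_k, labels):
    """Recover the most probable class label per text block from the
    flat C_N x C_K probability array produced in s709."""
    out = []
    for i in range(0, len(c_results), c_k):
        probs = c_results[i:i + c_k]          # probabilities of one block
        out.append(labels[probs.index(max(probs))])
    return out
```

With C_K = 2 and labels such as "key"/"value", the array [0.1, 0.9, 0.8, 0.2] decodes to one "value" block followed by one "key" block.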
S800: generating structured extraction input information according to the distance, the width and the height among the text blocks and the one-dimensional array output in S700, inputting the structured extraction input information into a structured extraction model, predicting the text blocks, and extracting key value pair data which accord with rules according to a prediction result; the method specifically comprises the following steps:
s801: traversing the S_N text block data before segmentation, recording the coordinates (x1, y1) and (x2, y2) of the two vertices of the main diagonal of each text block and the coordinate (x3, y3) of the midpoint of the main diagonal, and also recording the width and height of each text block;
s802: traversing the segmented C_N text block data; for each text block and every other text block, calculating the x-axis and y-axis distances dx1, dy1, dx3 and dy3 between the corresponding vertices and midpoints on the main diagonals, the height ratio and width ratio of the two text blocks, and the width-height ratio of the text block; forming a one-dimensional array e_r from these seven values, splicing the C_K probability values corresponding to the text block in the second result array C_results output by S709 to the front of e_r, and splicing the C_K probability values corresponding to the other text block currently participating in the calculation to the back of e_r, forming a one-dimensional array of length C_K × 2 + 7; after all text blocks are added, an array e_relations of C_N × (C_K × 2 + 7) is formed;
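The feature row assembled in s802 can be sketched as below; the exact definitions of the two size ratios and of the seventh value are assumptions where the translated description is ambiguous.

```python
def relation_features(a, b, probs_a, probs_b):
    """Build the per-pair input row described in s802: four diagonal
    distances, two size ratios and a width-height ratio, with the C_K
    class probabilities of both blocks spliced on. Each block is a
    dict with x1, y1, x2, y2 (main-diagonal vertices), w, h; the exact
    ratio definitions are assumptions where the patent is ambiguous.
    """
    mx_a, my_a = (a["x1"] + a["x2"]) / 2, (a["y1"] + a["y2"]) / 2
    mx_b, my_b = (b["x1"] + b["x2"]) / 2, (b["y1"] + b["y2"]) / 2
    e_r = [b["x1"] - a["x1"], b["y1"] - a["y1"],   # dx1, dy1
           mx_b - mx_a, my_b - my_a,               # dx3, dy3
           a["h"] / b["h"], a["w"] / b["w"],       # height / width ratios
           a["w"] / a["h"]]                        # width-height ratio
    return probs_a + e_r + probs_b                 # length C_K * 2 + 7
```

Each such row is one entry of e_relations; stacking one row per block pair yields the C_N × (C_K × 2 + 7) input of s804.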
s803: loading the prediction model configuration, the prediction model dictionary and the pre-training word vectors, where the word vector dimension is E_D; two classes of a text block, key or value, are defined in the prediction model as the final result of model processing;
s804: converting the segmented C_N text block data into corresponding dictionary indexes character by character through the prediction model dictionary, converting the character index arrays of the C_N text blocks into C_N groups of third number sequences E_data3_in according to the preset maximum sequence length E_L; taking the C_N groups of third number sequences, the set e_relations of S802 and the text block character arrays E_data_texts together as the prediction model input, and inputting the C_N groups of text block data into the prediction model in batches according to the batch length E_M preset by the text structured extraction model;
s805: mapping and converting the original character arrays of the text blocks into word vector matrices using the prediction model word vectors and the prediction model dictionary, and obtaining a group of hidden layer matrices of E_L × E_D through an LSTM unit;
s806: segmenting and reducing dimensions of the third number sequence E_data3_in into E_L scalars E_scalars, and performing linear operations on the E_L scalars E_scalars and the initial variables in the prediction model respectively to obtain E_L new scalars E_n_scalars and intermediate variables E_matcs; s806 specifically comprises the following steps:
s8061: taking one scalar E_scalar from the E_L scalars E_scalars, performing a linear operation on the scalar E_scalar and the initial variables of the prediction model to form a new scalar E_n_scalar, and expanding the dimensions of the hidden layer matrix of S805 to obtain a tensor E_tensor1 of E_L × 1 × E_D;
s8062: expanding the tensor E_tensor1 by copying according to the corresponding E_n_scalar into a tensor E_tensor2 of E_L × E_L × E_D;
s8063: expanding the dimensions of the hidden layer matrix of step S805 to obtain a tensor E_tensor3 of 1 × E_L × E_D, and expanding it again into a tensor E_tensor4 of E_L × E_L × E_D according to the processing method of step S8062;
s8064: connecting the two tensors E_tensor2 and E_tensor4 obtained in S8062 and S8063 into a tensor E_tensor5 of E_L × E_L × 2E_D;
s8065: performing an exponentiation operation on the scalar E_scalar and the variables in the prediction model, and combining the tensor E_tensor5 obtained in S8064 to obtain, through deformation, a matrix E_matc of E_L × E_L;
s8066: processing the other scalars among the E_L scalars according to the steps of S8061-S8065 respectively to obtain a group of intermediate variables E_matcs.
S807: deforming the intermediate variables E_matcs obtained in step S806 to obtain a matrix of E_L × E_D;
s808: initializing a kernel scalar, combining the kernel scalar and the parameters in the prediction model into an E_D × E_M parameter matrix, multiplying the E_D × E_M parameter matrix with the matrix obtained in S807, and adding a bias to obtain a matrix E_mat1 of E_L × E_M;
s809: activating the matrix E_mat1 obtained in step S808 through a linear rectification function, then arranging to obtain a group of E_L × E_M matrices E_mats1;
s810: processing the information of the set e_relations according to the same steps as S807-S808, then arranging to obtain a matrix E_mat2 of E_L × E_M;
s811: activating the matrix E_mat2 through a linear rectification function, then arranging to obtain a group of E_L × E_M matrices E_mats2;
s812: further deforming the group of matrices E_mats1 obtained in S809 and the group of matrices E_mats2 obtained in S811 to obtain a matrix of E_L × E_D, processing it according to the same steps as S808-S809, and arranging to obtain a matrix E_mat3 of E_L × E_M;
s813: activating the matrix E_mat3 through a linear rectification function, then arranging to obtain a result matrix E_mat_result of E_M × 2;
s814: taking the maximum value of the result matrix E_mat_result of step S813 to obtain the prediction results E_mat_results of the prediction model for the E_M text blocks; after the prediction model finishes processing all batches of data, a result array e_results of length C_N is output, where each element marks the predicted key/value type of the corresponding text block;
s815: combining the result array e_results with the original text block data, and extracting the key-value pair data that conforms to the rules through logical position judgment.
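The "logical position judgment" of s815 can be illustrated with a simple nearest-right/nearest-below pairing rule; this is one plausible reading of the step, not the patent's exact rule.

```python
def pair_key_values(blocks):
    """Pair each 'key' block with the nearest 'value' block to its
    right on the same line, falling back to the nearest one below.
    This positional rule is an illustrative reading of the 'logical
    position judgment' in s815, not the patent's exact rule.
    Each block: (text, kind, x, y) with kind 'key' or 'value'.
    """
    keys = [b for b in blocks if b[1] == "key"]
    values = [b for b in blocks if b[1] == "value"]
    pairs = {}
    for kt, _, kx, ky in keys:
        # same line: small y difference and located to the right
        right = [v for v in values if abs(v[3] - ky) < 5 and v[2] > kx]
        below = [v for v in values if v[3] > ky]
        cand = (min(right, key=lambda v: v[2] - kx) if right
                else min(below, key=lambda v: v[3] - ky) if below else None)
        if cand:
            pairs[kt] = cand[0]
    return pairs
```

For list-style vouchers the same-line rule dominates; for form-style layouts the fall-back to the nearest value below covers label-above-field cells.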
S900: processing the extracted structured data into a preset format and displaying it.
Example 3
The automatic structuring system for key information of a document image according to an embodiment of the invention is described in detail below with reference to the accompanying drawings.
The invention provides an automatic structuring system for key information of a document image, which uses any one of the above automatic structuring methods for key information of a document image to automatically structure and output the key information of a document image.
The automatic structuring of a sample file by the system comprises the following steps:
1. Click the upload button and upload the files to be processed; multiple files can be uploaded simultaneously. The display effect is shown in FIG. 2.
2. Select a file to be recognized in the file list page on the right and click the recognition button; recognition is performed using the automatic structuring method for key information of a document image described in embodiment 1. The display effect is shown in FIG. 3.
3. The effect after recognition is completed is shown in FIG. 4.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. An automatic structuring method for key information of document images is characterized by comprising the following steps:
s100: acquiring sample image data of a document;
s300: carrying out direction correction and gradient correction preprocessing on the sample image;
s400: identifying characters in the sample image by adopting optical character identification, and arranging the characters into a text form according to lines;
s500: preprocessing a text to obtain text data taking a text block as a unit;
s600: combining file data in units of text blocks with the model dictionary of the text segmentation model, converting each text block into a number sequence, obtaining the mask sequence, segment sequence and label sequence corresponding to each number sequence, inputting the number sequences into the machine learning model for processing, restoring the machine learning model output according to the mask sequences to obtain the processing result of each text block, and segmenting the text blocks according to the label sequences;
s700: classifying the segmented texts according to a text classification model dictionary, and integrating the classified texts into a one-dimensional array;
s800: generating structured extraction input information according to the distance, the width and the height among the text blocks and the one-dimensional array output in S700, inputting the structured extraction input information into a structured extraction model, predicting the text blocks, and extracting key value pair data which accord with rules according to a prediction result;
s900: and displaying the extracted structured data after the extracted structured data is subjected to preset format processing.
2. The method for automatically structuring key information of document images according to claim 1, wherein the step S100 comprises the steps of:
s101: reading file data of files in various file formats;
s102: the file is divided into single pages by setting the ID of each page of file data in the file, and then each single page is converted into image data.
3. The method for automatically structuring the key information of the document image as claimed in claim 1, further comprising S200 between S100 and S300, loading a general text recognition model, a text segmentation model, a text classification model, a text structured extraction model and configuration files thereof for text recognition, text segmentation, text classification and text structured extraction, respectively.
4. The method according to claim 3, wherein step S300 specifically comprises the following steps:
s301: judging whether the image is horizontal or vertical through layout analysis, judging whether the image is upright or inverted through optical character recognition, and making the directions of the images consistent through image rotation;
s302: calculating the image inclination angle using frame line information or text information; specifically, checking through the general text recognition model whether the inclination of the characters and frame lines conforms to a preset normal inclination range, calculating the inclination angle by comparison with the preset normal arrangement of the characters, and rotating the image according to the inclination angle to eliminate the image inclination.
5. The method according to claim 4, wherein step S500 specifically comprises:
s501: performing text analysis on the text after optical character recognition and arrangement, and performing the following processing on the contents of all text blocks, including:
restoring the word order through the relative positions;
removing some illegal characters;
clearing empty text blocks;
arranging the recognized texts in units of lines from left to right and from top to bottom according to the position information, and arranging the texts in units of cells if a table appears in between;
s502: further processing the line-sorted text data obtained in step S501, and performing block segmentation according to changes of the spacing between texts within a line, so as to obtain a group of text data in units of text blocks, where the number of text blocks is denoted S_N.
6. The method for automatically structuring key information of document images according to claim 5, wherein the step S600 specifically comprises the following steps:
s601: loading the text segmentation model configuration, the text segmentation model dictionary and the segmentation pre-training word vectors, where the word vector dimension is denoted S_D, and the text segmentation model dictionary comprises S_K labels defining word parts of speech;
s602: converting the S_N text blocks in the text data in units of text blocks obtained in S502 into corresponding dictionary indexes character by character through the text segmentation model dictionary, converting the character index array of each of the S_N text blocks into S_N groups of first number sequences S_data1_in according to the maximum sequence length S_L preset by the text segmentation model, and constructing the first mask sequence S_data1_mask, the first segment sequence S_data1_segment and the first label sequence S_data1_label of each of the S_N groups of first number sequences S_data1_in;
s603: sequentially selecting S_M groups of first number sequences from the S_N groups of first number sequences according to the batch length S_M preset by the text segmentation model, and combining each group of number sequences in the S_M groups with its corresponding first mask sequence S_data1_mask, first segment sequence S_data1_segment and first label sequence S_data1_label into a one-dimensional array of length S_M × S_L × 4, used as the first input S_input1_ids of a single run of the text segmentation model; a plurality of first inputs S_input1_ids are generated when S_N > S_M, and a first input S_input1_ids of the last run with a data amount less than S_M is generated when S_N is not an integer multiple of S_M or S_N < S_M;
s604: inputting each first input s_input1_ids of a single run into the text segmentation model, and performing the following processing: performing a linear operation on the first input S_input1_ids and the initial variables in the text segmentation model to obtain a hidden layer matrix S_mat1 of S_L × S_M; then operating the hidden layer matrix S_mat1, the first input S_input1_ids and a random hidden layer state matrix S_mat5 with the internal parameters of the text segmentation model to obtain hidden layer matrices S_mat2, S_mat3 and S_mat4 of S_L × S_M respectively; providing the hidden layer matrix S_mat4 for the first input S_input1_ids of the next run, and repeating this process with the hidden layer matrix S_mat4 replacing the hidden layer matrix S_mat5;
s605: performing text segmentation model word vector embedding and encoding/decoding on the hidden layer matrix S_mat1, the hidden layer matrix S_mat2 and the hidden layer matrix S_mat3 to obtain S_K abstracted feature vector matrices S_mats1, completing the preliminary extraction of vector features;
s606: multiplying the S_mats1 output in S605 by the S_K weight matrices S_w_mats in the text segmentation model to obtain S_K new hidden layer matrices S_mats2;
s607: connecting the hidden layer matrices S_mats2 of S606, then compressing and reducing dimensions to obtain a compressed matrix S_squeeze1 of S_L × S_K, further extracting vector features;
s608: processing the hidden layer matrices S_mats2 of step S606 and the compressed matrix S_squeeze1 of S607 through a recurrent neural network to obtain a tensor S_mat6 of S_L × (S_M-1) × S_K and a matrix S_mat7 of S_L × S_K;
s609: performing a maximum value dimensionality reduction operation on the matrix S_mat7 in combination with the dimension value S_D to obtain a one-dimensional vector S_expand of length S_M;
s610: processing the tensor S_mat6 through a recurrent neural network, then compressing and reducing dimensions to obtain a compressed matrix S_squeeze2 of S_L × (S_M-1);
s611: connecting the compressed matrix S_squeeze2 with the vector S_expand to obtain a first result matrix S_mat_rst1 of S_L × S_M;
s612: summing and reducing dimensions of the hidden layer matrix S_mat2 of S604 to obtain a number sequence of length S_L, and performing sequence inversion on the first result matrix S_mat_rst1 obtained in step S611 according to this number sequence to obtain a one-dimensional array S_mat1_result of length S_L × S_M;
s613: according to the process of S604-S612, after batch processing is completed, the text segmentation model integrates all first inputs S_input1_ids into a result array S_mat1_results of S_L × S_N, which is used as the output result of the text segmentation model;
s614: dividing the result array S_mat1_results output by the text segmentation model into S_N one-dimensional result arrays of length S_L, and restoring the processing result s_result of the corresponding text block one by one according to the values in the mask sequence S_data1_mask corresponding to each one-dimensional result array;
s615: judging the segmentation points of each text block according to the label values associated with the processing result s_result of each text block, completing the segmentation of each text block according to the positions of the segmentation points, and denoting the number of processed text blocks by C_N.
7. The method according to claim 6, wherein step S605 specifically comprises the following steps:
s6051: performing text segmentation model word vector embedding on the hidden layer matrix S_mat1 and the hidden layer matrix S_mat3 to generate a word embedding tensor S_word_embedding1 of S_L × S_M × S_D, and obtaining a word embedding tensor S_embedding1 of S_L × S_M × S_D through primary layer normalization;
s6052: encoding the hidden layer matrix S_mat1, the hidden layer matrix S_mat2 and the word embedding tensor S_embedding1 through the model encoder to obtain an encoding result S_encode_mat1 of S_L × S_M × S_D;
s6053: performing dimensionality reduction and abstraction on the tensor S_encode_mat1 encoded in S6052 through a pooling layer to obtain S_K feature vector matrices S_mats1 of S_L × 1.
8. The method according to claim 7, wherein step S700 comprises the following steps:
s701: loading text classification model configuration, a text classification model dictionary and classification pre-training word vectors, wherein the dimensionality of the word vectors is represented by C _ D, and the text classification model dictionary comprises C _ K labels for defining word classification;
S702: converting the segmented text blocks, one character at a time, into their corresponding dictionary indexes through the text classification model dictionary; converting the per-character index arrays of the C_N text blocks into C_N groups of second number sequences C_data2_in according to a preset maximum sequence length C_L; and constructing the respective second segment sequences C_data2_segment and second label sequences C_data2_label of the C_N groups of second number sequences C_data2_in;
S703: selecting C_M groups of second number sequences from the C_N groups according to the batch length C_M preset for the text classification model; combining each number sequence in the C_M groups with its corresponding second segment sequence C_data2_segment and second label sequence C_data2_label into an input matrix of C_M × (2C_L + C_K) that serves as the second input C_input2_ids of a single run of the text classification model; when C_N is not an integer multiple of C_M, or C_N < C_M, generating several second inputs C_input2_ids, the last of which holds fewer than C_M rows of data; and feeding the second inputs C_input2_ids into the text classification model in batches for classification;
S704: inputting the second input C_input2_ids of each single run into the text classification model and performing the following processing: performing a linear operation on the second input C_input2_ids and the initial variables of the text classification model to obtain a C_L × C_M hidden layer matrix C_mat1; operating on the parameters of the text classification model with the hidden layer matrix C_mat1, with C_input2_ids, and with a random hidden layer matrix C_mat5 to obtain, respectively, a C_L × C_M hidden layer matrix C_mat2, a hidden layer matrix C_mat3 and a hidden layer matrix C_mat4; supplying the hidden layer matrix C_mat4 to the second input C_input2_ids of the next run; and repeating this processing, with the hidden layer matrix C_mat4 replacing the hidden layer matrix C_mat5 in each pass;
S705: encoding and decoding the hidden layer matrix C_mat1, the hidden layer matrix C_mat2 and the hidden layer matrix C_mat3 to obtain C_K abstracted feature vector matrices C_mats1, completing the preliminary extraction of vector features;
S706: multiplying the feature vector matrices C_mats1 from S705 by the C_K weight matrices C_w_mats in the text classification model to obtain C_K hidden layer matrices C_mats2;
S707: connecting the hidden layer matrices C_mats2 from S706, then compressing and reducing their dimensionality to obtain a two-dimensional C_M × C_K second result matrix C_mats2;
S708: compressing and splicing the second result matrix C_mats2 into a one-dimensional second result array C_result of length C_M × C_K;
S709: following the procedure of S704 to S708, the text classification model processes all second inputs C_input2_ids and merges the outputs into a one-dimensional second result array C_results of length C_N × C_K, in which each element is the probability that a text block corresponds to one of the C_K classification labels.
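The batching scheme of S702–S703 can be sketched as follows. This is an illustrative Python sketch, not the patented implementation: the toy dictionary, the padding/unknown index values, and the helper names (encode_block, make_batches) are all assumptions, and the segment and label sequences are omitted for brevity.

```python
# Hypothetical sketch of S702-S703: map each character of a text block to its
# dictionary index, pad/truncate to a fixed length C_L, and group the rows
# into batches of at most C_M; the final batch may hold fewer rows when the
# block count is not a multiple of C_M.

def encode_block(text, dictionary, c_l, unk=1, pad=0):
    """Convert one text block to a fixed-length index sequence."""
    ids = [dictionary.get(ch, unk) for ch in text[:c_l]]
    return ids + [pad] * (c_l - len(ids))

def make_batches(texts, dictionary, c_l, c_m):
    """Return batches of encoded rows; the last batch may be smaller than c_m."""
    rows = [encode_block(t, dictionary, c_l) for t in texts]
    return [rows[i:i + c_m] for i in range(0, len(rows), c_m)]

dictionary = {"a": 2, "b": 3, "c": 4}          # toy character dictionary
batches = make_batches(["ab", "abc", "c", "ba", "b"], dictionary, c_l=4, c_m=2)
```

With five text blocks and a batch length of 2, this yields three batches, the last holding a single row, mirroring the "last run with a data amount smaller than C_M" case in S703.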
9. The method according to claim 8, wherein step S705 specifically includes the following steps:
s7051: embedding the hidden layer matrix C _ mat1 and the hidden layer matrix C _ mat3 by text classification model word vectors to generate a first word embedding tensor C _ word _ embedding1 of C _ L _ C _ M _ C _ D, and obtaining a second word embedding tensor C _ embedding1 of C _ L _ C _ M _ C _ D through primary layer standardization;
s7052: encoding the hidden layer matrix C _ mat1, the hidden layer matrix C _ mat2 and the second word embedding tensor C _ embedding1 through a text classification model encoder to obtain an encoding result matrix C _ encode _ mat1 of C _ L _ C _ M _ C _ D;
s7053: and performing dimensionality reduction and abstraction processing on the encoding result matrix C _ encode _ mat1 in the step S7052 through a global maximum pooling layer to obtain C _ K feature vector matrices C _ mats1 of C _ L1.
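As a rough illustration of the embed, normalize and pool pipeline of S7051–S7053, one might write the following numpy sketch. The encoder of S7052 is elided (replaced by the identity), and all shapes and names are assumptions rather than the claim's actual model.

```python
import numpy as np

# Illustrative sketch of S7051 and S7053: embed token indexes into C_D-dim
# vectors, apply layer normalization per token, then reduce with global max
# pooling over the sequence axis. The S7052 encoder step is omitted.

def layer_norm(x, eps=1e-5):
    """Normalize each vector along its last axis to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def embed_and_pool(ids, embedding):
    """ids: (batch, seq_len) int matrix; embedding: (vocab, C_D) lookup table."""
    word_embedding = embedding[ids]        # (batch, seq_len, C_D)
    normed = layer_norm(word_embedding)    # per-token layer normalization
    return normed.max(axis=1)              # global max pool -> (batch, C_D)

rng = np.random.default_rng(0)
embedding = rng.normal(size=(10, 8))                 # toy 10-word, C_D=8 table
ids = np.array([[1, 2, 3, 0], [4, 5, 0, 0]])         # two padded sequences
pooled = embed_and_pool(ids, embedding)
```

Global max pooling collapses the variable-length sequence axis into a fixed-size feature vector, which is what allows S706 to multiply the result against fixed-shape weight matrices.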
10. The method for automatically structuring key information of document images according to claim 9, wherein step S800 comprises the following steps:
s801: traversing S _ N text block data before segmentation, recording coordinate values (x1, y1), (x2, y2) and (x3, y3) of two vertexes of a main diagonal and a midpoint of the main diagonal of each text block, and simultaneously recording the width and the height of each text block;
s802: traversing the segmented C _ N text block data, calculating x-axis and y-axis distances dx1, dy1, dx3 and dy3 between two top points and middle points on a main diagonal corresponding to each text block and other text blocks, the height ratio and width ratio of the two text blocks and the width ratio of the text blocks, forming a one-dimensional array e _ r by the seven data, splicing C _ K probability values corresponding to the text blocks in a second result array C _ results output by S709 to the front end of the e _ r, splicing C _ K probability values corresponding to another text block currently participating in calculation to the rear end of the e _ r to form a one-dimensional array with the length of C _ K2 +7, and forming a one-dimensional array e _ relations of C _ N C _ K2 +7 after all the text blocks are added;
s803: loading a prediction model configuration, a prediction model dictionary and a pre-training word vector, wherein the dimension of the word vector is E _ D, and two types, keys or values of a text block are defined in the prediction model to be used as a final result of model processing;
s804: the segmented C _ N text block data are converted into corresponding dictionary indexes one by taking single words as a unit through a prediction model dictionary one by one, the single word index arrays of the C _ N text blocks are converted into C _ N groups of third digital sequences E _ data3_ in according to the preset sequence longest length E _ L, the C _ N groups of third digital sequences, the set E _ relations and the text block text character arrays E _ data _ texts in S802 are used as prediction model input together, the C _ N groups of text block data are input into the prediction model in batches according to the batch processing length E _ M preset by a text structured extraction model;
s805: mapping and converting the text block original character arrays into word vector matrixes by using prediction model word vectors and prediction model dictionaries, and obtaining a group of hidden layer matrixes of E _ L _ E _ D through an LSTM unit;
s806: segmenting and dimensionality-reducing a third digital sequence E _ data3_ in into E _ L scalar E _ scalars, and respectively performing linear operation on the E _ L scalar E _ scalars and initial variables in a prediction model to obtain E _ L new scalar E _ n _ scalars and intermediate variables E _ matcs;
s807: deforming the intermediate variable E _ matcs obtained in the step S806 to obtain a matrix of E _ L E _ D;
s808: initializing an inner core scalar, combining the inner core scalar and the parameters in the prediction model into an E _ D _ E _ M parameter matrix, performing AND operation on the E _ D _ E _ M parameter matrix and the matrix obtained in S807, and adding an offset bias to obtain an E _ L _ E _ M matrix E _ mat 1;
s809: activating the matrix E _ mat1 obtained in the step S808 through a linear rectification function, and then sorting to obtain a group of E _ L × E _ M matrices E _ mats 1;
s810: processing the information of the set E _ relations according to the same steps of S807-S808, and then sorting to obtain a matrix E _ mat2 of E _ L E _ M;
s811: activating the matrix E _ mat2 through a linear rectification function, and then sorting to obtain a group of E _ L × E _ M matrices E _ mats 2;
s812: further deforming the E _ mats1 obtained in the matrix group S809 and a group of matrices E _ mats2 obtained in S811 to obtain a matrix E _ L _ E _ D, processing the matrices according to the same steps of S808-S809, and finishing to obtain a matrix E _ mat3 of the E _ L _ E _ M;
s813: activating the matrix E _ mat3 through a linear rectification function, and then sorting to obtain a result matrix E _ mat _ result of E _ M × 2;
s814: taking the maximum value of the result matrix E _ mat _ result in the step S813 to obtain the prediction results E _ mat _ results of the prediction model for the E _ M text blocks, and outputting a result array E _ results with the length of C _ N after the prediction model finishes processing all batches of data, wherein each element is correspondingly marked with the key value prediction type of each text block;
s815: and combining the result array e _ results with the original text block data, and extracting key value pair data which accord with the rules through logical position judgment.
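The pairwise relation feature of S802 can be sketched as below. This is an assumed reading, not the patent's exact implementation: the claim wording names seven geometric values but lists only six distinctly, so the seventh is taken here to be the block's aspect ratio, and all field and function names are illustrative.

```python
# Hedged sketch of S802: seven geometric quantities relating two text blocks,
# with each block's C_K class probabilities spliced onto the front and back,
# giving a vector of length C_K * 2 + 7. The seventh geometric value (aspect
# ratio) is an assumption where the claim wording is ambiguous.

def relation_features(a, b, probs_a, probs_b):
    """a, b: dicts holding a main-diagonal vertex (x1, y1), the diagonal
    midpoint (x3, y3), and width w / height h of each text block."""
    dx1, dy1 = b["x1"] - a["x1"], b["y1"] - a["y1"]   # vertex offsets
    dx3, dy3 = b["x3"] - a["x3"], b["y3"] - a["y3"]   # midpoint offsets
    e_r = [dx1, dy1, dx3, dy3,
           a["h"] / b["h"],       # height ratio of the two blocks
           a["w"] / b["w"],       # width ratio of the two blocks
           a["w"] / a["h"]]       # aspect ratio (assumed seventh value)
    return list(probs_a) + e_r + list(probs_b)

a = {"x1": 0, "y1": 0, "x3": 5, "y3": 2, "w": 10, "h": 4}
b = {"x1": 20, "y1": 0, "x3": 25, "y3": 2, "w": 10, "h": 4}
vec = relation_features(a, b, probs_a=[0.9, 0.1], probs_b=[0.2, 0.8])
```

With C_K = 2 classification labels, the resulting vector has length 2 × 2 + 7 = 11, matching the per-pair row width of the e_relations set.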
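The final decoding of S814–S815 amounts to an arg-max over the two-column result matrix followed by a positional pairing rule. The sketch below uses a deliberately simple stand-in rule (pair each key with the nearest value block to its right on roughly the same line); the patent's actual "logical position judgment" is not specified at this level, and all names here are hypothetical.

```python
# Illustrative sketch of S814-S815: label each block as key (0) or value (1)
# by taking the arg-max of its result row, then pair each key with the
# nearest value block to its right on the same line. The pairing rule is an
# assumed stand-in for the claim's logical position judgment.

def extract_pairs(blocks, result_matrix):
    labels = [0 if row[0] >= row[1] else 1 for row in result_matrix]
    pairs = {}
    for i, blk in enumerate(blocks):
        if labels[i] != 0:          # only key blocks start a pair
            continue
        candidates = [b for j, b in enumerate(blocks)
                      if labels[j] == 1                 # value blocks only
                      and abs(b["y"] - blk["y"]) < 5    # same text line
                      and b["x"] > blk["x"]]            # to the right
        if candidates:
            nearest = min(candidates, key=lambda b: b["x"] - blk["x"])
            pairs[blk["text"]] = nearest["text"]
    return pairs

blocks = [{"text": "Name", "x": 0, "y": 0},
          {"text": "Alice", "x": 30, "y": 1},
          {"text": "Date", "x": 0, "y": 20},
          {"text": "2022-03-15", "x": 40, "y": 21}]
result_matrix = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3], [0.2, 0.8]]
pairs = extract_pairs(blocks, result_matrix)
```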
11. The method according to claim 10, wherein S806 specifically includes the following steps:
S8061: taking one scalar E_scalar from the E_L scalars E_scalars, performing a linear operation between it and the initial variables of the prediction model to form a new scalar E_n_scalar, and expanding the dimensionality of the hidden layer matrix from S805 to obtain a tensor E_tensor1 of E_L × 1 × E_D;
S8062: expanding the tensor E_tensor1, by copying according to the corresponding E_n_scalar, into a tensor E_tensor2 of E_L × E_L × E_D;
S8063: expanding the dimensionality of the hidden layer matrix from S805 to obtain a tensor E_tensor3 of 1 × E_L × E_D, and expanding it again into a tensor E_tensor4 of E_L × E_L × E_D following the processing of S8062;
S8064: concatenating the two tensors E_tensor2 and E_tensor4 obtained in S8062 and S8063 into a tensor E_tensor5 of E_L × E_L × 2E_D;
S8065: performing an exponentiation operation between the scalar E_scalar and the prediction model variables, combining the result with the E_L × E_L × 2E_D tensor E_tensor5 from S8064, and reshaping to obtain an E_L × E_L matrix E_matc;
S8066: processing the remaining scalars of the E_L scalars following steps S8061–S8065 to obtain the group of intermediate variables E_matcs.
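The expand-and-concatenate construction of S8061–S8064 is a standard pairwise-feature trick, sketched below in numpy. The tensor shapes follow the reconstruction above (E_tensor2 and E_tensor4 of E_L × E_L × E_D, E_tensor5 of E_L × E_L × 2E_D); the scalar operations of S8061 and S8065 are elided, and all names are illustrative.

```python
import numpy as np

# Illustrative sketch of S8061-S8064: expand an (E_L, E_D) hidden-layer
# matrix along two new axes and concatenate the broadcast copies, producing
# a pairwise tensor of shape (E_L, E_L, 2*E_D) in which entry (i, j) holds
# the hidden vectors of positions i and j side by side.

def pairwise_concat(hidden):
    e_l, e_d = hidden.shape
    t1 = hidden[:, None, :]                     # (E_L, 1, E_D), cf. E_tensor1
    t2 = np.broadcast_to(t1, (e_l, e_l, e_d))   # copy along axis 1, cf. E_tensor2
    t3 = hidden[None, :, :]                     # (1, E_L, E_D), cf. E_tensor3
    t4 = np.broadcast_to(t3, (e_l, e_l, e_d))   # copy along axis 0, cf. E_tensor4
    return np.concatenate([t2, t4], axis=-1)    # (E_L, E_L, 2*E_D), cf. E_tensor5

hidden = np.arange(6, dtype=float).reshape(3, 2)   # toy case: E_L=3, E_D=2
t5 = pairwise_concat(hidden)
```

Entry (i, j) of the result concatenates row i and row j of the hidden matrix, which is what lets S8065 score every ordered pair of sequence positions in a single E_L × E_L matrix.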
12. An automatic document image key information structuring system, characterized in that it performs automatic structured output of document image key information using the method for automatically structuring key information of a document image according to any one of claims 1 to 11.
CN202210249964.4A 2022-03-15 2022-03-15 Method and system for automatically structuring key information of document image Active CN114328845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210249964.4A CN114328845B (en) 2022-03-15 2022-03-15 Method and system for automatically structuring key information of document image

Publications (2)

Publication Number Publication Date
CN114328845A true CN114328845A (en) 2022-04-12
CN114328845B CN114328845B (en) 2022-06-21

Family

ID=81033798


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842483A (en) * 2022-06-27 2022-08-02 齐鲁工业大学 Standard file information extraction method and system based on neural network and template matching
CN116306573A (en) * 2023-03-15 2023-06-23 广联达科技股份有限公司 Intelligent analysis method, device and equipment for engineering practice and readable storage medium
CN117593752A (en) * 2024-01-18 2024-02-23 星云海数字科技股份有限公司 PDF document input method, PDF document input system, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929580A (en) * 2019-10-25 2020-03-27 北京译图智讯科技有限公司 Financial statement information rapid extraction method and system based on OCR
US10664921B1 (en) * 2018-06-27 2020-05-26 Red-Card Payment Systems, Llc Healthcare provider bill validation and payment
CN112686258A (en) * 2020-12-10 2021-04-20 广州广电运通金融电子股份有限公司 Physical examination report information structuring method and device, readable storage medium and terminal
CN112699234A (en) * 2020-12-08 2021-04-23 上海深杳智能科技有限公司 General document identification method, system, terminal and storage medium





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant