CN111461109A

CN111461109A - Method for identifying documents based on environment multi-type word bank

Info

Publication number: CN111461109A
Application number: CN202010122436.3A
Authority: CN
Inventors: 宣琦; 王冠华; 俞山青; 韩忙; 俞立
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2020-07-28
Anticipated expiration: 2040-02-27
Also published as: CN111461109B

Abstract

A method for identifying documents based on environment multi-type word stocks comprises the following steps: step 1: collecting electronic documents to construct a data set, and collecting professional vocabularies of various industries to construct a professional lexicon; step 2: preprocessing the document images in the data set; step 3: constructing an image enhancement model for enhancing and denoising the documents; step 4: constructing a character recognition model based on application scene matching correction; step 5: constructing a character recognition model based on professional word stock matching correction; step 6: constructing a character recognition model based on context matching correction; step 7: structuring the recognition result. The invention provides a method for identifying documents based on environment multi-type word stocks, which adopts deep learning algorithms represented by the convolutional neural network and the recurrent neural network, mainly targets electronic documents, and extracts high-level abstract attributes from original information such as the text of an electronic document by exploiting the correlation among the characters of the document.

Description

Method for identifying documents based on environment multi-type word bank

Technical Field

The invention relates to computer vision, network science and deep neural networks, and in particular to a method for identifying documents based on environment multi-type word stocks.

Background

From the Eastern Han dynasty to modern times, people have recorded information on paper: bank notes, business contracts, financial statements, citizens' personal files and the like are all recorded in paper documents. In the medical and health field, a large number of paper documents are generated while patients receive treatment. These paper documents later serve many purposes as important vouchers.

Owing to the need for data accumulation and to supervision requirements, demand for capturing the information in original documents is strong across industries. Limited by cost pressure, however, most existing practice only captures invoice information through Business Process Outsourcing (BPO), and the other information is often left as silent data. Handling these documents purely by hand has a number of problems: it is time-consuming and labor-intensive, extremely inefficient, and inconvenient for later statistics and retrieval. Under such circumstances, a method capable of rapidly recognizing documents is urgently needed.

Automatic document identification faces several difficulties: seals are common on documents and interfere with character recognition, so they need to be removed; scanned document images are generally tilted by a small angle, which seriously affects the segmentation of text blocks, so tilted images need to be corrected; and the varying clarity of scanned documents challenges recognition accuracy, so wrongly recognized words need to be detected and matched to the correct words.

In summary, conventional document identification still has many problems to be solved urgently: owing to factors such as the limited printing precision of the documents themselves, documents often suffer from misalignment, misplaced lines and surface stains, and there is as yet no effective solution to the challenges these pose to correct identification.

Disclosure of Invention

In view of the above, the invention provides a method for identifying documents based on environment multi-type word stocks, which adopts deep learning algorithms represented by the convolutional neural network and the recurrent neural network, mainly targets electronic documents, and extracts high-level abstract attributes from original information such as the text of an electronic document by exploiting the correlation among the characters of the document.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a method for identifying documents based on environment multi-type word stock comprises the following steps:

step 1: collecting electronic documents, constructing a data set, collecting professional vocabularies of various industries and constructing a professional lexicon;

step 2: preprocessing the document image in the data set;

step 3: constructing an image enhancement model for enhancing and denoising the document;

step 4: constructing a character recognition model based on application scene matching correction;

step 5: constructing a character recognition model based on professional word stock matching correction;

step 6: constructing a character recognition model based on context matching correction;

step 7: structuring the recognition result.

Further, in step 1, when the electronic documents are collected, the data set excludes items with obvious quality problems, namely items missing more than forty percent of their content or seriously overexposed; when the professional vocabularies of various industries are collected, repetition between industries is avoided as far as possible so as not to introduce redundant interference into the following steps.

Further, in the step 2, preprocessing the document in the data set includes the following steps:

2.1) removing the seal for the seal part contained in the document;

2.2) for the document image obtained from the previous step, using the transformation between the image space and the Hough space, mapping each straight line or curve of a given shape in the rectangular coordinate system of the image to a point in the Hough space so that a peak is formed, and finding the angle corresponding to the peak to perform image tilt correction.
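The peak search in step 2.2 can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `estimate_skew_degrees`, the ±15° search range, the 0.5° step and the one-pixel ρ bins are all assumptions, and the input is assumed to be a list of foreground (text or edge) pixel coordinates already extracted from the image.

```python
import math

def estimate_skew_degrees(points, angle_range=15.0, angle_step=0.5):
    """Estimate the skew of the dominant straight line through `points`.

    Each point (x, y) votes for the line rho = x*cos(theta) + y*sin(theta).
    The (angle, rho) cell that collects the most votes is the Hough peak,
    and its angle is returned as the estimated skew in degrees.
    """
    best_votes, best_skew = -1, 0.0
    steps = int(round(2 * angle_range / angle_step)) + 1
    for k in range(steps):
        skew = -angle_range + k * angle_step   # candidate direction of the line
        theta = math.radians(skew + 90.0)      # direction of the line's normal
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        votes = {}
        for x, y in points:
            rho = int(round(x * cos_t + y * sin_t))  # one-pixel rho bins (assumed)
            votes[rho] = votes.get(rho, 0) + 1
        peak = max(votes.values())
        if peak > best_votes:
            best_votes, best_skew = peak, skew
    return best_skew

# Synthetic example: 200 points on a line tilted by 5 degrees.
tilted = [(float(x), 100.0 + x * math.tan(math.radians(5.0))) for x in range(200)]
print(estimate_skew_degrees(tilted))
```

Rotating the image by the negative of the estimated angle would then correct the tilt; the sign convention depends on whether the image y-axis points up or down.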

In the step 3, an image enhancement model for enhancing and denoising the document is constructed, and the method comprises the following processes:

3.1) the input picture is downsampled through a series of convolution layers, and the lower-level features are extracted, wherein the formula for extracting the features is as follows:

where $i = 1, 2, \ldots, n$ is the index of each convolutional layer, $c$ and $c'$ are the channel indices of the convolutional layers, $\omega^i$ is the convolution kernel weight matrix, $b^i$ is the bias, and $\sigma$ is the activation function;

3.2) feeding the features extracted in the previous step into two convolution layers for further processing to extract local features;

3.3) feeding the features extracted in 3.1) into a network consisting of two convolution layers with a stride of 2 and three fully connected layers to extract global features;

3.4) fusing the local features and the global features through a point-wise affine transformation, then up-sampling the fused features and applying an affine transformation to obtain the final output.
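The downsampling in step 3.1 can be illustrated with a single strided convolution layer. The following is a minimal sketch and not the patent's model: the stride of 2, the absence of padding, the choice of ReLU for the activation σ and the function name `conv2d_strided` are assumptions made for illustration, and the layer formula itself is not reproduced in this text.

```python
def conv2d_strided(feat, weights, bias, stride=2):
    """Apply one strided convolution layer followed by ReLU.

    feat:    input features, indexed [c'][y][x]
    weights: kernel weights, indexed [c][c'][dy][dx]
    bias:    one bias value per output channel c
    """
    c_in, height, width = len(feat), len(feat[0]), len(feat[0][0])
    k = len(weights[0][0])                       # square kernel size
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    out = []
    for c, kernel in enumerate(weights):
        plane = []
        for y in range(out_h):
            row = []
            for x in range(out_w):
                acc = bias[c]
                for cp in range(c_in):           # sum over input channels c'
                    for dy in range(k):
                        for dx in range(k):
                            acc += kernel[cp][dy][dx] * feat[cp][stride * y + dy][stride * x + dx]
                row.append(max(0.0, acc))        # ReLU assumed as the activation
            plane.append(row)
        out.append(plane)
    return out

# A 1-channel 4x4 input of ones, one 2x2 kernel of ones, bias 0.5.
feat = [[[1.0] * 4 for _ in range(4)]]
weights = [[[[1.0, 1.0], [1.0, 1.0]]]]
print(conv2d_strided(feat, weights, [0.5]))  # 4x4 input becomes a 2x2 output
```

Stacking several such layers halves the spatial resolution each time, which is the sense in which the input picture is downsampled while features are extracted.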

In the step 4, constructing a character recognition model based on application scene matching correction includes the following processes:

4.1) extracting base-network features of size W × H × C (width × height × channels); further extracting features with a sliding window over the feature map obtained in the previous step and using them to predict a number of candidate target regions; feeding the resulting features into a bidirectional LSTM, and feeding its output into a fully connected layer to obtain dense target text proposals; suppressing redundant noise interference; and merging the resulting text segments into text lines with a text-line construction method;

4.2) after preprocessing the text, constructing a node association graph according to the general rules of the industry, voting on the different nodes, and calculating a voting value (PRL), wherein the calculation formula is as follows:

where $N_u$ denotes the u-th node in the association graph, $r_u$ denotes the similarity of $N_u$ to all nodes in the graph, $d$ is the damping coefficient, generally taken as 0.85, $e$ denotes the number of incoming edges of $N_u$, and $C(N_v)$ denotes the degree of node $N_v$;

4.3) performing a random walk on the association graph according to the above formula, and iteratively calculating the PRL value of each node until the termination formula is satisfied, so that the node scores reach a converged state, wherein the termination formula is as follows:

where $N$ is the number of nodes, $t$ is the number of random-walk steps, and $\sigma$ is the random-walk termination threshold, commonly taken as $\sigma = 10^{-4}$;

4.4) sorting the PRL scores in descending order and then performing semantic clustering, and finally taking all cluster centers as the keyword extraction result of the text;

4.5) classifying the different document texts according to the keywords, in preparation for matching in the next step.
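The iterative scoring of steps 4.2 and 4.3 can be sketched as follows. Because the voting and termination formulas are not reproduced in this text, the sketch assumes the standard TextRank-style update $r_u = (1 - d) + d \sum_v r_v / C(N_v)$ over the neighbours of $N_u$, and assumes that iteration stops when the mean absolute score change falls below σ. The function name `rank_nodes` and the example graph are hypothetical.

```python
def rank_nodes(neighbors, d=0.85, sigma=1e-4, max_steps=1000):
    """Score the nodes of an undirected association graph.

    neighbors: dict mapping each node to the list of nodes it is linked to.
    d:         damping coefficient (0.85, as in the text).
    sigma:     termination threshold (1e-4, as in the text).
    """
    nodes = list(neighbors)
    score = {u: 1.0 for u in nodes}
    for _ in range(max_steps):
        new_score = {}
        for u in nodes:
            # Each neighbour v shares its score equally among its own links.
            incoming = sum(score[v] / len(neighbors[v]) for v in neighbors[u])
            new_score[u] = (1 - d) + d * incoming
        # Assumed stopping rule: mean absolute change below sigma.
        change = sum(abs(new_score[u] - score[u]) for u in nodes) / len(nodes)
        score = new_score
        if change < sigma:
            break
    return score

# Hypothetical graph: "invoice" co-occurs with three other terms.
graph = {
    "invoice": ["amount", "tax", "date"],
    "amount": ["invoice"],
    "tax": ["invoice"],
    "date": ["invoice"],
}
scores = rank_nodes(graph)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # the most connected term ranks first
```

The semantic clustering of step 4.4 is not shown; it would operate on the nodes after they are sorted by score.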

In the step 5, the method for constructing the character recognition model based on the matching correction of the professional lexicon comprises the following processes:

5.1) matching the documents separated in the step 4 with corresponding professional word banks according to different types;

5.2) since each document is associated with professional word stocks, the similarity of the candidate words is calculated for each word stock and given a different weight; the matching formula is derived as follows:

where $P(w)$ is the probability of the word $w$ occurring in this sequence. Because the probability value ranges from 0 to 1, the relationship between the data is not easy to find; to allow better statistical inference, the probability $P(w)$ needs to undergo a data transformation, and the transformed probability score is $P_F$. $G_1$ serves to turn the logarithm from negative to positive, ensuring the non-negativity of the optimized probability score; $G_2$ serves to keep the probability score between 0 and 1; in this formula $G_2 = G_1 + 1$;

Candidate words matched in different word stocks have different likelihoods, so a weight coefficient $\theta$ is introduced to improve the ranking of the candidate words; the overall ranking score of a candidate word is calculated as follows:

$$ES = \theta \cdot P_F \tag{5}$$

The higher the similarity, the larger $\theta$. The formula for $\theta$ is as follows:

where $LD_P$ is the glyph (character-shape) similarity between the originally recognized word and the candidate word, $LD_w$ is the pronunciation similarity between the originally recognized word and the candidate word, and $h$ is a reward-and-penalty coefficient in the range $0 < h \le 1$; the higher the similarity, the closer $h$ is to 1;

5.3) passing the five candidate words with the highest ES values to the next step.
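The ranking in steps 5.2 and 5.3 reduces to scoring each candidate with formula (5) and keeping the five best. The sketch below assumes that θ and P_F have already been computed for every candidate, since their formulas are not reproduced in this text; the function name `top_candidates` and the sample words and values are hypothetical.

```python
def top_candidates(candidates, k=5):
    """Rank candidate words by ES = theta * P_F and keep the k best.

    candidates: list of (word, theta, p_f) tuples, where theta is the
                weight coefficient and p_f the transformed probability score.
    """
    scored = [(word, theta * p_f) for word, theta, p_f in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical candidates for one misrecognized word.
candidates = [
    ("amoxicillin", 0.9, 0.8),
    ("amoxapine", 0.6, 0.7),
    ("ampicillin", 0.5, 0.9),
]
for word, es in top_candidates(candidates, k=2):
    print(word, round(es, 2))
```

With the default `k=5`, the returned list corresponds to the words passed on to the context matching of step 6.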

In the step 6, the method for constructing the text recognition model based on the context matching correction comprises the following processes:

6.1) performing context matching on the result of the previous step to identify the characters, wherein the vector representation is calculated as follows:

$$v(w_i) = \arg\max_j \, \mathrm{sim}\big(v_{context}(w_i),\, w(i, j)\big) \tag{7}$$

where $w_i$ is the input word, $v_{context}(w_i)$ is the context vector, and $w(i, j)$ is the multi-prototype vector representation; after $v(w_i)$ is obtained, the global similarity is calculated and the word with the highest similarity is taken as the final matching result;

6.2) replacing the original word in the text with the final matching result, and outputting the final matched text.
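Formula (7) selects, for each word, the prototype vector that best agrees with its context. A minimal sketch follows, assuming that sim is cosine similarity (the text does not name the similarity measure); the function names and the example vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed form of sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_prototype(context_vec, prototypes):
    """Return (j, w(i, j)) for the prototype most similar to the context vector."""
    best_j = max(range(len(prototypes)),
                 key=lambda j: cosine(context_vec, prototypes[j]))
    return best_j, prototypes[best_j]

# Hypothetical word with three prototype vectors and one context vector.
prototypes = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
context = [1.0, 0.1]
index, vector = select_prototype(context, prototypes)
print(index, vector)
```

The selected vector would then be used in the global similarity calculation that picks the final matching word among the candidates from step 5.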

In the step 7, the identification result is structured, which includes the following processes:

7.1) summarizing the identified document texts, unifying coding formats and outputting the documents in Word format;

7.2) reading the Word content, processing it according to set rules, and storing it in a csv file in structured form.
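Step 7.2 can be sketched with the Python standard library. The text does not specify the "set rules", so the rule used here (each line of the form `field: value` becomes one row) is purely an assumption, as are the function name `lines_to_csv` and the sample lines. The sketch also assumes the text has already been read out of the Word file into a list of strings.

```python
import csv
import io

def lines_to_csv(lines, sep=":"):
    """Convert 'field: value' lines into csv text with a header row.

    Lines without the separator are skipped (assumed rule).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["field", "value"])
    for line in lines:
        if sep not in line:
            continue
        field, value = line.split(sep, 1)   # split on the first separator only
        writer.writerow([field.strip(), value.strip()])
    return buffer.getvalue()

# Hypothetical recognized text from one document.
recognized = ["Patient: Zhang San", "Total: 128.50", "----"]
print(lines_to_csv(recognized))
```

Writing the returned string to a file opened with a single, fixed encoding would also cover the unified coding format mentioned in step 7.1.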

The beneficial effects of the invention are as follows: deep learning algorithms represented by the convolutional neural network and the recurrent neural network are adopted, the method mainly targets electronic documents, and high-level abstract attributes are extracted from original information such as the text of an electronic document by exploiting the correlation among the characters of the document.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flowchart of an embodiment of the present invention for constructing an image enhancement model for enhancing and denoising a document;

FIG. 2 is a flowchart for constructing a character recognition model based on application scene matching correction according to an embodiment of the present invention;

FIG. 3 is a flowchart for constructing a character recognition model based on professional word stock matching correction according to an embodiment of the present invention;

FIG. 4 is a flowchart of the method for identifying documents based on environment multi-type word stocks.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For the convenience of understanding the embodiment, a method for identifying documents by multiple word stocks based on environment disclosed by the embodiment of the invention is described in detail first.

Referring to fig. 1 to 4, a method for identifying documents based on environment multi-type word stocks includes the following steps:

step 2: preprocessing the document image in the data set;

step 7: structuring the recognition result.

Further, in step 1, when the electronic documents are collected, the data set excludes items with obvious quality problems, such as items missing more than forty percent of their content or seriously overexposed; when the professional vocabularies of various industries are collected, repetition between industries is avoided as far as possible so as not to introduce redundant interference into the following steps.

2.1) removing the seal, by common means, from documents that contain a seal;

3.4) fusing the local features and the global features through a point-wise affine transformation, then up-sampling the fused features and applying an affine transformation to obtain the final output.

5.2) since each document is associated with professional word stocks, the similarity of the candidate words is calculated for each word stock and given a different weight. The matching formula is derived as follows:

$$ES = \theta \cdot P_F \tag{5}$$

the higher the similarity, the larger θ. The formula for θ is as follows:

$$v(w_i) = \arg\max_j \, \mathrm{sim}\big(v_{context}(w_i),\, w(i, j)\big) \tag{7}$$

where $w_i$ is the input word, $v_{context}(w_i)$ is the context vector, and $w(i, j)$ is the multi-prototype vector representation; after $v(w_i)$ is obtained, the global similarity is calculated and the word with the highest similarity is taken as the final matching result;

7.1) summarizing the identified document texts, unifying the coding format and outputting the document texts in Word format.

In the medical document identification method provided by the embodiment of the present invention, the instruction included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, and details are not described here.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for identifying documents based on environment multi-type word stock is characterized by comprising the following steps:

step 2: preprocessing the document image in the data set;

step 7: structuring the recognition result.

2. The method for identifying documents based on environment multi-type word stocks according to claim 1, characterized in that: in step 1, when the electronic documents are collected, the data set excludes items with obvious quality problems, namely items missing more than forty percent of their content or seriously overexposed; when the professional vocabularies of various industries are collected, repetition of professional vocabulary between industries is avoided as far as possible so as not to introduce redundant interference into the following steps.

3. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in the step 2, the preprocessing of the documents in the data set includes the following processes:

2.1) removing the seal for the seal part contained in the document;

4. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in the step 3, an image enhancement model for enhancing and denoising the document is constructed, and the method comprises the following processes:

5. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in the step 4, constructing a character recognition model based on application scene matching correction includes the following processes:

4.1) extracting basic network features, wherein the size is W x H x C, further extracting the features on the features obtained in the previous step by using a sliding window, and predicting a plurality of target regions to be selected by using the features;

where $N_u$ denotes the u-th node in the association graph, $r_u$ denotes the similarity of $N_u$ to all nodes in the graph, $d$ denotes the damping coefficient, $e$ denotes the number of incoming edges of $N_u$, and $C(N_v)$ denotes the degree of node $N_v$;

wherein N is the number of nodes, t is the number of random walk steps, and sigma is the random walk termination threshold;

6. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in the step 5, the method for constructing the character recognition model based on the matching correction of the professional lexicon comprises the following processes:

where $P(w)$ is the probability of the word $w$ occurring in the sequence; because the probability value ranges from 0 to 1, the relationship between the data is not easy to find, so for better statistical inference the probability $P(w)$ needs to undergo a data transformation, the transformed probability score being $P_F$; $G_1$ serves to turn the logarithm from negative to positive, ensuring the non-negativity of the optimized probability score; $G_2$ serves to keep the probability score between 0 and 1; in this formula $G_2 = G_1 + 1$;

$$ES = \theta \cdot P_F \tag{5}$$

the higher the similarity, the larger $\theta$, and the calculation formula of $\theta$ is as follows:

where $LD_P$ is the glyph (character-shape) similarity between the originally recognized word and the candidate word, $LD_w$ is the pronunciation similarity between the originally recognized word and the candidate word, and $h$ is a reward-and-penalty coefficient in the range $0 < h \le 1$; the higher the similarity, the closer $h$ is to 1;

7. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in the step 6, the method for constructing the text recognition model based on the context matching correction comprises the following processes:

$$v(w_i) = \arg\max_j \, \mathrm{sim}\big(v_{context}(w_i),\, w(i, j)\big) \tag{7}$$

6.2) putting the final matching result into the text and outputting the final matching text.

8. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in the step 7, the identification result is structured, which includes the following processes: