CN111461109A - Method for identifying documents based on environment multi-type word bank - Google Patents

Method for identifying documents based on environment multi-type word bank

Info

Publication number
CN111461109A
Authority
CN
China
Prior art keywords
word
constructing
features
professional
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010122436.3A
Other languages
Chinese (zh)
Other versions
CN111461109B (en)
Inventor
宣琦
王冠华
俞山青
韩忙
俞立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202010122436.3A
Publication of CN111461109A
Application granted
Publication of CN111461109B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/268 Lexical context

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)

Abstract

A method for identifying documents based on environment multi-type word stock comprises the following steps: step 1: collecting electronic documents, constructing a data set, collecting professional vocabularies of various industries and constructing a professional lexicon; step 2: preprocessing the document images in the data set; step 3: constructing an image enhancement model for enhancing and denoising the documents; step 4: constructing a character recognition model based on application scene matching correction; step 5: constructing a character recognition model based on professional lexicon matching correction; step 6: constructing a character recognition model based on context matching correction; and step 7: structuring the recognition result. The invention provides a method for identifying documents based on environment multi-type word stock, which adopts deep learning algorithms represented by convolutional neural networks and recurrent neural networks, is aimed mainly at electronic documents, and extracts high-level abstract attributes from raw information such as the text of the electronic documents by exploiting the correlation among document characters.

Description

Method for identifying documents based on environment multi-type word bank
Technical Field
The invention relates to computer vision, network science and deep neural networks, and in particular to a method for identifying documents based on environment multi-type word stock.
Background
From the Eastern Han dynasty to modern times, people have recorded information on paper: bank notes, business contracts, financial statements, citizens' personal files and the like are all recorded as paper documents. In the medical and health field, a large number of paper documents are generated while patients receive treatment. These paper documents later serve many purposes as important vouchers.
Owing to the need for data accumulation and to regulatory requirements, demand for capturing information from original documents is strong across industries. Limited by cost pressure, however, most existing practice captures only invoice information through Business Process Outsourcing (BPO), and the information in other documents often becomes dormant data. Handling these documents purely by hand has many problems: it is time-consuming and laborious, extremely inefficient, and inconvenient for later statistics and retrieval. Under such circumstances, a method capable of rapidly recognizing documents is urgently required.
Automatic document identification faces several obstacles: seals are common on documents and interfere with character recognition, so they need to be removed; scanned document images generally have a small-angle tilt, which seriously affects text-block segmentation, so tilted images need to be corrected; and scans vary in sharpness, which challenges recognition accuracy, so recognized words need to be checked for errors and matched against the correct words.
In summary, conventional document identification has many problems that urgently need solving: owing to factors such as the limited printing precision of the documents themselves, documents are prone to misalignment, misplaced lines and surface stains, and no effective solution exists for the challenges these pose to correct identification.
Disclosure of Invention
In view of the above, the invention provides a method for identifying documents based on environment multi-type word stock, which adopts deep learning algorithms represented by convolutional neural networks and recurrent neural networks, is aimed mainly at electronic documents, and extracts high-level abstract attributes from raw information such as the text of the electronic documents by exploiting the correlation among document characters.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for identifying documents based on environment multi-type word stock comprises the following steps:
step 1: collecting electronic documents, constructing a data set, collecting professional vocabularies of various industries and constructing a professional lexicon;
step 2: preprocessing the document image in the data set;
step 3: constructing an image enhancement model for enhancing and denoising the document;
step 4: constructing a character recognition model based on application scene matching correction;
step 5: constructing a character recognition model based on professional word stock matching correction;
step 6: constructing a character recognition model based on context matching correction;
step 7: carrying out structuring processing on the recognition result.
Further, in the step 1, electronic documents are collected; the collected data excludes items with obvious quality problems, such as those missing more than forty percent of their content or seriously overexposed. Professional vocabularies of various industries are collected; the collected vocabularies avoid repetition among industries as far as possible, so as not to introduce redundant interference into the following steps.
Further, in the step 2, preprocessing the document in the data set includes the following steps:
2.1) removing any seal contained in the document;
2.2) using the transformation between the image space and the Hough space on the document image processed in the previous step, mapping each curve or straight line of a given shape in the rectangular coordinate system of the image to a point in the Hough space so that a peak is formed, and finding the angle corresponding to the peak to correct the image tilt.
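By way of illustration only, the tilt estimation of step 2.2) might be sketched as follows. The sketch assumes a binarized image in which text pixels are nonzero, restricts the search to straight lines within a hypothetical ±10° range, and uses only NumPy; the function names `estimate_skew_angle` and `make_skewed_lines` are illustrative and do not come from the patent. For each candidate angle, every foreground pixel votes for a distance bin, and the angle whose accumulator shows the highest peak is taken as the tilt.

```python
import numpy as np

def estimate_skew_angle(binary, max_angle=10.0, step=0.1):
    """Return the dominant line angle in degrees (image coordinates, y pointing down)."""
    ys, xs = np.nonzero(binary)                # coordinates of foreground (ink) pixels
    if xs.size == 0:
        return 0.0                             # nothing to vote with
    diag = int(np.ceil(np.hypot(*binary.shape)))
    best_angle, best_peak = 0.0, -1
    for angle in np.arange(-max_angle, max_angle + step, step):
        t = np.deg2rad(angle)
        # Signed distance from the origin to the line through (x, y) with direction `angle`.
        rho = ys * np.cos(t) - xs * np.sin(t)
        # One column of the Hough accumulator: votes per distance bin for this angle.
        votes = np.bincount(np.round(rho).astype(int) + diag, minlength=2 * diag + 1)
        if votes.max() > best_peak:
            best_angle, best_peak = float(angle), int(votes.max())
    return best_angle

def make_skewed_lines(angle_deg, shape=(200, 300)):
    """Synthetic test image (an assumption for the demo): three thin lines at a given angle."""
    img = np.zeros(shape, dtype=bool)
    xs = np.arange(shape[1])
    for y0 in (40, 80, 120):
        ys = np.round(y0 + xs * np.tan(np.deg2rad(angle_deg))).astype(int)
        img[ys, xs] = True
    return img
```

Rotating the image by the negative of the returned angle would then perform the correction; the rotation itself is omitted here because the source does not specify how it is carried out.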
In the step 3, an image enhancement model for enhancing and denoising the document is constructed, and the method comprises the following processes:
3.1) the input picture is downsampled through a series of convolution layers, and the lower-level features are extracted, wherein the formula for extracting the features is as follows:
Figure RE-GDA0002503921240000021
where $i = 1, 2, \ldots, n$ is the index of each convolutional layer, $c$ and $c'$ are the channel indices of the convolutional layers, $\omega_i$ is the convolution kernel weight matrix, $b_i$ is the bias, and $\sigma$ is the activation function;
3.2) inputting the features extracted in the previous step into two convolutional layers for further processing, and extracting local features;
3.3) inputting the features extracted in 3.1) into a network consisting of two convolutional layers with a stride of 2 and three fully connected layers, and extracting global features;
3.4) fusing the local features and the global features through a point-wise affine transformation, up-sampling the fused features, and applying an affine transformation to obtain the final output.
In the step 4, constructing a character recognition model based on application scene matching correction includes the following processes:
4.1) extracting base network features of size W × H × C (width × height × channels); further extracting features from them with a sliding window and using these features to predict a number of candidate target regions; inputting the resulting features into a bidirectional LSTM and feeding its output into a fully connected layer to obtain dense target text; suppressing redundant noise interference; and merging the resulting text segments into text lines with a text-line construction method;
4.2) after preprocessing the text, constructing a node association graph according to the general rules of the industry, voting on the different nodes, and calculating a voting value (PRL) with the following formula:
Figure RE-GDA0002503921240000031
where $N_u$ denotes the $u$-th node in the association graph, $r_u$ denotes the degree of similarity between $N_u$ and all nodes in the graph, $d$ is the damping coefficient, generally taken as 0.85, $e$ is the number of incoming edges of $N_u$, and $C(N_v)$ is the degree of node $N_v$;
4.3) performing a random walk on the association graph according to the above formula and iteratively calculating the PRL value of each node until the termination formula is satisfied, at which point the node scores have converged; the termination formula is as follows:
Figure RE-GDA0002503921240000032
where $N$ is the number of nodes, $t$ is the number of random-walk steps, and $\sigma$ is the random-walk termination threshold, commonly $\sigma = 10^{-4}$;
4.4) sorting the PRL scores in descending order, then performing semantic clustering, and finally taking all cluster centers as the keyword extraction result of the text;
4.5) classifying the different document texts according to the keywords, and proceeding to the next matching step.
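As a hedged illustration of steps 4.2) to 4.4), the iterative scoring can be sketched in the style of PageRank. Because the formula images are not reproduced in the text, the update rule r_u = (1 − d) + d · Σ r_v / C(N_v) over the incoming neighbours of N_u, and the stopping rule "sum of absolute score changes below σ", are assumptions chosen to be consistent with the symbol definitions above; the function name `prl_scores` and the example words are hypothetical. The semantic clustering of step 4.4) is not shown.

```python
def prl_scores(edges, d=0.85, sigma=1e-4, max_steps=1000):
    """Iteratively score the nodes of a directed graph given as (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    incoming = {n: [] for n in nodes}
    degree = {n: 0 for n in nodes}            # C(N_v): number of outgoing edges of N_v
    for v, u in edges:
        incoming[u].append(v)
        degree[v] += 1
    r = {n: 1.0 for n in nodes}
    for _ in range(max_steps):
        new_r = {u: (1 - d) + d * sum(r[v] / degree[v] for v in incoming[u])
                 for u in nodes}
        change = sum(abs(new_r[n] - r[n]) for n in nodes)
        r = new_r
        if change < sigma:                    # assumed termination condition
            break
    return r

def undirected(pairs):
    """Turn undirected associations into edges in both directions."""
    return [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs]

# Hypothetical association graph built from words of one document.
graph = undirected([("invoice", "amount"), ("invoice", "date"),
                    ("invoice", "tax"), ("amount", "tax")])
scores = prl_scores(graph)
ranking = sorted(scores, key=scores.get, reverse=True)   # descending PRL order
```

In this toy graph the most connected word ranks first, which is the behaviour one would expect from a voting scheme of this kind.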
In the step 5, the method for constructing the character recognition model based on the matching correction of the professional lexicon comprises the following processes:
5.1) matching the documents classified in the step 4 with the corresponding professional lexicons according to their types;
5.2) since a document is related to a professional lexicon, the similarity of the words in question is calculated for each lexicon and given different weights; the matching formula is derived as follows:
Figure RE-GDA0002503921240000033
where $P(w)$ is the probability of the word $w$ occurring in this sequence. Since the probability value ranges from 0 to 1, relationships within the data are not easy to find; to support statistical inference on the data, the probability $P(w)$ is therefore transformed, and the transformed probability score is $P_F$. The role of $G_1$ is to shift the logarithm from negative to positive so that the optimized probability score is non-negative, and the role of $G_2$ is to ensure that the probability score lies between 0 and 1; in this formula $G_2 = G_1 + 1$;
Candidate words matched in different lexicons have different likelihoods, so a weight coefficient $\theta$ is introduced to improve the ranking of the candidate words; the overall ranking score of a candidate word is calculated as follows:
$ES = \theta \cdot P_F \quad (5)$
The higher the similarity, the larger $\theta$. The formula for $\theta$ is as follows:
Figure RE-GDA0002503921240000041
where $LD_P$ is the glyph similarity between the originally recognized word and the candidate word, $LD_w$ is the pronunciation similarity between the originally recognized word and the candidate word, and $h$ is a reward-and-penalty coefficient with range $0 < h \le 1$; the higher the similarity, the closer $h$ is to 1;
5.3) the five candidate words with the highest ES values are passed to the next step.
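The ranking of steps 5.2) and 5.3) can be illustrated with a minimal sketch. It takes the transformed probability score P_F and the weight θ of each candidate as given inputs, since the formulas that produce them are not reproduced in the text, and applies only formula (5), ES = θ · P_F, followed by selection of the five highest scores. The `Candidate` type, the function name `top_candidates` and the example words and values are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class Candidate(NamedTuple):
    word: str
    p_f: float     # transformed probability score P_F, assumed to lie in [0, 1]
    theta: float   # weight coefficient θ from the glyph/pronunciation similarity

def top_candidates(cands: List[Candidate], k: int = 5) -> List[Tuple[float, str]]:
    """Score each candidate with ES = θ · P_F and keep the k best."""
    scored = [(c.theta * c.p_f, c.word) for c in cands]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Hypothetical candidates from a medical lexicon for one recognized word.
candidates = [
    Candidate("amoxicillin", 0.9, 0.8),
    Candidate("amoxicilin", 0.5, 0.9),
    Candidate("ampicillin", 0.7, 0.5),
    Candidate("amikacin", 0.6, 0.3),
    Candidate("aspirin", 0.4, 0.2),
    Candidate("atenolol", 0.3, 0.1),
]
best_five = [word for _, word in top_candidates(candidates)]
```

Keeping several candidates rather than a single best one leaves room for the context matching of the next step to make the final choice.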
In the step 6, the method for constructing the text recognition model based on the context matching correction comprises the following processes:
6.1) carrying out context matching on the result obtained in the previous step to identify the characters, where the vector representation is calculated as follows:
$v(w_i) = \arg\max_j \, \mathrm{sim}\big(v_{context}(w_i),\, w(i,j)\big) \quad (7)$
where $w_i$ is the input word, $v_{context}(w_i)$ is the context vector, and $w(i,j)$ is the multi-prototype vector representation; after $v(w_i)$ is obtained, the global similarity is calculated, and the words with high similarity are taken as the final matching result;
6.2) replacing the original word in the text with the final matching result, and outputting the final matched text.
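A minimal sketch of formula (7) is given below for illustration. The source does not state which similarity function sim is used, so cosine similarity is assumed here; the function names and the two-dimensional example vectors are likewise hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors (assumed choice of sim)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_prototype(context_vec, prototypes):
    """Return the index j and the prototype w(i, j) that best matches the context vector."""
    sims = [cosine(context_vec, p) for p in prototypes]
    j = int(np.argmax(sims))
    return j, prototypes[j]

# Two hypothetical prototype vectors for one word and a context vector close to the second.
prototypes = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
context = np.array([1.0, 0.1])
index, chosen = select_prototype(context, prototypes)
```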
In the step 7, the identification result is structured, which includes the following processes:
7.1) collecting the recognized document texts, unifying their encoding format and outputting them in Word format;
7.2) reading the Word content, processing it according to set rules, and storing it in a CSV file in structured form.
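The structuring of step 7.2) might look like the following sketch. The "set rule" is not specified in the source, so a simple hypothetical rule is assumed: each line of the form `field: value` is kept if the field is one of the expected column names. Reading the Word file itself is left out, and the sketch starts from the already extracted text; the function names and field names are illustrative.

```python
import csv
import io

def structure_text(text, fields):
    """Apply the assumed 'field: value' rule and return one record per document."""
    record = {field: "" for field in fields}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in record:
            record[key.strip()] = value.strip()
    return record

def to_csv(records, fields):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

fields = ["Name", "Date", "Amount"]
recognized = "Name: Zhang San\nDate: 2020-02-27\nunlabelled noise line"
row = structure_text(recognized, fields)
csv_text = to_csv([row], fields)
```

Fields that the rule does not find are left empty, so every document yields a row with the same columns.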
The beneficial effects of the invention are as follows: deep learning algorithms represented by convolutional neural networks and recurrent neural networks are adopted, and, aimed mainly at electronic documents, high-level abstract attributes are extracted from raw information such as the text of the electronic documents by exploiting the correlation among document characters.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart for constructing an image enhancement model for enhancing and denoising a document according to an embodiment of the present invention;
FIG. 2 is a flowchart for constructing a character recognition model based on application scene matching correction according to an embodiment of the present invention;
FIG. 3 is a flowchart for constructing a character recognition model based on professional lexicon matching correction according to an embodiment of the present invention;
FIG. 4 is a flowchart of the method for identifying documents based on environment multi-type word stock.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For ease of understanding, the method for identifying documents based on environment multi-type word stock disclosed by the embodiment of the invention is first described in detail.
Referring to fig. 1 to 4, a method for identifying documents based on environment multi-type word stocks includes the following steps:
step 1: collecting electronic documents, constructing a data set, collecting professional vocabularies of various industries and constructing a professional lexicon;
step 2: preprocessing the document image in the data set;
step 3: constructing an image enhancement model for enhancing and denoising the document;
step 4: constructing a character recognition model based on application scene matching correction;
step 5: constructing a character recognition model based on professional word stock matching correction;
step 6: constructing a character recognition model based on context matching correction;
step 7: carrying out structuring processing on the recognition result.
Further, in the step 1, electronic documents are collected; the collected data excludes items with obvious quality problems, such as those missing more than forty percent of their content, seriously overexposed, and the like. Professional vocabularies of various industries are collected; the collected vocabularies avoid repetition among industries as far as possible, so as not to introduce redundant interference into the following steps.
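The two selection rules of step 1 can be illustrated with a short sketch. The forty-percent limit comes from the text; everything else is an assumption made for the example: the overexposure test on mean brightness (0 to 255 scale, hypothetical threshold of 240), the interpretation of "avoiding repetition" as dropping terms that occur in more than one industry, and the function names.

```python
from collections import Counter

def keep_document(missing_fraction, mean_brightness, overexposure_threshold=240):
    """Keep a scan unless it misses more than 40% of its content or is badly overexposed."""
    return missing_fraction <= 0.40 and mean_brightness < overexposure_threshold

def build_lexicons(raw_vocab):
    """Drop terms shared by several industries so that the lexicons do not overlap."""
    seen_in = Counter(word for words in raw_vocab.values() for word in set(words))
    return {industry: sorted(word for word in set(words) if seen_in[word] == 1)
            for industry, words in raw_vocab.items()}

# Hypothetical raw vocabularies for two industries.
lexicons = build_lexicons({
    "medical": ["dosage", "invoice", "biopsy"],
    "finance": ["invoice", "ledger"],
})
```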
Further, in the step 2, preprocessing the document in the data set includes the following steps:
2.1) removing the seal from any document containing a seal by a common means;
2.2) using the transformation between the image space and the Hough space on the document image processed in the previous step, mapping each curve or straight line of a given shape in the rectangular coordinate system of the image to a point in the Hough space so that a peak is formed, and finding the angle corresponding to the peak to correct the image tilt.
In the step 3, an image enhancement model for enhancing and denoising the document is constructed, and the method comprises the following processes:
3.1) the input picture is downsampled through a series of convolution layers, and the lower-level features are extracted, wherein the formula for extracting the features is as follows:
Figure RE-GDA0002503921240000061
where $i = 1, 2, \ldots, n$ is the index of each convolutional layer, $c$ and $c'$ are the channel indices of the convolutional layers, $\omega_i$ is the convolution kernel weight matrix, $b_i$ is the bias, and $\sigma$ is the activation function;
3.2) inputting the features extracted in the previous step into two convolutional layers for further processing, and extracting local features;
3.3) inputting the features extracted in 3.1) into a network consisting of two convolutional layers with a stride of 2 and three fully connected layers, and extracting global features;
3.4) fusing the local features and the global features through a point-wise affine transformation, up-sampling the fused features, and applying an affine transformation to obtain the final output.
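The fusion of step 3.4) can be sketched with NumPy as follows. The sketch assumes that the local features form an H × W × C map, that the global features form a single vector, and that the point-wise affine transformation is followed by a ReLU activation; the array shapes, the activation and the function name `fuse_features` are assumptions, and the up-sampling and final affine transformation are not shown.

```python
import numpy as np

def fuse_features(local, global_vec, w_local, w_global, bias):
    """Point-wise affine fusion: the same affine map is applied at every spatial position.

    local:      (H, W, C_local)   local feature map
    global_vec: (C_global,)       global feature vector, shared by all positions
    w_local:    (C_out, C_local)  weights applied to the local features
    w_global:   (C_out, C_global) weights applied to the global features
    bias:       (C_out,)
    """
    pre_activation = local @ w_local.T + global_vec @ w_global.T + bias
    return np.maximum(pre_activation, 0.0)    # ReLU, an assumed choice of activation

# Tiny hypothetical example: a 2 x 3 map with two channels and a one-element global vector.
local = np.ones((2, 3, 2))
fused = fuse_features(local,
                      global_vec=np.array([2.0]),
                      w_local=np.eye(2),
                      w_global=np.array([[1.0], [-3.0]]),
                      bias=np.zeros(2))
```

Because the global vector is broadcast over all positions, it shifts each output channel uniformly while the local features keep the spatial detail.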
In the step 4, constructing a character recognition model based on application scene matching correction includes the following processes:
4.1) extracting base network features of size W × H × C (width × height × channels); further extracting features from them with a sliding window and using these features to predict a number of candidate target regions; inputting the resulting features into a bidirectional LSTM and feeding its output into a fully connected layer to obtain dense target text; suppressing redundant noise interference; and merging the resulting text segments into text lines with a text-line construction method;
4.2) after preprocessing the text, constructing a node association graph according to the general rules of the industry, voting on the different nodes, and calculating a voting value (PRL) with the following formula:
Figure RE-GDA0002503921240000062
where $N_u$ denotes the $u$-th node in the association graph, $r_u$ denotes the degree of similarity between $N_u$ and all nodes in the graph, $d$ is the damping coefficient, generally taken as 0.85, $e$ is the number of incoming edges of $N_u$, and $C(N_v)$ is the degree of node $N_v$;
4.3) performing a random walk on the association graph according to the above formula and iteratively calculating the PRL value of each node until the termination formula is satisfied, at which point the node scores have converged; the termination formula is as follows:
Figure RE-GDA0002503921240000063
where $N$ is the number of nodes, $t$ is the number of random-walk steps, and $\sigma$ is the random-walk termination threshold, commonly $\sigma = 10^{-4}$;
4.4) sorting the PRL scores in descending order, then performing semantic clustering, and finally taking all cluster centers as the keyword extraction result of the text;
4.5) classifying the different document texts according to the keywords, and proceeding to the next matching step.
In the step 5, the method for constructing the character recognition model based on the matching correction of the professional lexicon comprises the following processes:
5.1) matching the documents classified in the step 4 with the corresponding professional lexicons according to their types;
5.2) since a document is related to a professional lexicon, the similarity of the words in question is calculated for each lexicon and given different weights; the matching formula is derived as follows:
Figure RE-GDA0002503921240000071
where $P(w)$ is the probability of the word $w$ occurring in this sequence. Since the probability value ranges from 0 to 1, relationships within the data are not easy to find; to support statistical inference on the data, the probability $P(w)$ is therefore transformed, and the transformed probability score is $P_F$. The role of $G_1$ is to shift the logarithm from negative to positive so that the optimized probability score is non-negative, and the role of $G_2$ is to ensure that the probability score lies between 0 and 1; in this formula $G_2 = G_1 + 1$;
Candidate words matched in different lexicons have different likelihoods, so a weight coefficient $\theta$ is introduced to improve the ranking of the candidate words; the overall ranking score of a candidate word is calculated as follows:
$ES = \theta \cdot P_F \quad (5)$
The higher the similarity, the larger $\theta$. The formula for $\theta$ is as follows:
Figure RE-GDA0002503921240000072
where $LD_P$ is the glyph similarity between the originally recognized word and the candidate word, $LD_w$ is the pronunciation similarity between the originally recognized word and the candidate word, and $h$ is a reward-and-penalty coefficient with range $0 < h \le 1$; the higher the similarity, the closer $h$ is to 1;
5.3) the five candidate words with the highest ES values are passed to the next step.
In the step 6, the method for constructing the text recognition model based on the context matching correction comprises the following processes:
6.1) carrying out context matching on the result obtained in the previous step to identify the characters, where the vector representation is calculated as follows:
$v(w_i) = \arg\max_j \, \mathrm{sim}\big(v_{context}(w_i),\, w(i,j)\big) \quad (7)$
where $w_i$ is the input word, $v_{context}(w_i)$ is the context vector, and $w(i,j)$ is the multi-prototype vector representation; after $v(w_i)$ is obtained, the global similarity is calculated, and the words with high similarity are taken as the final matching result;
6.2) replacing the original word in the text with the final matching result, and outputting the final matched text.
In the step 7, the identification result is structured, which includes the following processes:
7.1) collecting the recognized document texts, unifying their encoding format and outputting them in Word format;
7.2) reading the Word content, processing it according to set rules, and storing it in a CSV file in structured form.
In the medical document identification method provided by the embodiment of the present invention, the instruction included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, and details are not described here.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method for identifying documents based on environment multi-type word stock is characterized by comprising the following steps:
step 1: collecting electronic documents, constructing a data set, collecting professional vocabularies of various industries and constructing a professional lexicon;
step 2: preprocessing the document image in the data set;
step 3: constructing an image enhancement model for enhancing and denoising the document;
step 4: constructing a character recognition model based on application scene matching correction;
step 5: constructing a character recognition model based on professional word stock matching correction;
step 6: constructing a character recognition model based on context matching correction;
step 7: carrying out structuring processing on the recognition result.
2. The method of claim 1, wherein: in the step 1, electronic documents are collected, and the collected data excludes items with obvious quality problems, such as those missing more than forty percent of their content or seriously overexposed; professional vocabularies of various industries are collected, and the collected vocabularies avoid repetition of professional vocabulary among industries as far as possible, so as not to introduce redundant interference into the following steps.
3. The method for identifying documents based on environment multi-type word stock according to claim 1 or 2, characterized in that: in the step 2, the preprocessing of the documents in the data set includes the following processes:
2.1) removing any seal contained in the document;
2.2) using the transformation between the image space and the Hough space on the document image processed in the previous step, mapping each curve or straight line of a given shape in the rectangular coordinate system of the image to a point in the Hough space so that a peak is formed, and finding the angle corresponding to the peak to correct the image tilt.
4. The method for identifying documents based on an environment multi-type word bank according to claim 1 or 2, characterized in that: in step 3, constructing the image enhancement model for enhancing and denoising the documents comprises the following processes:
3.1) down-sampling the input picture through a series of convolution layers and extracting low-level features, the formula for extracting the features being:
Figure FDA0002393369340000011
where i = 1, 2, …, n is the index of each convolution layer, c and c' are the channel indices of the convolution layers, ω_i is the convolution kernel weight matrix, b_i is the bias, and σ is the activation function;
3.2) feeding the features extracted in the previous process into two convolution layers for further processing, and extracting local features;
3.3) feeding the features extracted in 3.1) into a network consisting of two convolution layers with a stride of 2 and three fully connected layers, and extracting global features;
3.4) fusing the local features and the global features through a point-wise affine transformation, then up-sampling the fused features and applying an affine transformation to obtain the final output.
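The fusion in 3.4) can be sketched as follows. The claims do not give the fusion formula, so this example assumes one common form: at every pixel, a linear map of the local feature vector is added to a linear map of the global feature vector plus a bias, followed by a ReLU activation. The function name `fuse_features` and all weights are hypothetical.

```python
def fuse_features(local_map, global_vec, w_local, w_global, bias):
    """Point-wise affine fusion of a local feature map with a global vector.

    local_map: H x W x C nested lists of floats.
    global_vec: list of C floats describing the whole image.
    w_local, w_global: C_out x C weight matrices (assumed, not from the claims).
    bias: list of C_out floats.
    """
    # The global contribution is identical at every pixel, so compute it once.
    global_term = [
        sum(w * g for w, g in zip(w_row, global_vec)) + b
        for w_row, b in zip(w_global, bias)
    ]
    fused = []
    for row in local_map:
        fused_row = []
        for pixel in row:
            fused_row.append([
                # ReLU of (local affine part + shared global part).
                max(0.0, sum(w * p for w, p in zip(w_row, pixel)) + g)
                for w_row, g in zip(w_local, global_term)
            ])
        fused.append(fused_row)
    return fused
```

Because the global term is shared by all pixels, it acts as an image-wide offset on the local features, which is one way a global descriptor can steer per-pixel enhancement.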
5. The method for identifying documents based on an environment multi-type word bank according to claim 1 or 2, characterized in that: in step 4, constructing the character recognition model based on application scene matching correction comprises the following processes:
4.1) extracting basic network features of size W × H × C, further extracting features from them with a sliding window, and using these features to predict a plurality of candidate target regions;
4.2) after preprocessing the text, constructing a node association graph according to general industry rules, voting among the nodes, and calculating a voting value (PRL) with the following formula:
Figure FDA0002393369340000021
wherein N_u represents the u-th node in the association graph, r_u represents the similarity between N_u and all nodes in the graph, d denotes the damping coefficient, e denotes the number of incoming edges of N_u, and C(N_v) represents the degree of node N_v;
4.3) performing a random walk on the association graph according to the above formula and iteratively calculating the PRL value of each node until the termination formula is satisfied and the node scores converge, the termination formula being:
Figure FDA0002393369340000022
wherein N is the number of nodes, t is the number of random walk steps, and σ is the random walk termination threshold;
4.4) sorting the PRL scores in descending order, performing semantic clustering, and finally taking all cluster centers as the keyword extraction result of the text;
4.5) classifying the different document texts according to the keywords, for matching in the next step.
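The voting and convergence loop of 4.2) to 4.4) resembles a PageRank-style iteration. Since the exact PRL formula and termination formula are given only as figures, the sketch below substitutes the standard weighted PageRank update and stops when the summed absolute score change falls below a threshold; both are assumptions, as are the names `rank_nodes` and `top_keywords`. The semantic clustering of 4.4) is not shown.

```python
def rank_nodes(edges, damping=0.85, threshold=1e-6, max_steps=1000):
    """Iterate PageRank-style scores on a weighted, undirected association graph.

    edges: dict mapping (u, v) node pairs to a positive similarity weight.
    Returns a dict mapping each node to its converged score.
    """
    neighbours = {}
    for (u, v), weight in edges.items():
        neighbours.setdefault(u, {})[v] = weight
        neighbours.setdefault(v, {})[u] = weight
    n = len(neighbours)
    scores = {node: 1.0 / n for node in neighbours}
    total_weight = {node: sum(nb.values()) for node, nb in neighbours.items()}
    for _ in range(max_steps):
        updated = {}
        for node, nb in neighbours.items():
            # Each neighbour passes on its score in proportion to the edge weight.
            incoming = sum(w * scores[v] / total_weight[v] for v, w in nb.items())
            updated[node] = (1.0 - damping) / n + damping * incoming
        change = sum(abs(updated[k] - scores[k]) for k in scores)
        scores = updated
        if change < threshold:  # assumed stand-in for the termination formula
            break
    return scores

def top_keywords(edges, k=3):
    """Return the k highest-scoring nodes in descending score order."""
    scores = rank_nodes(edges)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Usage: a small graph in which "invoice" co-occurs with every other term.
graph = {("invoice", "amount"): 1.0, ("invoice", "date"): 1.0, ("invoice", "tax"): 1.0}
print(top_keywords(graph, k=1))
```

The top-ranked terms would then be clustered and used to assign each document to an application scene before word bank matching.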
6. The method for identifying documents based on an environment multi-type word bank according to claim 1 or 2, characterized in that: in step 5, constructing the character recognition model based on professional word bank matching correction comprises the following processes:
5.1) matching the documents classified in step 4 with the corresponding professional word banks according to their types;
5.2) since each document is related to a professional word bank, calculating the similarity of the candidate words for each word bank and assigning them different weights, the matching formula being derived as follows:
Figure FDA0002393369340000031
wherein P(w) is the probability of the word w in the sequence; because the probability value ranges from 0 to 1, relationships within the data are not easy to discover, so for better statistical inference the probability P(w) is subjected to a data transformation, the transformed probability score being P_F; the role of G_1 is to change the logarithm from negative to positive so as to ensure the non-negativity of the optimized probability score, and the role of G_2 is to ensure that the probability score ranges between 0 and 1; in this formula G_2 = G_1 + 1;
candidate words matched in different word banks have different likelihoods, so a weight coefficient θ is introduced when ranking the candidate words, and the overall ranking score of a candidate word is calculated as:
ES = θ · P_F    (5)
the higher the similarity, the larger θ; θ is calculated as:
Figure FDA0002393369340000032
wherein LD_P is the glyph similarity between the originally recognized word and the candidate word, LD_w is the pronunciation similarity between the originally recognized word and the candidate word, and h is a reward-and-punishment coefficient with 0 < h ≤ 1, h being closer to 1 the higher the similarity;
5.3) sending the five candidate words with the largest ES values to the next step.
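The ranking of 5.2) and 5.3) reduces to scoring each candidate with formula (5) and keeping the five best. In the sketch below, θ and P_F are taken as precomputed inputs because their formulas appear only as figures; the function name `rank_candidates` and the sample values are hypothetical.

```python
def rank_candidates(candidates, top_k=5):
    """Rank candidate words by ES = theta * P_F and keep the top_k.

    candidates: iterable of (word, theta, p_f) tuples, where theta is the
    weight coefficient and p_f the transformed probability score.
    Returns a list of (word, ES) pairs in descending ES order.
    """
    scored = [(theta * p_f, word) for word, theta, p_f in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(word, es) for es, word in scored[:top_k]]

# Usage with made-up weights and probability scores.
sample = [
    ("invoice", 0.9, 0.8),
    ("invoke", 0.6, 0.7),
    ("involve", 0.5, 0.6),
    ("inverse", 0.4, 0.5),
    ("invite", 0.3, 0.4),
    ("ivory", 0.1, 0.2),
]
for word, es in rank_candidates(sample):
    print(word, round(es, 3))
```

Keeping several candidates rather than only the best one leaves room for the context matching in step 6 to make the final choice.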
7. The method for identifying documents based on an environment multi-type word bank according to claim 1 or 2, characterized in that: in step 6, constructing the character recognition model based on context matching correction comprises the following processes:
6.1) performing context matching on the result of the previous step to recognize the characters, the vector representation being calculated as:
v(w_i) = argmax_j sim(v_context(w_i), w(i, j))    (7)
wherein w_i is the input word, v_context(w_i) is the context vector, and w(i, j) is the multi-prototype vector representation; after v(w_i) is obtained, the global similarity is calculated and the words with high similarity are taken as the final matching result;
6.2) inserting the final matching result into the text and outputting the final matched text.
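Formula (7) selects, among the prototype vectors of a word, the one most similar to its context vector. The sketch below assumes cosine similarity for sim, which the claims do not specify; the names `cosine` and `select_prototype` are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_prototype(context_vector, prototypes):
    """Return (j, prototype) maximizing sim(v_context, w(i, j)) over j."""
    best_j = max(range(len(prototypes)),
                 key=lambda j: cosine(context_vector, prototypes[j]))
    return best_j, prototypes[best_j]

# Usage: two senses of one word; the context is closer to the second sense.
index, vector = select_prototype([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]])
print(index, vector)
```

The selected vector plays the role of v(w_i) in the subsequent global similarity calculation.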
8. The method for identifying documents based on an environment multi-type word bank according to claim 1 or 2, characterized in that: in step 7, structuring the recognition result comprises the following processes:
7.1) aggregating the recognized document texts, unifying their encoding format, and outputting them in Word format;
7.2) reading the Word content, processing it according to set rules, and storing it in structured form in a CSV file.
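The structuring in 7.2) can be sketched with the standard library alone. The example assumes the Word content has already been read into a plain string and that the "set rule" is a simple "field: value" line pattern; both are assumptions for illustration, as are the names `structure_text` and `rows_to_csv`.

```python
import csv
import io

def structure_text(text, delimiter=":"):
    """Turn 'field: value' lines into a list of row dictionaries.

    Lines without the delimiter are skipped as unstructured noise.
    """
    rows = []
    for line in text.splitlines():
        if delimiter not in line:
            continue
        field, value = line.split(delimiter, 1)
        rows.append({"field": field.strip(), "value": value.strip()})
    return rows

def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["field", "value"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Usage with a made-up recognition result.
recognized = "Invoice No: 12345\nnoise line\nAmount: 88.50"
print(rows_to_csv(structure_text(recognized)))
```

Real documents would need rules specific to each document type, and reading an actual Word file would require a third-party library that is not shown here.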
CN202010122436.3A 2020-02-27 2020-02-27 Method for identifying documents based on environment multi-class word stock Active CN111461109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010122436.3A CN111461109B (en) 2020-02-27 2020-02-27 Method for identifying documents based on environment multi-class word stock

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010122436.3A CN111461109B (en) 2020-02-27 2020-02-27 Method for identifying documents based on environment multi-class word stock

Publications (2)

Publication Number Publication Date
CN111461109A true CN111461109A (en) 2020-07-28
CN111461109B CN111461109B (en) 2023-09-15

Family

ID=71685055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010122436.3A Active CN111461109B (en) 2020-02-27 2020-02-27 Method for identifying documents based on environment multi-class word stock

Country Status (1)

Country Link
CN (1) CN111461109B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633389A (en) * 2020-12-28 2021-04-09 西北工业大学 Method for calculating trend of hurricane motion track based on MDL and speed direction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3139279A1 (en) * 2015-09-01 2017-03-08 Dream It Get IT Limited Media unit retrieval and related processes
US20180137551A1 (en) * 2016-11-11 2018-05-17 Ebay Inc. Intelligent online personal assistant with image text localization
CN108132927A (en) * 2017-12-07 2018-06-08 西北师范大学 A kind of fusion graph structure and the associated keyword extracting method of node
CN110321925A (en) * 2019-05-24 2019-10-11 中国工程物理研究院计算机应用研究所 A kind of more granularity similarity comparison methods of text based on semantics fusion fingerprint

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
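YU-JIE XIONG et al.: "Chinese Writer Identification Using Contour-Directional Feature and Character Pair Similarity Measurement" *
李琴 et al.: "Research and Application of Character Recognition and Correction Technology for Medical Text Images" *
邵文良: "Research and Implementation of Key Technologies for Image and Text Recognition of Medical Documents Based on Deep Learning" *
郭鸿奇 et al.: "A Sentence Similarity Calculation Method Based on Multi-Prototype Word Vector Representation" *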

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633389A (en) * 2020-12-28 2021-04-09 西北工业大学 Method for calculating trend of hurricane motion track based on MDL and speed direction
CN112633389B (en) * 2020-12-28 2024-01-23 西北工业大学 Hurricane movement track trend calculation method based on MDL and speed direction

Also Published As

Publication number Publication date
CN111461109B (en) 2023-09-15

Similar Documents

Publication Publication Date Title
RU2757713C1 (en) Handwriting recognition using neural networks
WO2007080642A1 (en) Sheet slip processing program and sheet slip program device
CN113033183B (en) Network new word discovery method and system based on statistics and similarity
CN111461108A (en) Medical document identification method
Gordo et al. Document classification and page stream segmentation for digital mailroom applications
Boillet et al. Robust text line detection in historical documents: learning and evaluation methods
CN114818718A (en) Contract text recognition method and device
US8340428B2 (en) Unsupervised writer style adaptation for handwritten word spotting
JPH11328317A (en) Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded
Al Ghamdi A novel approach to printed Arabic optical character recognition
CN111461109B (en) Method for identifying documents based on environment multi-class word stock
Chandra et al. Optical character recognition-A review
Reisi et al. Authorship attribution in historical and literary texts by a deep learning classifier
Vitadhani et al. Detection of clickbait thumbnails on YouTube using tesseract-OCR, face recognition, and text alteration
CN113762160A (en) Date extraction method and device, computer equipment and storage medium
US20140093173A1 (en) Classifying a string formed from hand-written characters
Desai et al. A Survey On Automatic Subjective Answer Evaluation
CN113051886A (en) Test question duplicate checking method and device, storage medium and equipment
Oprean et al. Handwritten word recognition using Web resources and recurrent neural networks
CN111611379A (en) Text information classification method, device, equipment and readable storage medium
Nisa et al. Annotation of struck-out text in handwritten documents
Maarouf et al. Correcting optical character recognition result via a novel approach
US20230419367A1 (en) Apparatus and method for communicating with users
US20230343122A1 (en) Performing optical character recognition based on fuzzy pattern search generated using image transformation
Nordin Hällgren Reading Key Figures from Annual Reports

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant