CN111461109A - Method for identifying documents based on environment multi-type word bank - Google Patents
- Publication number
- CN111461109A (application number CN202010122436.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- constructing
- features
- professional
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
- G06V30/268—Lexical context
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Character Discrimination (AREA)
Abstract
A method for identifying documents based on environment multi-type word banks comprises the following steps: step 1: collecting electronic documents and constructing a data set, and collecting professional vocabularies of various industries and constructing a professional lexicon; step 2: preprocessing the document images in the data set; step 3: constructing an image enhancement model for enhancing and denoising the documents; step 4: constructing a character recognition model based on application scene matching correction; step 5: constructing a character recognition model based on professional lexicon matching correction; step 6: constructing a character recognition model based on context matching correction; step 7: structuring the recognition result. The method adopts deep learning algorithms represented by convolutional neural networks and recurrent neural networks, is aimed mainly at electronic documents, and uses the correlation among document characters to extract high-level abstract attributes from original information such as the text of the electronic documents.
Description
Technical Field
The invention relates to computer vision, network science and deep neural networks, and in particular to a method for identifying documents based on environment multi-type word banks.
Background
Since the Eastern Han dynasty, people have recorded information on paper: bank notes, business contracts, financial statements, citizens' personal files and the like are all recorded in paper documents. In the medical and health field, a large number of paper documents are generated while patients receive treatment. These paper documents later serve many purposes as important vouchers.
Because of the need for data accumulation and regulatory requirements, demand for capturing information from original documents is strong in many industries. Constrained by cost pressure, however, most existing practice only captures invoice information through Business Process Outsourcing (BPO), and the information in other documents often becomes dormant data. Processing these documents purely by hand has many problems: it is time-consuming and labor-intensive, extremely inefficient, and inconvenient for later statistics and retrieval. A method capable of rapidly recognizing documents is therefore urgently needed.
Automatic document recognition faces several difficulties: documents commonly bear seals, which interfere with character recognition and therefore need to be removed; scanned document images are generally tilted by a small angle, which seriously affects the segmentation of text blocks, so tilted images need to be corrected; and scans vary in clarity, which challenges recognition accuracy, so recognized words need to be checked for errors and matched to the correct words.
In summary, conventional document recognition still has many problems that urgently need solving. Owing to factors such as the limited printing precision of the documents themselves, documents are prone to misalignment, misplaced lines and surface stains, and there is as yet no effective solution to the challenges these pose for correct recognition.
Disclosure of Invention
In view of the above, the invention provides a method for identifying documents based on environment multi-type word banks. The method adopts deep learning algorithms represented by convolutional neural networks and recurrent neural networks, is aimed mainly at electronic documents, and uses the correlation among document characters to extract high-level abstract attributes from original information such as the text of the electronic documents.
The technical solution adopted by the invention to solve the technical problem is as follows:
A method for identifying documents based on environment multi-type word banks comprises the following steps:
step 1: collecting electronic documents and constructing a data set; collecting professional vocabularies of various industries and constructing a professional lexicon;
step 2: preprocessing the document images in the data set;
step 3: constructing an image enhancement model for enhancing and denoising the documents;
step 4: constructing a character recognition model based on application scene matching correction;
step 5: constructing a character recognition model based on professional lexicon matching correction;
step 6: constructing a character recognition model based on context matching correction;
step 7: structuring the recognition result.
Further, in step 1, when the electronic documents are collected, the collected data excludes documents with obvious quality problems, such as documents missing more than forty percent of their content or severely overexposed; when the professional vocabularies of various industries are collected, duplication across industries is avoided as far as possible, so as not to introduce redundant interference into the subsequent steps.
Further, in step 2, preprocessing the documents in the data set comprises the following steps:
2.1) removing the seal from any document that contains a seal;
2.2) for the document image obtained from the previous step, using the transformation between the image space and the Hough space to map each straight line or curve of a given shape in the rectangular coordinate system of the image to a point in the Hough space, so that peaks are formed; the angle corresponding to the peak is then found and used to correct the image tilt.
In step 3, constructing the image enhancement model for enhancing and denoising the documents comprises the following processes:
3.1) the input image is downsampled through a series of convolutional layers to extract low-level features, where the formula for extracting the features is as follows:
where $i = 1, 2, \ldots, n$ is the index of each convolutional layer, $c$ and $c'$ are the channel indices of the convolutional layer, $\omega_i$ is the convolution kernel weight matrix, $b_i$ is the bias, and $\sigma$ is the activation function;
3.2) feeding the features extracted in the previous step into two convolutional layers for further processing to extract local features;
3.3) feeding the features extracted in 3.1) into a network consisting of two convolutional layers with stride 2 and three fully connected layers to extract global features;
3.4) fusing the local features and the global features through a point-wise affine transformation, upsampling the fused features, and applying an affine transformation to obtain the final output.
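One plausible reading of the feature-extraction formula in 3.1), consistent with the symbol definitions given there, is a standard strided convolution. The feature-map symbol $F$, the stride $s$ and the spatial offsets $x', y'$ are assumptions introduced here for illustration and are not taken from the source:

```latex
F^{i}_{c}[x, y] = \sigma\Big( b^{i}_{c} + \sum_{x',\, y',\, c'} \omega^{i}_{c c'}[x', y'] \; F^{i-1}_{c'}[\, s x + x',\; s y + y' \,] \Big), \qquad i = 1, 2, \ldots, n
```

Under this reading, each layer $i$ shrinks the spatial resolution by the factor $s$ while mixing the input channels $c'$ into the output channels $c$.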
In step 4, constructing the character recognition model based on application scene matching correction comprises the following processes:
4.1) extracting base network features of size W × H × C (width × height × channels); sliding a window over these features to extract further features, which are used to predict a number of candidate target regions; feeding the resulting features into a bidirectional LSTM and passing its output through a fully connected layer to obtain dense target text regions; suppressing redundant noise; and merging the resulting text segments into text lines using a text line construction method;
4.2) after preprocessing the text, constructing a node association graph according to general industry rules, voting on the nodes, and calculating a voting value (PRL), where the calculation formula is as follows:
where $N_u$ denotes the u-th node in the association graph, $r_u$ denotes the similarity between $N_u$ and all nodes in the graph, $d$ denotes the damping coefficient, generally taken as 0.85, $e$ denotes the number of incoming edges of $N_u$, and $C(N_v)$ denotes the degree of node $N_v$;
4.3) performing a random walk on the association graph according to the formula and iteratively calculating the PRL value of each node until the termination formula is satisfied, at which point the node scores have converged, where the termination formula is as follows:
where $N$ is the number of nodes, $t$ is the number of random walk steps, and $\sigma$ is the random walk termination threshold, commonly $\sigma = 10^{-4}$;
4.4) sorting the PRL scores in descending order, performing semantic clustering, and finally taking all cluster centers as the keyword extraction result of the text;
4.5) classifying the document texts according to the keywords for matching in the next step.
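The text line construction mentioned in 4.1) can be illustrated with a small sketch. This is one common way to chain detected text segments, not necessarily the one used by the invention: the box format `(x0, y0, x1, y1)`, the function names, and the `max_gap` and `min_overlap` thresholds are all assumptions.

```python
def vertical_overlap(a, b):
    """Overlap of two boxes' vertical extents divided by the smaller height."""
    top, bottom = max(a[1], b[1]), min(a[3], b[3])
    return max(0, bottom - top) / min(a[3] - a[1], b[3] - b[1])

def merge_segments(boxes, max_gap=20, min_overlap=0.7):
    """Chain left-to-right boxes (x0, y0, x1, y1) into text lines."""
    lines = []
    for box in sorted(boxes):
        for i, line in enumerate(lines):
            # A segment joins a line when it starts near the line's right edge
            # and sits at roughly the same height.
            close = box[0] - line[2] <= max_gap
            if close and vertical_overlap(line, box) >= min_overlap:
                lines[i] = (min(line[0], box[0]), min(line[1], box[1]),
                            max(line[2], box[2]), max(line[3], box[3]))
                break
        else:
            lines.append(box)
    return lines

# Three segments on one baseline plus one segment on a lower line.
boxes = [(0, 10, 16, 30), (18, 11, 34, 31), (36, 10, 52, 30), (0, 60, 16, 80)]
lines = merge_segments(boxes)
```

With these sample boxes the three upper segments collapse into a single line box and the lower segment stays separate.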
In step 5, constructing the character recognition model based on professional lexicon matching correction comprises the following processes:
5.1) matching the documents classified in step 4 with the corresponding professional lexicons according to their types;
5.2) since each document is related to a professional lexicon, the similarity of words that show discrepancies is calculated for each lexicon and weighted differently, and the matching formula is derived as follows:
where $P(w)$ is the probability of the word $w$ occurring in the sequence. Because a probability lies between 0 and 1, relationships in the data are hard to discern; to support statistical inference on the data, $P(w)$ is transformed, and the transformed probability score is denoted $P_F$. $G_1$ shifts the logarithm from negative to positive, ensuring that the optimized probability score is non-negative; $G_2$ ensures that the probability score lies between 0 and 1; in this formula $G_2 = G_1 + 1$;
Candidate words matched in different lexicons differ in likelihood, so a weight coefficient $\theta$ is introduced when ranking the candidate words, and the overall ranking score of a candidate word is calculated as follows:
$ES = \theta \cdot P_F \quad (5)$
The higher the similarity, the larger $\theta$. The formula for $\theta$ is as follows:
where $LD_P$ is the glyph similarity between the originally recognized word and the candidate word, $LD_w$ is the pronunciation similarity between the originally recognized word and the candidate word, and $h$ is a reward-and-penalty coefficient with $0 < h \le 1$; the higher the similarity, the closer $h$ is to 1;
5.3) the five candidate words with the highest ES values are passed to the next step.
In step 6, constructing the character recognition model based on context matching correction comprises the following processes:
6.1) performing context matching on the results of the previous step to recognize the characters, where the vector representation is calculated as follows:
$v(w_i) = \arg\max_{j} \operatorname{sim}\big(v_{context}(w_i),\, w(i,j)\big) \quad (7)$
where $w_i$ is the input word, $v_{context}(w_i)$ is its context vector, and $w(i,j)$ is the j-th multi-prototype vector representation of the word; after $v(w_i)$ is obtained, the global similarity is calculated, and the word with the highest similarity is taken as the final matching result;
6.2) replacing the original word in the text with the final matching result and outputting the final matched text.
In step 7, structuring the recognition result comprises the following processes:
7.1) aggregating the recognized document texts, unifying their encoding format, and outputting them in Word format;
7.2) reading the Word content, processing it according to preset rules, and storing it in structured form in a CSV file.
The beneficial effects of the invention are as follows: deep learning algorithms represented by convolutional neural networks and recurrent neural networks are adopted; the method is aimed mainly at electronic documents and uses the correlation among document characters to extract high-level abstract attributes from original information such as the text of the electronic documents.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below evidently show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of constructing the image enhancement model for enhancing and denoising documents according to an embodiment of the present invention;
FIG. 2 is a flowchart of constructing the character recognition model based on application scene matching correction according to an embodiment of the present invention;
FIG. 3 is a flowchart of constructing the character recognition model based on professional lexicon matching correction according to an embodiment of the present invention;
FIG. 4 is a flowchart of the method for identifying documents based on environment multi-type word banks.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art on the basis of the embodiments given herein without creative effort fall within the protection scope of the present invention.
To facilitate understanding of the embodiment, the method for identifying documents based on environment multi-type word banks disclosed by the embodiment of the invention is first described in detail.
Referring to FIG. 1 to FIG. 4, a method for identifying documents based on environment multi-type word banks comprises the following steps:
step 1: collecting electronic documents and constructing a data set; collecting professional vocabularies of various industries and constructing a professional lexicon;
step 2: preprocessing the document images in the data set;
step 3: constructing an image enhancement model for enhancing and denoising the documents;
step 4: constructing a character recognition model based on application scene matching correction;
step 5: constructing a character recognition model based on professional lexicon matching correction;
step 6: constructing a character recognition model based on context matching correction;
step 7: structuring the recognition result.
Further, in step 1, when the electronic documents are collected, the collected data excludes documents with obvious quality problems, such as documents missing more than forty percent of their content or severely overexposed; when the professional vocabularies of various industries are collected, duplication across industries is avoided as far as possible, so as not to introduce redundant interference into the subsequent steps.
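The collection rules of step 1 can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the `ScannedDoc` record, its `missing_fraction` and `overexposed` fields, and the policy of keeping a duplicated term only in the first industry that lists it are assumptions; only the forty-percent threshold comes from the text.

```python
from dataclasses import dataclass

@dataclass
class ScannedDoc:
    name: str
    missing_fraction: float  # hypothetical: share of content that is missing, 0..1
    overexposed: bool        # hypothetical: flag set by an upstream quality check

def filter_documents(docs, max_missing=0.40):
    """Keep documents that are not overexposed and miss at most 40% of their content."""
    return [d for d in docs if d.missing_fraction <= max_missing and not d.overexposed]

def build_lexicons(vocab_by_industry):
    """Build one lexicon per industry; a term claimed by an earlier industry is skipped."""
    seen = set()
    lexicons = {}
    for industry, terms in vocab_by_industry.items():
        kept = []
        for term in terms:
            if term not in seen:
                seen.add(term)
                kept.append(term)
        lexicons[industry] = kept
    return lexicons

docs = [
    ScannedDoc("a.png", 0.10, False),
    ScannedDoc("b.png", 0.55, False),  # too much content missing
    ScannedDoc("c.png", 0.05, True),   # overexposed
]
kept_docs = filter_documents(docs)
lexicons = build_lexicons({
    "medical": ["dosage", "invoice", "diagnosis"],
    "finance": ["invoice", "ledger"],
})
```

Here only `a.png` survives the quality filter, and the term shared by both industries stays in the medical lexicon only.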
Further, in step 2, preprocessing the documents in the data set comprises the following steps:
2.1) removing the seal from any document that contains a seal, using conventional means;
2.2) for the document image obtained from the previous step, using the transformation between the image space and the Hough space to map each straight line or curve of a given shape in the rectangular coordinate system of the image to a point in the Hough space, so that peaks are formed; the angle corresponding to the peak is then found and used to correct the image tilt.
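The tilt estimation in 2.2) can be sketched with a small Hough accumulator over candidate angles. The sketch assumes the foreground pixels are already available as `(x, y)` pairs, searches whole degrees within an assumed range of ±15°, and only estimates the angle; rotating the image back by that angle is left out. The function name and parameters are illustrative.

```python
import math
from collections import Counter

def estimate_skew_degrees(points, max_angle=15):
    """Return the integer angle (degrees) whose Hough accumulator holds the highest peak."""
    best_angle, best_votes = 0, -1
    for angle in range(-max_angle, max_angle + 1):
        a = math.radians(angle)
        # A line with direction angle `a` satisfies rho = -x*sin(a) + y*cos(a),
        # so collinear points vote for the same (angle, rho) cell.
        votes = Counter(round(-x * math.sin(a) + y * math.cos(a)) for x, y in points)
        peak = max(votes.values())
        if peak > best_votes:
            best_angle, best_votes = angle, peak
    return best_angle

# Synthetic text baseline: pixels along a line tilted by 5 degrees.
points = [(x, round(20 + x * math.tan(math.radians(5)))) for x in range(100)]
skew = estimate_skew_degrees(points)
```

At the true tilt nearly all points fall into one accumulator cell, whereas at neighboring angles the votes spread over several cells, which is why the peak identifies the angle.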
In step 3, constructing the image enhancement model for enhancing and denoising the documents comprises the following processes:
3.1) the input image is downsampled through a series of convolutional layers to extract low-level features, where the formula for extracting the features is as follows:
where $i = 1, 2, \ldots, n$ is the index of each convolutional layer, $c$ and $c'$ are the channel indices of the convolutional layer, $\omega_i$ is the convolution kernel weight matrix, $b_i$ is the bias, and $\sigma$ is the activation function;
3.2) feeding the features extracted in the previous step into two convolutional layers for further processing to extract local features;
3.3) feeding the features extracted in 3.1) into a network consisting of two convolutional layers with stride 2 and three fully connected layers to extract global features;
3.4) fusing the local features and the global features through a point-wise affine transformation, upsampling the fused features, and applying an affine transformation to obtain the final output.
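The point-wise fusion in 3.4) can be illustrated as follows. The exact fusion formula is not given in the text, so this sketch assumes one common form: at every pixel, an affine map of the local feature vector plus an affine map of the single global feature vector, followed by a ReLU activation. The weight matrices, the bias and the choice of ReLU are assumptions.

```python
def fuse_features(local, global_vec, w_local, w_global, bias):
    """Fuse an H x W x C local feature map with a length-C global vector, pixel by pixel."""
    def relu(v):
        return v if v > 0.0 else 0.0

    channels = len(bias)
    # The global contribution is identical at every pixel, so compute it once.
    global_term = [
        sum(w_global[c][k] * global_vec[k] for k in range(len(global_vec))) + bias[c]
        for c in range(channels)
    ]
    fused = []
    for row in local:
        fused_row = []
        for pixel in row:
            fused_row.append([
                relu(sum(w_local[c][k] * pixel[k] for k in range(len(pixel))) + global_term[c])
                for c in range(channels)
            ])
        fused.append(fused_row)
    return fused

identity = [[1.0, 0.0], [0.0, 1.0]]
local = [[[1.0, -2.0], [0.5, 0.5]]]  # a 1 x 2 feature map with 2 channels
fused = fuse_features(local, [0.5, 0.5], identity, identity, [0.0, 0.0])
```

With identity weights the global vector is simply added to each pixel, and the negative channel is clipped to zero by the activation.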
In step 4, constructing the character recognition model based on application scene matching correction comprises the following processes:
4.1) extracting base network features of size W × H × C (width × height × channels); sliding a window over these features to extract further features, which are used to predict a number of candidate target regions; feeding the resulting features into a bidirectional LSTM and passing its output through a fully connected layer to obtain dense target text regions; suppressing redundant noise; and merging the resulting text segments into text lines using a text line construction method;
4.2) after preprocessing the text, constructing a node association graph according to general industry rules, voting on the nodes, and calculating a voting value (PRL), where the calculation formula is as follows:
where $N_u$ denotes the u-th node in the association graph, $r_u$ denotes the similarity between $N_u$ and all nodes in the graph, $d$ denotes the damping coefficient, generally taken as 0.85, $e$ denotes the number of incoming edges of $N_u$, and $C(N_v)$ denotes the degree of node $N_v$;
4.3) performing a random walk on the association graph according to the formula and iteratively calculating the PRL value of each node until the termination formula is satisfied, at which point the node scores have converged, where the termination formula is as follows:
where $N$ is the number of nodes, $t$ is the number of random walk steps, and $\sigma$ is the random walk termination threshold, commonly $\sigma = 10^{-4}$;
4.4) sorting the PRL scores in descending order, performing semantic clustering, and finally taking all cluster centers as the keyword extraction result of the text;
4.5) classifying the document texts according to the keywords for matching in the next step.
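The iterative voting of 4.2) and 4.3) resembles a PageRank-style computation and can be sketched as below. Because the voting and termination formulas are not reproduced in the text, the sketch assumes the update PRL(u) = (1 − d)·r_u + d·Σ PRL(v)/C(v) over the nodes v that link to u, with C(v) taken as the out-degree, and it stops when the mean absolute change falls below σ. Both forms, the function name and the sample graph are assumptions; the semantic clustering of 4.4) is not shown.

```python
def prl_scores(edges, similarity, d=0.85, sigma=1e-4, max_steps=1000):
    """Iterate PRL(u) = (1-d)*r_u + d*sum(PRL(v)/C(v) for v linking to u) until it settles."""
    nodes = sorted(similarity)
    out_degree = {n: 0 for n in nodes}
    incoming = {n: [] for n in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        incoming[dst].append(src)

    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_steps):
        updated = {
            u: (1 - d) * similarity[u]
               + d * sum(scores[v] / out_degree[v] for v in incoming[u])
            for u in nodes
        }
        # Assumed stopping rule: mean absolute change below the threshold sigma.
        change = sum(abs(updated[n] - scores[n]) for n in nodes) / len(nodes)
        scores = updated
        if change < sigma:
            break
    return scores

edges = [("invoice", "amount"), ("amount", "invoice"),
         ("tax", "invoice"), ("invoice", "tax"), ("date", "invoice")]
similarity = {"invoice": 0.4, "amount": 0.2, "tax": 0.2, "date": 0.2}
scores = prl_scores(edges, similarity)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the node that receives votes from every other node ends up at the top of the ranking, which is the behavior the keyword extraction relies on.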
In step 5, constructing the character recognition model based on professional lexicon matching correction comprises the following processes:
5.1) matching the documents classified in step 4 with the corresponding professional lexicons according to their types;
5.2) since each document is related to a professional lexicon, the similarity of words that show discrepancies is calculated for each lexicon and weighted differently. The matching formula is derived as follows:
where $P(w)$ is the probability of the word $w$ occurring in the sequence. Because a probability lies between 0 and 1, relationships in the data are hard to discern; to support statistical inference on the data, $P(w)$ is transformed, and the transformed probability score is denoted $P_F$. $G_1$ shifts the logarithm from negative to positive, ensuring that the optimized probability score is non-negative; $G_2$ ensures that the probability score lies between 0 and 1; in this formula $G_2 = G_1 + 1$;
Candidate words matched in different lexicons differ in likelihood, so a weight coefficient $\theta$ is introduced when ranking the candidate words, and the overall ranking score of a candidate word is calculated as follows:
$ES = \theta \cdot P_F \quad (5)$
The higher the similarity, the larger $\theta$. The formula for $\theta$ is as follows:
where $LD_P$ is the glyph similarity between the originally recognized word and the candidate word, $LD_w$ is the pronunciation similarity between the originally recognized word and the candidate word, and $h$ is a reward-and-penalty coefficient with $0 < h \le 1$; the higher the similarity, the closer $h$ is to 1;
5.3) the five candidate words with the highest ES values are passed to the next step.
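The ranking in 5.2) and 5.3) can be sketched as follows. Only formula (5), the relation G2 = G1 + 1 and the top-five selection come from the text. The probability transform and the weight θ are not reproduced there, so the sketch assumes P_F = (log10 P(w) + G1) / G2 and θ = h · (glyph similarity + pronunciation similarity) / 2; the value G1 = 6 and the sample candidates are likewise hypothetical.

```python
import math

def transformed_score(p, g1=6.0):
    """Assumed form of the transform: P_F = (log10(p) + G1) / G2 with G2 = G1 + 1."""
    g2 = g1 + 1.0
    return max(0.0, math.log10(p) + g1) / g2

def theta(glyph_sim, sound_sim, h):
    """Assumed weight: reward/penalty coefficient h times the mean of the two similarities."""
    return h * (glyph_sim + sound_sim) / 2.0

def top_candidates(candidates, k=5):
    """Rank candidates by ES = theta * P_F (formula (5)) and keep the k best."""
    scored = [
        (theta(c["glyph"], c["sound"], c["h"]) * transformed_score(c["p"]), c["word"])
        for c in candidates
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

candidates = [
    {"word": "aspirin",  "p": 1e-2, "glyph": 0.9, "sound": 0.9, "h": 1.0},
    {"word": "aspiring", "p": 1e-1, "glyph": 0.6, "sound": 0.7, "h": 0.5},
    {"word": "aspen",    "p": 1e-4, "glyph": 0.4, "sound": 0.3, "h": 0.3},
]
best = top_candidates(candidates)
```

In this example the candidate with the higher raw probability is still ranked second, because its lower glyph and pronunciation similarity reduce θ.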
In step 6, constructing the character recognition model based on context matching correction comprises the following processes:
6.1) performing context matching on the results of the previous step to recognize the characters, where the vector representation is calculated as follows:
$v(w_i) = \arg\max_{j} \operatorname{sim}\big(v_{context}(w_i),\, w(i,j)\big) \quad (7)$
where $w_i$ is the input word, $v_{context}(w_i)$ is its context vector, and $w(i,j)$ is the j-th multi-prototype vector representation of the word; after $v(w_i)$ is obtained, the global similarity is calculated, and the word with the highest similarity is taken as the final matching result;
6.2) replacing the original word in the text with the final matching result and outputting the final matched text.
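Formula (7) selects, for each word, the prototype vector that best matches the surrounding context. The sketch below assumes cosine similarity for `sim` and reduces the global similarity step to comparing the best prototype of each candidate against the context; the vectors and candidate words are made-up toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_prototype(context_vec, prototypes):
    """Formula (7): pick the prototype w(i, j) most similar to the context vector."""
    j = max(range(len(prototypes)), key=lambda k: cosine(context_vec, prototypes[k]))
    return j, prototypes[j]

def best_candidate(context_vec, candidate_prototypes):
    """Choose the candidate word whose selected prototype best matches the context."""
    def score(word):
        _, vec = select_prototype(context_vec, candidate_prototypes[word])
        return cosine(context_vec, vec)
    return max(candidate_prototypes, key=score)

context = [0.9, 0.1, 0.0]
candidate_prototypes = {
    "tablet":  [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # two senses of one word
    "tabloid": [[0.0, 1.0, 0.0]],
}
chosen = best_candidate(context, candidate_prototypes)
```

The chosen word would then replace the originally recognized word, as described in 6.2).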
In step 7, structuring the recognition result comprises the following processes:
7.1) aggregating the recognized document texts, unifying their encoding format, and outputting them in Word format;
7.2) reading the Word content, processing it according to preset rules, and storing it in structured form in a CSV file.
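The structuring in 7.2) can be sketched as below. The sketch starts from plain text that has already been read from the Word file, and it assumes a single hypothetical rule in which each field appears on its own line as `key: value`; the actual preset rules are not specified in the text, and the sample record is invented.

```python
import csv
import io
import re

# Hypothetical rule: a field name, a half-width or full-width colon, then the value.
FIELD_RULE = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")

def structure_text(text):
    """Turn 'key: value' lines of recognized text into one flat record."""
    record = {}
    for line in text.splitlines():
        match = FIELD_RULE.match(line)
        if match:
            record[match.group(1)] = match.group(2)
    return record

def to_csv(records):
    """Serialize records to CSV text, using the union of all keys as the header."""
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

recognized = "Patient: Zhang San\nAmount: 128.50\nDate: 2020-02-27\n"
csv_text = to_csv([structure_text(recognized)])
```

Writing `csv_text` to a file with a fixed encoding such as UTF-8 would also cover the unified encoding mentioned in 7.1).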
In the medical document identification method provided by the embodiment of the present invention, the instructions included in the program code may be used to execute the method described in the foregoing method embodiment; for the specific implementation, reference may be made to the method embodiment, and details are not repeated here.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (8)
1. A method for identifying documents based on environment multi-type word banks, characterized by comprising the following steps:
step 1: collecting electronic documents and constructing a data set; collecting professional vocabularies of various industries and constructing a professional lexicon;
step 2: preprocessing the document images in the data set;
step 3: constructing an image enhancement model for enhancing and denoising the documents;
step 4: constructing a character recognition model based on application scene matching correction;
step 5: constructing a character recognition model based on professional lexicon matching correction;
step 6: constructing a character recognition model based on context matching correction;
step 7: structuring the recognition result.
2. The method for identifying documents based on environment multi-type word banks according to claim 1, characterized in that: in step 1, when the electronic documents are collected, the collected data excludes documents with obvious quality problems, such as documents missing more than forty percent of their content or severely overexposed; when the professional vocabularies of various industries are collected, duplication of professional vocabulary across industries is avoided as far as possible, so as not to introduce redundant interference into the subsequent steps.
3. The method for identifying documents based on environment multi-type word banks according to claim 1 or 2, characterized in that: in step 2, preprocessing the documents in the data set comprises the following processes:
2.1) removing the seal from any document that contains a seal;
2.2) for the document image obtained from the previous step, using the transformation between the image space and the Hough space to map each straight line or curve of a given shape in the rectangular coordinate system of the image to a point in the Hough space, so that peaks are formed; the angle corresponding to the peak is then found and used to correct the image tilt.
4. The method for identifying documents based on environment multi-type word banks according to claim 1 or 2, characterized in that: in step 3, constructing the image enhancement model for enhancing and denoising the documents comprises the following processes:
3.1) the input image is downsampled through a series of convolutional layers to extract low-level features, where the formula for extracting the features is as follows:
where $i = 1, 2, \ldots, n$ is the index of each convolutional layer, $c$ and $c'$ are the channel indices of the convolutional layer, $\omega_i$ is the convolution kernel weight matrix, $b_i$ is the bias, and $\sigma$ is the activation function;
3.2) feeding the features extracted in the previous step into two convolutional layers for further processing to extract local features;
3.3) feeding the features extracted in 3.1) into a network consisting of two convolutional layers with stride 2 and three fully connected layers to extract global features;
3.4) fusing the local features and the global features through a point-wise affine transformation, upsampling the fused features, and applying an affine transformation to obtain the final output.
5. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in step 4, constructing a character recognition model based on application scene matching correction comprises the following processes:
4.1) extracting base network features of size W × H × C, further extracting features from them with a sliding window, and using these features to predict a plurality of candidate target regions;
4.2) after preprocessing the text, constructing a node association graph according to the general rules of the industry, voting on the different nodes, and calculating a voting value (PRL), wherein the calculation formula is as follows:
wherein $N_u$ represents the u-th node in the association graph, $r_u$ represents the similarity of $N_u$ to all nodes in the graph, $d$ denotes the damping coefficient, $e$ denotes the number of incoming edges of $N_u$, and $C(N_v)$ represents the degree of node $N_v$;
4.3) performing a random walk on the association graph according to the formula and iteratively calculating the PRL value of each node until the termination formula is satisfied, so that the node scores converge, wherein the termination formula is as follows:
wherein $N$ is the number of nodes, $t$ is the number of random-walk steps, and $\sigma$ is the random-walk termination threshold;
4.4) sorting the PRL scores in descending order, then performing semantic clustering, and finally taking all cluster centers as the keyword extraction result of the text;
4.5) classifying the different document texts according to the keywords in preparation for the matching in the next step.
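The iterate-until-convergence scheme of steps 4.2) and 4.3) resembles a damped PageRank computation. The sketch below implements standard PageRank, not the PRL formula of the claim: it omits the similarity term $r_u$, and its stopping rule (L1 change below `sigma`) is an assumed stand-in for the termination formula. The function name `pagerank_scores` and the default parameters are hypothetical.

```python
import numpy as np

def pagerank_scores(adj, d=0.85, sigma=1e-8, max_steps=1000):
    """Standard damped PageRank on a directed graph.

    adj[v, u] = 1 means an edge from node v to node u.
    Iteration stops once the L1 change between steps drops below sigma.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_steps):
        # Each node splits its score evenly over its outgoing edges.
        share = np.where(out_deg > 0, scores / np.maximum(out_deg, 1.0), 0.0)
        # Nodes without outgoing edges spread their score uniformly.
        dangling = scores[out_deg == 0].sum() / n
        new_scores = (1.0 - d) / n + d * (adj.T @ share + dangling)
        if np.abs(new_scores - scores).sum() < sigma:
            return new_scores
        scores = new_scores
    return scores
```

Sorting the returned scores in descending order gives the ranking that step 4.4) would then cluster.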
6. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in step 5, constructing the character recognition model based on professional word stock matching correction comprises the following processes:
5.1) matching the documents separated in step 4 with the corresponding professional word stocks according to their types;
5.2) because a document is associated with a professional word stock, the similarity of each word is calculated against each word stock and combined with different weights, and the matching formula is derived as follows:
wherein $P(w)$ is the probability of the word $w$ in the sequence. Because the probability values range from 0 to 1, relationships between the data are hard to discern, so $P(w)$ is transformed to allow better statistical inference, the transformed probability score being $P_F$. $G_1$ serves to turn the logarithm from negative to positive, ensuring that the optimized probability score is non-negative; $G_2$ serves to keep the probability score between 0 and 1; in this formula $G_2 = G_1 + 1$;
Candidate words matched in different word stocks have different likelihoods, so a weight coefficient $\theta$ is introduced to optimize the ranking of the candidate words; the overall ranking score of a candidate word is calculated as follows:
$$ES = \theta \cdot P_F \tag{5}$$
The higher the similarity, the larger $\theta$; $\theta$ is calculated as follows:
wherein $LD_P$ is the glyph similarity between the originally recognized word and the candidate word, $LD_w$ is the pronunciation similarity between the originally recognized word and the candidate word, and $h$ is a reward and punishment coefficient that is greater than 0 and less than or equal to 1; the higher the similarity, the closer $h$ is to 1;
5.3) the 5 candidate words with the largest ES values are passed to the next step.
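The ranking in formula (5) and the selection in step 5.3) can be sketched as follows. The transformed score $P_F$ and the weight $\theta$ are treated as precomputed inputs, since their formulas are not restated here; the `Candidate` type and the function `top_candidates` are hypothetical names.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    word: str
    p_f: float    # transformed probability score P_F (assumed precomputed)
    theta: float  # weight coefficient theta (assumed precomputed)

def ranking_score(c: Candidate) -> float:
    """Overall ranking score ES = theta * P_F, as in formula (5)."""
    return c.theta * c.p_f

def top_candidates(cands: List[Candidate], k: int = 5) -> List[Candidate]:
    """Return the k candidates with the largest ES, highest first."""
    return sorted(cands, key=ranking_score, reverse=True)[:k]
```

Keeping `k` as a parameter makes it easy to experiment with a shortlist size other than the 5 named in the claim.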
7. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in step 6, constructing the text recognition model based on context matching correction comprises the following processes:
6.1) performing context matching on the result of the previous step to identify the characters, wherein the vector representation is calculated as follows:
$$v(w_i) = \arg\max_{j} \, \mathrm{sim}\big(v_{context}(w_i),\, w(i,j)\big) \tag{7}$$
wherein $w_i$ is the input word, $v_{context}(w_i)$ is its context vector, and $w(i,j)$ denotes the multi-prototype vector representations of $w_i$; after $v(w_i)$ is obtained, the global similarity is calculated, and the word with the highest similarity is taken as the final matching result;
6.2) inserting the final matching result into the text and outputting the final matched text.
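The prototype selection in formula (7) can be sketched as below. Cosine similarity is assumed for `sim`, which the claim does not specify, and the function name `select_prototype` and the array layout are hypothetical.

```python
import numpy as np

def select_prototype(context_vec, prototypes):
    """Pick the prototype vector most similar to the context vector.

    context_vec: (D,) context vector of the word.
    prototypes:  (J, D) array, one row per prototype of the same word.
    Returns the index j of the best prototype and the prototype itself.
    """
    context_vec = np.asarray(context_vec, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    norms = np.linalg.norm(prototypes, axis=1) * np.linalg.norm(context_vec)
    sims = (prototypes @ context_vec) / norms  # cosine similarity (assumed)
    j = int(np.argmax(sims))
    return j, prototypes[j]
```

The selected vector then stands in for the word when the global similarity is computed.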
8. A method for identifying documents based on environmental multi-part word stocks according to claim 1 or 2, characterized in that: in step 7, structuring the recognition result comprises the following processes:
7.1) collecting the recognized document texts, unifying their encoding format, and outputting them in Word format;
7.2) reading the Word content, processing it according to set rules, and storing it in structured form in a CSV file.
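Step 7.2) can be illustrated with a small sketch that turns already extracted text into CSV rows. The rule used here (each `field: value` line becomes one row) is purely hypothetical, as the claim does not state the rules, and reading the Word file itself is left out; `text_to_csv` is an illustrative name.

```python
import csv
import io

def text_to_csv(text, delimiter=":"):
    """Convert 'field: value' lines into CSV text; other lines are skipped.

    The line rule is an assumed example of a 'set rule', not the patent's.
    """
    rows = []
    for line in text.splitlines():
        if delimiter in line:
            field, value = line.split(delimiter, 1)
            rows.append((field.strip(), value.strip()))
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["field", "value"])
    writer.writerows(rows)
    return buf.getvalue()
```

Writing the returned string to a file opened with an explicit encoding such as UTF-8 keeps the output format uniform, in line with step 7.1).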
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010122436.3A CN111461109B (en) | 2020-02-27 | 2020-02-27 | Method for identifying documents based on environment multi-class word stock |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111461109A (en) | 2020-07-28 |
CN111461109B (en) | 2023-09-15 |
Family
ID=71685055
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010122436.3A Active CN111461109B (en) | 2020-02-27 | 2020-02-27 | Method for identifying documents based on environment multi-class word stock |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111461109B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3139279A1 (en) * | 2015-09-01 | 2017-03-08 | Dream It Get IT Limited | Media unit retrieval and related processes |
US20180137551A1 (en) * | 2016-11-11 | 2018-05-17 | Ebay Inc. | Intelligent online personal assistant with image text localization |
CN108132927A (en) * | 2017-12-07 | 2018-06-08 | 西北师范大学 | A kind of fusion graph structure and the associated keyword extracting method of node |
CN110321925A (en) * | 2019-05-24 | 2019-10-11 | 中国工程物理研究院计算机应用研究所 | A kind of more granularity similarity comparison methods of text based on semantics fusion fingerprint |
Non-Patent Citations (4)
Title |
---|
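Yu-Jie Xiong et al.: "Chinese Writer Identification Using Contour-Directional Feature and Character Pair Similarity Measurement" * |
李琴 et al.: "Research and Application of Character Recognition and Correction Technology for Medical Text Images" * |
邵文良: "Research and Implementation of Key Technologies for Image and Text Recognition of Medical Documents Based on Deep Learning" * |
郭鸿奇 et al.: "A Sentence Similarity Calculation Method Based on Multi-Prototype Word Vector Representation" * |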
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112633389A (en) * | 2020-12-28 | 2021-04-09 | 西北工业大学 | Method for calculating trend of hurricane motion track based on MDL and speed direction |
CN112633389B (en) * | 2020-12-28 | 2024-01-23 | 西北工业大学 | Hurricane movement track trend calculation method based on MDL and speed direction |
Also Published As
Publication number | Publication date |
---|---|
CN111461109B (en) | 2023-09-15 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
RU2757713C1 (en) | Handwriting recognition using neural networks | |
WO2007080642A1 (en) | Sheet slip processing program and sheet slip program device | |
CN113033183B (en) | Network new word discovery method and system based on statistics and similarity | |
CN111461108A (en) | Medical document identification method | |
Gordo et al. | Document classification and page stream segmentation for digital mailroom applications | |
Boillet et al. | Robust text line detection in historical documents: learning and evaluation methods | |
CN114818718A (en) | Contract text recognition method and device | |
US8340428B2 (en) | Unsupervised writer style adaptation for handwritten word spotting | |
JPH11328317A (en) | Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded | |
Al Ghamdi | A novel approach to printed Arabic optical character recognition | |
CN111461109B (en) | Method for identifying documents based on environment multi-class word stock | |
Chandra et al. | Optical character recognition-A review | |
Reisi et al. | Authorship attribution in historical and literary texts by a deep learning classifier | |
Vitadhani et al. | Detection of clickbait thumbnails on YouTube using tesseract-OCR, face recognition, and text alteration | |
CN113762160A (en) | Date extraction method and device, computer equipment and storage medium | |
US20140093173A1 (en) | Classifying a string formed from hand-written characters | |
Desai et al. | A Survey On Automatic Subjective Answer Evaluation | |
CN113051886A (en) | Test question duplicate checking method and device, storage medium and equipment | |
Oprean et al. | Handwritten word recognition using Web resources and recurrent neural networks | |
CN111611379A (en) | Text information classification method, device, equipment and readable storage medium | |
Nisa et al. | Annotation of struck-out text in handwritten documents | |
Maarouf et al. | Correcting optical character recognition result via a novel approach | |
US20230419367A1 (en) | Apparatus and method for communicating with users | |
US20230343122A1 (en) | Performing optical character recognition based on fuzzy pattern search generated using image transformation | |
Nordin Hällgren | Reading Key Figures from Annual Reports |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |