CN110288052A - Character identifying method, device, equipment and computer-readable medium - Google Patents
Character identifying method, device, equipment and computer-readable medium
- Publication number
- CN110288052A CN201910697687.1A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Character Discrimination (AREA)
Abstract
This application relates to a character recognition method, device, equipment and computer-readable medium. The method includes: obtaining a scanned file of a target document and performing image processing on the scanned file; and performing character recognition, using optical character recognition (OCR) technology, on the target image obtained by the image processing to obtain a first recognized text; wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image. By using the R1_PCA dimensionality reduction technique for feature dimensionality reduction in OCR text recognition, that is, by combining R1_PCA with OCR, the application reduces the interference of noise when the character features contain noise and thereby improves the accuracy of OCR.
Description
Technical field
This application relates to the field of computer technology, and in particular to a character recognition method, device, equipment and computer-readable medium.
Background
As interest in artificial intelligence grows, the field of image recognition is attracting increasing attention. Optical character recognition (OCR) refers to the process in which an electronic device (such as a scanner or digital camera) examines the characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method.
However, the dimensionality reduction methods traditionally used when OCR is applied to character recognition, such as PCA and LDA, all use the square of the L2 norm as the distance metric of the loss function. When the features contain noise, PCA and LDA are not robust: because the objective function is a sum of squared errors (L2 norm), these algorithms amplify outliers, so even a small amount of abnormal data can make the estimated subspace deviate substantially and fail to reflect the true situation. In other words, they are sensitive to outliers (noise) in the samples.
Summary of the invention
To solve the above technical problem, or at least partially solve it, this application provides a character recognition method, device, equipment and computer-readable medium.
In a first aspect, this application provides a character recognition method, comprising:
obtaining a scanned file of a target document, and performing image processing on the scanned file;
performing character recognition, using optical character recognition (OCR) technology, on the target image obtained by the image processing to obtain a first recognized text; wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.
Optionally, R1_PCA uses the first power of the R1 norm as the distance metric of its loss function:
$$\min_{U}\ \|X - UV\|_{R1} \quad \text{s.t. } U^{T}U = I$$
where $X \in \mathbb{R}^{m \times n}$ denotes the character feature extraction matrix, $U \in \mathbb{R}^{m \times d}$ denotes the projection axes, and $V = U^{T}X$ denotes the character feature matrix after dimensionality reduction.
Optionally, the method further comprises:
obtaining a PDF file of the target document;
recognizing a second recognized text in the PDF file;
comparing the second recognized text with the first recognized text to determine the differing characters between the first recognized text and the second recognized text.
Optionally, the method further comprises:
marking the differing characters in the first recognized text and the second recognized text;
and/or replacing the differing characters in the second recognized text with the differing characters in the first recognized text;
and/or replacing the differing characters in the first recognized text with the differing characters in the second recognized text.
In a second aspect, this application further provides a character recognition device, comprising:
a first obtaining module, configured to obtain a scanned file of a target document and perform image processing on the scanned file;
a first recognition module, configured to perform character recognition, using optical character recognition (OCR) technology, on the target image obtained by the image processing to obtain a first recognized text; wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.
Optionally, R1_PCA uses the first power of the R1 norm as the distance metric of its loss function:
$$\min_{U}\ \|X - UV\|_{R1} \quad \text{s.t. } U^{T}U = I$$
where $X \in \mathbb{R}^{m \times n}$ denotes the character feature extraction matrix, $U \in \mathbb{R}^{m \times d}$ denotes the projection axes, and $V = U^{T}X$ denotes the character feature matrix after dimensionality reduction.
Optionally, the device further comprises:
a second obtaining module, configured to obtain a PDF file of the target document;
a second recognition module, configured to recognize a second recognized text in the PDF file;
a comparison module, configured to compare the second recognized text with the first recognized text and determine the differing characters between the first recognized text and the second recognized text.
Optionally, the device further comprises:
a marking module, configured to mark the differing characters in the first recognized text and the second recognized text;
and/or a first replacement module, configured to replace the differing characters in the second recognized text with the differing characters in the first recognized text;
and/or a second replacement module, configured to replace the differing characters in the first recognized text with the differing characters in the second recognized text.
In a third aspect, this application further provides character recognition equipment comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method of the first aspect when executing the computer program.
In a fourth aspect, this application further provides a computer-readable medium storing non-volatile program code executable by a processor, the program code causing the processor to perform the method of the first aspect.
Compared with the prior art, the technical solutions provided by the embodiments of this application have the following advantages:
By using the R1_PCA dimensionality reduction technique for feature dimensionality reduction in OCR text recognition, that is, by combining R1_PCA with OCR, the application reduces the interference of noise when the character features contain noise and thereby improves the accuracy of OCR.
Brief description of the drawings
The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. Evidently, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a character recognition method provided by an embodiment of this application;
Fig. 2 is a schematic diagram comparing the effects of dimensionality reduction, provided by an embodiment of this application;
Fig. 3 is a schematic diagram of differing-character marking provided by an embodiment of this application;
Fig. 4 is a structural diagram of a character recognition device provided by an embodiment of this application.
Detailed description
To make the purposes, technical solutions and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
The dimensionality reduction methods used when OCR is applied to character recognition, such as PCA and LDA, use the square of the L2 norm as the distance metric of the loss function. When the features contain noise, PCA and LDA are not robust: because the objective function is a sum of squared errors (L2 norm), these algorithms amplify outliers, so even a small amount of abnormal data can make the estimated subspace deviate substantially and fail to reflect the true situation. In other words, they are sensitive to outliers (noise) in the samples. For this reason, an embodiment of this application provides a character recognition method. As shown in Fig. 1, the method may include the following steps:
Step S101: obtain a scanned file of a target document, and perform image processing on the scanned file.
In the embodiments of this application, the target document may be, for example, a paper document such as a contract, and the scanned file of the target document is the file obtained by scanning the target document.
In practice, image processing means preprocessing the original image before text is recognized, to facilitate subsequent feature extraction and learning. This process generally includes sub-steps such as grayscale conversion, binarization, noise reduction, skew correction and character segmentation.
Grayscale conversion: in the RGB model, a color with R = G = B is a shade of gray, and the common value of R, G and B is the gray value. Each pixel of a grayscale image therefore needs only one byte to store its gray value (also called intensity or brightness), which ranges from 0 to 255. Put simply, grayscale conversion turns a color image into a black-and-white picture. Color images are generally converted to grayscale by one of four methods: the component method, the maximum method, the mean method and the weighted mean method.
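As an illustration only, the four grayscale methods named above can be sketched in Python as follows. The function names are hypothetical, the choice of the green channel for the component method is arbitrary, and the weights 0.299, 0.587 and 0.114 are a commonly used luma weighting assumed here; the application itself does not specify any of them.

```python
def to_gray(pixel, method="weighted"):
    """Convert one (R, G, B) pixel with 0-255 channels to a gray value."""
    r, g, b = pixel
    if method == "component":
        return g                       # component method: keep one channel (green, by assumption)
    if method == "max":
        return max(r, g, b)            # maximum method
    if method == "mean":
        return round((r + g + b) / 3)  # mean method
    if method == "weighted":
        # weighted mean method; the weights are an assumption, not taken from the text
        return round(0.299 * r + 0.587 * g + 0.114 * b)
    raise ValueError(f"unknown method: {method}")


def image_to_gray(image, method="weighted"):
    """Apply to_gray to every pixel of an image stored as rows of (R, G, B) tuples."""
    return [[to_gray(p, method) for p in row] for row in image]
```

Whichever method is chosen, each output pixel fits in one byte, which matches the 0 to 255 gray range described above.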
Binarization: an image contains target objects, background and noise. To extract the target objects directly from a multi-valued digital image, the most common approach is to set a threshold T and use it to split the image data into two groups: pixels greater than T and pixels less than T. This special case of grayscale transformation is called image binarization. A binarized image contains no gray, only pure white and pure black. The most important part of binarization is the choice of threshold, which is generally either fixed or adaptive. Commonly used binarization methods include the bimodal method, the P-parameter method, the iterative method and Otsu's method (OTSU).
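A minimal sketch of one of the listed methods, Otsu's method, is given below under stated assumptions: the image is a flat list of integer gray values from 0 to 255, pixels equal to T fall into the dark group, and the function names are hypothetical. It picks the threshold that maximizes the between-class variance of the two pixel groups.

```python
def otsu_threshold(pixels):
    """Return the threshold T (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue                  # one group is empty, variance undefined
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

This is an adaptive threshold in the sense used above, because T is derived from the histogram of each image rather than fixed in advance.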
Image noise reduction: real digital images are often affected by noise from the imaging equipment and the external environment during digitization and transmission; such images are called noisy images. The process of reducing the noise in a digital image is called image noise reduction. Noise in an image has many sources, arising in image acquisition, transmission, compression and other stages. The types of noise also differ, for example salt-and-pepper noise and Gaussian noise, and different noise calls for different processing algorithms. In the image obtained in the previous step, many scattered small specks can be seen. These are the noise in the image, and they can greatly interfere with the program's segmentation and recognition of the picture, so noise reduction is needed. Noise reduction is very important at this stage, because the quality of the noise reduction algorithm strongly affects feature extraction.
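The application does not name a specific denoising algorithm. As one hedged example, a 3x3 median filter is a common choice for the salt-and-pepper noise mentioned above; the sketch below assumes the image is a list of rows of gray values and uses a hypothetical function name.

```python
def median_filter_3x3(img):
    """Replace each pixel by the median of its 3x3 neighbourhood (clipped at the borders)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            window.sort()
            out[y][x] = window[len(window) // 2]
    return out


# a white 5x5 image with one isolated black speck in the middle
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
clean = median_filter_3x3(noisy)
```

An isolated speck is outvoted by its neighbours, so it disappears, while larger connected strokes of a character are mostly preserved. Gaussian noise would usually call for a different filter.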
Skew correction: a user cannot hold the camera perfectly level when taking a photo, so the program needs to rotate the image to find the orientation most likely to be horizontal, so that the segmented characters give the best result. The most common skew correction method is the Hough transform. Its principle is to dilate the picture so that the broken-up text connects into straight lines, which makes line detection easier. Once the angle of the lines has been calculated, a rotation algorithm can be used to correct the skewed picture to the horizontal position.
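The angle-estimation part of this step can be sketched with a small Hough-style vote, shown below as an illustration rather than the application's implementation. It assumes the dilation has already been done and the foreground pixels are given as (x, y) points with the y axis pointing up; the function name, the search range of 15 degrees and the 0.5 degree step are all assumptions.

```python
import math


def estimate_skew_degrees(points, angle_range=15.0, step=0.5):
    """Return the inclination (in degrees) of the line that collects the most points."""
    best_angle, best_votes = 0.0, -1
    n_steps = int(2 * angle_range / step) + 1
    for k in range(n_steps):
        angle = -angle_range + k * step
        theta = math.radians(angle + 90.0)      # normal direction of a line inclined by `angle`
        c, s = math.cos(theta), math.sin(theta)
        votes = {}
        for x, y in points:
            rho = round(x * c + y * s)          # distance bin of the line through (x, y)
            votes[rho] = votes.get(rho, 0) + 1
        top = max(votes.values())
        if top > best_votes:
            best_votes, best_angle = top, angle
    return best_angle


# a synthetic text baseline tilted by 5 degrees
line = [(x, x * math.tan(math.radians(5.0))) for x in range(200)]
skew = estimate_skew_degrees(line)
```

Rotating the image by the negative of the estimated angle would then bring the text back to horizontal; the rotation itself is omitted from this sketch.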
Step S102: perform character recognition, using optical character recognition (OCR) technology, on the target image obtained by the image processing to obtain a first recognized text.
In the embodiments of this application, character recognition first performs feature extraction and dimensionality reduction. Features are the key information used to recognize text: each character can be distinguished from other characters by its features. For digits and English letters, feature extraction is relatively easy, because there are only 10 + 26 × 2 = 62 characters in total and they form a small character set. For Chinese characters, feature extraction is more difficult: first, Chinese characters form a large character set; second, the most commonly used level-1 Chinese characters in the national standard alone number 3,755; finally, Chinese characters have complex structures and many look-alike characters, so the feature dimension is larger. After deciding which features to use, feature dimensionality reduction may also be needed. If the feature dimension is too high, the efficiency of the classifier suffers greatly, so dimensionality reduction is often performed to improve recognition speed. This step is also important: it must reduce the feature dimension while ensuring that the reduced feature vectors retain enough information to distinguish different characters.
Next, character feature extraction and dimensionality reduction are performed in combination with the R1_PCA dimensionality reduction technique.
In the embodiments of this application, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.
R1_PCA uses the first power of the R1 norm as the distance metric of its loss function:
$$\min_{U}\ \|X - UV\|_{R1} \quad \text{s.t. } U^{T}U = I$$
where $X \in \mathbb{R}^{m \times n}$ denotes the character feature extraction matrix, $U \in \mathbb{R}^{m \times d}$ denotes the projection axes, and $V = U^{T}X$ denotes the character feature matrix after dimensionality reduction.
By using the first power of the R1 norm as the distance metric of the loss function, R1_PCA reduces the influence of noise. The R1_PCA algorithm first defines a rotation-invariant L1 norm, namely the R1 norm, which combines the insensitivity to outliers (robustness) of the L1 norm with the rotational invariance of the L2 norm.
Therefore, when the extracted features contain noise, reducing dimensionality with the R1_PCA algorithm reduces the influence of the noise, better retains the information in the samples, and gives higher accuracy in terms of reconstruction error.
As shown in Fig. 2, suppose the samples are reduced from two dimensions to one. The black points in the figure are sample points and the red points are outliers, i.e. noise points. W1 is the projection axis obtained by both PCA and R1_PCA when there are no noise points; W2 and W3 are the projection axes obtained by R1_PCA and PCA, respectively, when noise is present. Because PCA uses the square of the L2 norm as its distance metric, it is strongly affected by the noise points. R1_PCA uses the R1 norm as its distance metric, so it is less affected, and it is rotation invariant.
When the R1_PCA algorithm is used to reduce the dimensionality of the extracted character features, R1_PCA is robust and the solution it obtains remains rotation invariant. Let the character feature extraction matrix be $X = (x_{ij}) \in \mathbb{R}^{m \times n}$. The R1 norm of the matrix X is then:
$$\|X\|_{R1} = \sum_{j=1}^{n}\Big(\sum_{i=1}^{m} x_{ij}^{2}\Big)^{1/2}$$
Let $X = (x_1, x_2, \ldots, x_n)$; then $\|X\|_{R1} = \sum_{j=1}^{n}\|x_j\|_2$. Let $\xi = (\|x_1\|_2, \|x_2\|_2, \ldots, \|x_n\|_2)$; then $\|\xi\|_1 = \sum_{j=1}^{n}\|x_j\|_2$. Therefore
$$\|X\|_{R1} = \|\xi\|_1$$
where $X = (x_{ij}) \in \mathbb{R}^{m \times n}$ denotes the character feature extraction matrix.
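The R1 norm satisfies the three defining properties of a norm, as verified below.
Let A and B be two arbitrary matrices of the same size and let $\alpha$ be an arbitrary scalar. From the definition of the R1 norm:
1. $\|A\|_{R1} \ge 0$, and $\|A\|_{R1} = 0$ if and only if $A = 0$;
2. $\|\alpha A\|_{R1} = |\alpha|\,\|A\|_{R1}$;
3. $\|A + B\|_{R1} \le \|A\|_{R1} + \|B\|_{R1}$.
It follows that the R1 norm satisfies the basic properties of a norm, and the R1 norm is therefore a norm.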
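The R1 norm and its rotational invariance can be checked numerically with a short sketch. The code below assumes a matrix stored as a list of rows whose columns are the samples, and restricts the rotation check to the 2-D case; the function names are hypothetical.

```python
import math


def r1_norm(X):
    """R1 norm: sum over the columns (samples) of the Euclidean norm of each column."""
    rows, cols = len(X), len(X[0])
    return sum(math.sqrt(sum(X[i][j] ** 2 for i in range(rows))) for j in range(cols))


def l1_norm(X):
    """Entry-wise L1 norm: sum of the absolute values of all entries."""
    return sum(abs(v) for row in X for v in row)


def rotate_2d(X, degrees):
    """Left-multiply a 2 x n matrix by a 2-D rotation matrix."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    top = [c * a - s * b for a, b in zip(X[0], X[1])]
    bottom = [s * a + c * b for a, b in zip(X[0], X[1])]
    return [top, bottom]


X = [[3.0, 0.0],
     [4.0, 1.0]]                   # columns (3, 4) and (0, 1)
R = rotate_2d(X, 30)
print(r1_norm(X), r1_norm(R))      # the two R1 norms agree up to rounding
print(l1_norm(X), l1_norm(R))      # the two L1 norms differ
```

The R1 norm is unchanged by the rotation because it is built from column-wise Euclidean norms, whereas the entry-wise L1 norm depends on the coordinate system.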
R1-PCA is defined as follows:
$$\min_{U}\ \|X - UU^{T}X\|_{R1} = \sum_{i=1}^{n}\|x_i - UU^{T}x_i\|_2 \quad \text{s.t. } U^{T}U = I \qquad (1)$$
where X denotes the character feature extraction matrix, $x_i$ denotes its i-th column, and U denotes the projection axes.
It can be seen that the objective function is rotation invariant. At the same time, the robustness advantage of L1-PCA is retained, because the summation in formula (1) computes the R1 norm of the projection error in the manner of an L1 norm.
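Construct the Lagrangian function to solve (1):
$$L(U, \Lambda) = \sum_{i=1}^{n}\|x_i - UU^{T}x_i\|_2 + \mathrm{Tr}\,\Lambda\,(U^{T}U - I)$$
where U denotes the projection axes and $\Lambda$ denotes the Lagrange multiplier matrix.
Because the matrix $U^{T}U$ is symmetric, $\Lambda \in \mathbb{R}^{d \times d}$ is a symmetric matrix.
Taking the partial derivative of the Lagrangian $L(U, \Lambda)$ with respect to U and setting it to 0, the optimal solution of (1) satisfies the KKT equation (with constant factors absorbed into $\Lambda$):
$$C_w U = U\Lambda$$
where $C_w$ is the weighted covariance matrix of R1-PCA:
$$C_w = \sum_{i=1}^{n} w_i\, x_i x_i^{T}, \qquad w_i = \frac{1}{\|x_i - UU^{T}x_i\|_2}$$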
It follows that formula (1) attains its minimum if and only if U is any orthonormal basis of $\mathrm{Span}(\xi_1, \ldots, \xi_d)$, where $\xi_i$ is the orthonormal eigenvector corresponding to the i-th largest eigenvalue of $C_w$.
The covariance matrix depends on U, so, unlike for L1-PCA, (1) has no closed-form solution, and performing an eigenvalue decomposition of $C_w$ directly is infeasible. Chris Ding proposed an orthogonal iteration algorithm. The detailed procedure is:
1) Initialize $U \in \mathbb{R}^{m \times d}$, and compute the residuals $r_i = \|x_i - UU^{T}x_i\|_2$;
2) Update U by the following formulas:
$M = C_w U$, $U = \mathrm{orth}(M)$, where $\mathrm{orth}(\cdot)$ orthonormalizes the columns of M;
3) Repeat step 2) until convergence, at which point the iteration ends.
The above summarizes the rationale for using R1_PCA, which is robust and reduces interference from outliers, as the character feature dimensionality reduction method, together with the method for solving R1_PCA.
After character feature extraction and dimensionality reduction combined with the R1_PCA technique, the classifier is designed and trained. For a character image, features are extracted and passed to the classifier, which classifies them and reports which character the features should be recognized as. Designing the classifier is the task at this stage. Classifier design methods generally include template matching, discriminant functions, neural network classification and rule-based inference, which are not described in detail here. Before actual recognition, the classifier usually also needs to be trained, which is a supervised learning process. There are many mature classifiers, such as SVM and CNN.
Finally, post-processing is performed, that is, the classification results of the classifier are optimized, which generally involves natural language understanding. The first task is handling look-alike characters. For example, two characters may be close in shape, but when the word "score" is encountered, it should not be recognized as a similar-looking string that is not a word, because only "score" is a normal word. This needs to be corrected with a language model. The second task is handling the page layout. For example, some books are set in two columns, left and right, and the left and right columns on the same line do not belong to the same sentence and have no grammatical connection. If the text is split by line, the end of a left-column line is joined to the beginning of the right-column line, which is undesirable; such cases need special handling.
By using the R1_PCA dimensionality reduction technique for feature dimensionality reduction in OCR text recognition, that is, by combining R1_PCA with OCR, the application reduces the interference of noise when the character features contain noise and thereby improves the accuracy of OCR.
In another embodiment of this application, the method further comprises:
obtaining a PDF file of the target document;
recognizing a second recognized text in the PDF file;
comparing the second recognized text with the first recognized text to determine the differing characters between the first recognized text and the second recognized text.
In another embodiment of this application, after the differing characters between the first recognized text and the second recognized text are determined, the method further comprises:
marking the differing characters in the first recognized text and the second recognized text (see Fig. 3);
and/or replacing the differing characters in the second recognized text with the differing characters in the first recognized text;
and/or replacing the differing characters in the first recognized text with the differing characters in the second recognized text.
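The comparison and marking steps can be sketched with Python's standard difflib module. The application does not say how the two texts are aligned, so the use of SequenceMatcher, the bracket markers and the function names below are assumptions, and the two sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def find_differences(first_text, second_text):
    """Return (tag, first_span, second_span) for every stretch where the texts differ."""
    matcher = SequenceMatcher(None, first_text, second_text, autojunk=False)
    return [(tag, (i1, i2), (j1, j2))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def mark_differences(first_text, second_text, left="[", right="]"):
    """Return both texts with the differing characters wrapped in markers."""
    matcher = SequenceMatcher(None, first_text, second_text, autojunk=False)
    out_first, out_second = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a, b = first_text[i1:i2], second_text[j1:j2]
        if tag == "equal":
            out_first.append(a)
            out_second.append(b)
        else:
            if a:
                out_first.append(left + a + right)
            if b:
                out_second.append(left + b + right)
    return "".join(out_first), "".join(out_second)


ocr_text = "Total amount: 1O0 yuan"    # hypothetical first recognized text (from the scan)
pdf_text = "Total amount: 100 yuan"    # hypothetical second recognized text (from the PDF)
print(find_differences(ocr_text, pdf_text))
print(mark_differences(ocr_text, pdf_text))
```

The spans returned by find_differences are also enough to carry out either replacement option, by copying the characters of one text over the corresponding span of the other.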
In another embodiment of this application, as shown in Fig. 4, a character recognition device is further provided, comprising:
a first obtaining module 11, configured to obtain a scanned file of a target document and perform image processing on the scanned file;
a first recognition module 12, configured to perform character recognition, using optical character recognition (OCR) technology, on the target image obtained by the image processing to obtain a first recognized text; wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.
In another embodiment of this application, R1_PCA uses the first power of the R1 norm as the distance metric of its loss function:
$$\min_{U}\ \|X - UV\|_{R1} \quad \text{s.t. } U^{T}U = I$$
where $X \in \mathbb{R}^{m \times n}$ denotes the character feature extraction matrix, $U \in \mathbb{R}^{m \times d}$ denotes the projection axes, and $V = U^{T}X$ denotes the character feature matrix after dimensionality reduction.
In another embodiment of this application, the device further comprises:
a second obtaining module, configured to obtain a PDF file of the target document;
a second recognition module, configured to recognize a second recognized text in the PDF file;
a comparison module, configured to compare the second recognized text with the first recognized text and determine the differing characters between the first recognized text and the second recognized text.
In another embodiment of this application, the device further comprises:
a marking module, configured to mark the differing characters in the first recognized text and the second recognized text;
and/or a first replacement module, configured to replace the differing characters in the second recognized text with the differing characters in the first recognized text;
and/or a second replacement module, configured to replace the differing characters in the first recognized text with the differing characters in the second recognized text.
In another embodiment of this application, character recognition equipment is further provided, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method described in the above method embodiments when executing the computer program.
In another embodiment of this application, a computer-readable medium storing non-volatile program code executable by a processor is further provided, the program code causing the processor to perform the method described in the foregoing method embodiments.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
The above are only specific embodiments of the invention, provided to enable those skilled in the art to understand or implement the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A character recognition method, characterized by comprising:
obtaining a scanned file of a target document, and performing image processing on the scanned file;
performing character recognition, using optical character recognition (OCR) technology, on the target image obtained by the image processing to obtain a first recognized text; wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.
2. The character recognition method according to claim 1, characterized in that the R1_PCA uses the first power of the R1 norm as the distance metric of the loss function:
wherein $X \in \mathbb{R}^{m \times n}$ denotes the character feature extraction matrix, $U \in \mathbb{R}^{m \times d}$ denotes the projection axes, and $V = U^{T}X$ denotes the text feature matrix after dimensionality reduction.
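For reference, the R1-norm objective as it is usually stated following Ding et al. (listed under the non-patent citations) can be written with the symbols of claim 2. This is a sketch based on that paper's formulation; the orthonormality constraint on $U$ and the exact form used in the claim are assumptions.

```latex
\min_{U}\; \lVert X - U V \rVert_{R_1}
  = \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{m} \bigl( x_{ji} - (U V)_{ji} \bigr)^{2} \Bigr)^{1/2},
\qquad V = U^{\mathsf T} X, \quad U^{\mathsf T} U = I_d
```

Each column of $X$ contributes its Euclidean residual norm to the first power rather than squared, which is why a single badly recognized sample influences the projection less than it would in ordinary PCA.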
3. The character recognition method according to claim 1, characterized in that the method further comprises:
obtaining a PDF document of the target file;
identifying a second identification text in the PDF document;
comparing the second identification text with the first identification text, and determining the difference characters between the first identification text and the second identification text.
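The comparison step of claim 3 could be sketched as a character-level diff. The following is a minimal sketch, assuming Python's standard `difflib` as the comparison technique (the claims do not name one); the function name `difference_characters` and the sample strings in the usage note are hypothetical.

```python
from difflib import SequenceMatcher


def difference_characters(first_text: str, second_text: str):
    """List the runs of characters on which the two identification texts disagree.

    Each entry is ((start, end) in first_text, (start, end) in second_text,
    characters from first_text, characters from second_text).
    """
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, first_text, second_text, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete" or "insert"
            differences.append(
                ((i1, i2), (j1, j2), first_text[i1:i2], second_text[j1:j2])
            )
    return differences
```

For instance, comparing an OCR result such as `"Invoice No. 1O23"` with a PDF text layer reading `"Invoice No. 1023"` reports a single difference at position 13, where the letter `O` stands against the digit `0`.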
4. The character recognition method according to claim 3, characterized in that the method further comprises:
marking the difference characters in the first identification text and the second identification text;
and/or replacing the difference characters in the second identification text with the difference characters in the first identification text;
and/or replacing the difference characters in the first identification text with the difference characters in the second identification text.
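The marking and replacement options of claim 4 could look roughly as follows. This is a hedged sketch, again assuming `difflib` for alignment; the bracket markers, the function names and the `should_replace` filter are illustrative choices that do not come from the claims.

```python
from difflib import SequenceMatcher


def mark_differences(first_text, second_text, open_mark="[", close_mark="]"):
    """Return copies of both texts with every differing run wrapped in markers."""
    matcher = SequenceMatcher(None, first_text, second_text, autojunk=False)
    marked_first, marked_second = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a, b = first_text[i1:i2], second_text[j1:j2]
        if tag == "equal":
            marked_first.append(a)
            marked_second.append(b)
        else:
            marked_first.append(f"{open_mark}{a}{close_mark}")
            marked_second.append(f"{open_mark}{b}{close_mark}")
    return "".join(marked_first), "".join(marked_second)


def replace_differences(target_text, source_text, should_replace=lambda old, new: True):
    """Overwrite differing runs of target_text with the characters of source_text.

    should_replace(old, new) is a hypothetical filter; with the default,
    every difference is replaced and the result equals source_text.
    """
    matcher = SequenceMatcher(None, target_text, source_text, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old, new = target_text[i1:i2], source_text[j1:j2]
        pieces.append(new if tag != "equal" and should_replace(old, new) else old)
    return "".join(pieces)
```

Passing the first identification text as `target_text` and the second as `source_text` corresponds to the last option of the claim; swapping the arguments gives the other direction. The filter is there because replacing every difference unconditionally just reproduces the source text.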
5. A character recognition device, characterized by comprising:
a first obtaining module, configured to obtain a scanned file of a target file and to perform image processing on the scanned file;
a first identification module, configured to perform character recognition, using optical character recognition (OCR) technology, on the target image obtained by the image processing, to obtain a first identification text; wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.
6. The character recognition device according to claim 5, characterized in that the R1_PCA uses the first power of the R1 norm as the distance metric of the loss function:
wherein $X \in \mathbb{R}^{m \times n}$ denotes the character feature extraction matrix, $U \in \mathbb{R}^{m \times d}$ denotes the projection axes, and $V = U^{T}X$ denotes the text feature matrix after dimensionality reduction.
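One way to picture the dimensionality reduction named in claims 2 and 6 is an iteratively reweighted PCA. The sketch below assumes NumPy, assumes that the feature vectors are stored as the columns of an already centered `X`, and uses a simplified reweighting scheme; it is not necessarily the procedure used by the applicant or the exact algorithm of Ding et al. The names `r1_norm` and `r1_pca` are hypothetical.

```python
import numpy as np


def r1_norm(residual: np.ndarray) -> float:
    """R1 norm: sum over columns (samples) of each column's Euclidean norm."""
    return float(np.linalg.norm(residual, axis=0).sum())


def r1_pca(X: np.ndarray, d: int, n_iter: int = 50, eps: float = 1e-8):
    """Approximate R1-PCA by iteratively reweighted eigen-decomposition.

    X is m x n (one feature vector per column).
    Returns U (m x d, orthonormal columns) and V = U^T X (d x n).
    """
    # Start from ordinary PCA directions (top-d left singular vectors).
    U = np.linalg.svd(X, full_matrices=False)[0][:, :d]
    for _ in range(n_iter):
        residual = X - U @ (U.T @ X)
        # Samples with large residuals (likely outliers) receive small weights.
        w = 1.0 / np.maximum(np.linalg.norm(residual, axis=0), eps)
        C = (X * w) @ X.T                 # weighted covariance, m x m
        _, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
        U_new = eigvecs[:, ::-1][:, :d]   # keep the d leading eigenvectors
        # Compare projectors, since eigenvector signs are arbitrary.
        converged = np.linalg.norm(U_new @ U_new.T - U @ U.T) < 1e-10
        U = U_new
        if converged:
            break
    return U, U.T @ X
```

Calling `U, V = r1_pca(X, d)` returns the projection axes and the reduced feature matrix in the notation of the claim; `r1_norm(X - U @ V)` then gives the value of the loss, which can be compared against the same quantity for ordinary PCA.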
7. The character recognition device according to claim 5, characterized in that the device further comprises:
a second obtaining module, configured to obtain a PDF document of the target file;
a second identification module, configured to identify a second identification text in the PDF document;
a comparison module, configured to compare the second identification text with the first identification text and to determine the difference characters between the first identification text and the second identification text.
8. The character recognition device according to claim 7, characterized in that the device further comprises:
a marking module, configured to mark the difference characters in the first identification text and the second identification text;
and/or a first replacement module, configured to replace the difference characters in the second identification text with the difference characters in the first identification text;
and/or a second replacement module, configured to replace the difference characters in the first identification text with the difference characters in the second identification text.
9. A character recognition device, comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4.
10. A computer-readable medium having non-volatile program code executable by a processor, characterized in that the program code causes the processor to execute the method according to any one of claims 1 to 4.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910236134.6A CN109919253A (en) | 2019-03-27 | 2019-03-27 | Character identifying method, device, equipment and computer-readable medium |
CN2019102361346 | 2019-03-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110288052A (en) | 2019-09-27 |
Family
ID=66967035
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910236134.6A Withdrawn CN109919253A (en) | 2019-03-27 | 2019-03-27 | Character identifying method, device, equipment and computer-readable medium |
CN201910697687.1A Pending CN110288052A (en) | 2019-03-27 | 2019-07-30 | Character identifying method, device, equipment and computer-readable medium |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910236134.6A Withdrawn CN109919253A (en) | 2019-03-27 | 2019-03-27 | Character identifying method, device, equipment and computer-readable medium |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN109919253A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113111728A (en) * | 2021-03-22 | 2021-07-13 | 广西电网有限责任公司电力科学研究院 | Intelligent identification method and system for power production operation risk in transformer substation |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111985486A (en) * | 2020-08-31 | 2020-11-24 | 平安医疗健康管理股份有限公司 | Image information identification method and device, storage medium and computer equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751565A (en) * | 2008-12-10 | 2010-06-23 | 中国科学院自动化研究所 | Method for character identification through fusing binary image and gray level image |
CN105260727A (en) * | 2015-11-12 | 2016-01-20 | 武汉大学 | Academic-literature semantic restructuring method based on image processing and sequence labeling |
CN105335689A (en) * | 2014-08-06 | 2016-02-17 | 阿里巴巴集团控股有限公司 | Character recognition method and apparatus |
CN105550524A (en) * | 2013-07-17 | 2016-05-04 | 中国中医科学院 | Novel clinical case data collection system and collection method |
US20180101726A1 (en) * | 2016-10-10 | 2018-04-12 | Insurance Services Office Inc. | Systems and Methods for Optical Character Recognition for Low-Resolution Documents |
CN108288078A (en) * | 2017-12-07 | 2018-07-17 | 腾讯科技(深圳)有限公司 | Character identifying method, device and medium in a kind of image |
CN108573707A (en) * | 2017-12-27 | 2018-09-25 | 北京金山云网络技术有限公司 | A kind of processing method of voice recognition result, device, equipment and medium |
Non-Patent Citations (1)
Title |
---|
Ding C. et al.: "R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization", International Conference on Machine Learning * |
Also Published As
Publication number | Publication date |
---|---|
CN109919253A (en) | 2019-06-21 |
Similar Documents
Publication | Title |
---|---|
US8644616B2 (en) | Character recognition |
Cheriet et al. | Character recognition systems: a guide for students and practitioners |
Chou et al. | A binarization method with learning-built rules for document images produced by cameras |
US8744196B2 (en) | Automatic recognition of images |
He et al. | Beyond OCR: Multi-faceted understanding of handwritten document characteristics |
CN112329779B (en) | Method and related device for improving certificate identification accuracy based on mask |
US11790675B2 (en) | Recognition of handwritten text via neural networks |
CN112613502A (en) | Character recognition method and device, storage medium and computer equipment |
Dash et al. | A hybrid feature and discriminant classifier for high accuracy handwritten Odia numeral recognition |
Cao et al. | Preprocessing of low-quality handwritten documents using Markov random fields |
Sampath et al. | Handwritten optical character recognition by hybrid neural network training algorithm |
CN110288052A (en) | Character identifying method, device, equipment and computer-readable medium |
Mishra et al. | Unsupervised refinement of color and stroke features for text binarization |
CN110533049B (en) | Method and device for extracting seal image |
Verma et al. | Script identification in natural scene images: a dataset and texture-feature based performance evaluation |
Aravinda et al. | Template matching method for Kannada handwritten recognition based on correlation analysis |
Choudhary et al. | A neural approach to cursive handwritten character recognition using features extracted from binarization technique |
Aggarwal et al. | Text retrieval from scanned forms using optical character recognition |
Shen et al. | Finding text in natural scenes by figure-ground segmentation |
Narasimhaiah et al. | Degraded character recognition from old Kannada documents. |
Sharma et al. | Pattern storage & recalling using Hopfield neural network and HOG feature based SVM classifier: An experiment with handwritten Odia numerals |
Shine et al. | An approach for improving Optical Character Recognition using Contrast enhancement technique |
Aggarwal et al. | Content directed enhancement of degraded document images |
Jacob | Optical Character Recognition system with Projection Profile based segmentation and Deep Learning Techniques |
Kavallieratou et al. | Document image processing |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 411, 4th floor, building 4, No.44, Middle North Third Ring Road, Haidian District, Beijing 100088; Applicant after: Beijing Qingshu Intelligent Technology Co.,Ltd.; Address before: 100044 1415, 14th floor, building 1, yard 59, gaoliangqiaoxie street, Haidian District, Beijing; Applicant before: BEIJING AISHU WISDOM TECHNOLOGY CO.,LTD. |