CN110288052A - Character recognition method, apparatus, device and computer-readable medium

Info

Publication number: CN110288052A
Application number: CN201910697687.1A
Authority: CN
Inventors: 张晴晴; 徐冉; 段由; 杨金富; 罗磊; 马光谦; 汪洋
Original assignee: BEIJING WISDOM TECHNOLOGY Co Ltd
Current assignee: BEIJING WISDOM TECHNOLOGY Co Ltd
Priority date: 2019-03-27
Filing date: 2019-07-30
Publication date: 2019-09-27
Also published as: CN109919253A

Abstract

This application relates to a character recognition method, apparatus, device and computer-readable medium. The method includes: obtaining a scanned file of a target document and performing image processing on the scanned file; and performing character recognition on the target image obtained by the image processing using optical character recognition (OCR) technology to obtain a first recognized text, wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image. By using the R1_PCA dimensionality reduction technique for feature dimensionality reduction in OCR text recognition, that is, by combining R1_PCA with OCR technology, the application can reduce the interference of noise and thereby improve the accuracy of OCR when the character features contain noise.

Description

Character recognition method, apparatus, device and computer-readable medium

Technical field

This application relates to the field of computer technology, and in particular to a character recognition method, apparatus, device and computer-readable medium.

Background

As interest in artificial intelligence grows, the field of image recognition is attracting increasing attention. Optical character recognition (OCR) refers to the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method.

However, the dimensionality reduction methods traditionally used when recognizing characters with OCR, such as PCA and LDA, all use the square of the L2 norm as the distance metric of the loss function. When the features contain noise, PCA and LDA are not robust: because the objective function is a sum of squared errors (L2 norm), these algorithms amplify outliers, and even a small amount of abnormal data may cause the estimated subspace to deviate substantially and fail to reflect the true situation. In other words, they are sensitive to outliers (noise) in the samples.

Summary of the invention

To solve the above technical problem, or at least partially solve it, this application provides a character recognition method, apparatus, device and computer-readable medium.

In a first aspect, this application provides a character recognition method, comprising:

obtaining a scanned file of a target document, and performing image processing on the scanned file;

performing character recognition on the target image obtained by the image processing using optical character recognition (OCR) technology to obtain a first recognized text, wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.

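Optionally, the R1_PCA uses the first power of the R1 norm as the distance metric of the loss function:

$$\min_{U}\ \|X - UV\|_{R_1} = \sum_{j=1}^{n} \|x_j - UU^{T}x_j\|_2 \quad \text{s.t. } U^{T}U = I$$

where X ∈ R^{m×n} denotes the extracted character feature matrix with columns x_1, …, x_n, U ∈ R^{m×d} denotes the projection axes, and V = U^T X denotes the character feature matrix after dimensionality reduction.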

Optionally, the method further comprises:

obtaining a PDF file of the target document;

recognizing a second recognized text in the PDF file;

comparing the second recognized text with the first recognized text to determine the differing characters between the first recognized text and the second recognized text.

Optionally, the method further comprises:

marking the differing characters in the first recognized text and the second recognized text;

and/or replacing the differing characters in the second recognized text with the differing characters in the first recognized text;

and/or replacing the differing characters in the first recognized text with the differing characters in the second recognized text.

In a second aspect, this application further provides a character recognition apparatus, comprising:

a first obtaining module, configured to obtain a scanned file of a target document and to perform image processing on the scanned file;

a first recognition module, configured to perform character recognition on the target image obtained by the image processing using optical character recognition (OCR) technology to obtain a first recognized text, wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.

Optionally, the apparatus further comprises:

a second obtaining module, configured to obtain a PDF file of the target document;

a second recognition module, configured to recognize a second recognized text in the PDF file;

a comparison module, configured to compare the second recognized text with the first recognized text and to determine the differing characters between the first recognized text and the second recognized text.

Optionally, the apparatus further comprises:

a marking module, configured to mark the differing characters in the first recognized text and the second recognized text;

and/or a first replacement module, configured to replace the differing characters in the second recognized text with the differing characters in the first recognized text;

and/or a second replacement module, configured to replace the differing characters in the first recognized text with the differing characters in the second recognized text.

In a third aspect, this application further provides a character recognition device, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor implements the steps of the method of the first aspect when executing the computer program.

In a fourth aspect, this application further provides a computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of the first aspect.

Compared with the prior art, the technical solutions provided by the embodiments of this application have the following advantages:

By using the R1_PCA dimensionality reduction technique for feature dimensionality reduction in OCR text recognition, that is, by combining R1_PCA with OCR technology, this application can reduce the interference of noise and thereby improve the accuracy of OCR when the character features contain noise.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.

To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.

Fig. 1 is a flowchart of a character recognition method provided by an embodiment of this application;

Fig. 2 is a schematic diagram comparing the effects of dimensionality reduction, provided by an embodiment of this application;

Fig. 3 is a schematic diagram of marked differing characters provided by an embodiment of this application;

Fig. 4 is a structural diagram of a character recognition apparatus provided by an embodiment of this application.

Detailed description

To make the purposes, technical solutions and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.

The dimensionality reduction methods used when recognizing characters with OCR, such as PCA and LDA, all use the square of the L2 norm as the distance metric of the loss function. When the features contain noise, PCA and LDA are not robust: because the objective function is a sum of squared errors (L2 norm), these algorithms amplify outliers, and even a small amount of abnormal data may cause the estimated subspace to deviate substantially and fail to reflect the true situation. In other words, they are sensitive to outliers (noise) in the samples. To this end, an embodiment of this application provides a character recognition method. As shown in Fig. 1, the method may include the following steps:

Step S101: obtain a scanned file of a target document, and perform image processing on the scanned file;

In the embodiments of this application, the target document may be, for example, a paper document such as a contract, and the scanned file of the target document is the file obtained by scanning the target document.

In practice, image processing means preprocessing the original image before the text is recognized, to facilitate subsequent feature extraction and learning. This process generally includes sub-steps such as grayscale conversion, binarization, noise reduction, skew correction, and character segmentation.

Grayscale conversion: in the RGB model, a color with R = G = B represents a shade of gray, and the common value of R, G and B is called the gray value. Each pixel of a grayscale image therefore needs only one byte to store its gray value (also called intensity or brightness), and the gray range is 0-255. Put simply, grayscale conversion turns a color image into a black-and-white one. Color images are generally converted to grayscale using one of four methods: the component method, the maximum value method, the average method, and the weighted average method.
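The four grayscale methods named above can be sketched as follows. This is a minimal illustration rather than the implementation used in the application: the function names, the `channel` parameter, and the BT.601 luma weights used for the weighted average are assumptions, since the source does not specify them.

```python
def to_gray(pixel, method="weighted", channel=0):
    """Convert one (R, G, B) pixel, each 0-255, to a gray value in 0-255."""
    r, g, b = pixel
    if method == "component":
        # Component method: keep a single channel (which one is an assumption).
        return pixel[channel]
    if method == "max":
        # Maximum value method: brightest of the three channels.
        return max(r, g, b)
    if method == "mean":
        # Average method: integer mean of the three channels.
        return (r + g + b) // 3
    if method == "weighted":
        # Weighted average; ITU-R BT.601 weights are an assumed choice.
        return int(round(0.299 * r + 0.587 * g + 0.114 * b))
    raise ValueError("unknown method: " + method)


def gray_image(rgb_rows, method="weighted"):
    """Apply to_gray to every pixel of an image stored as rows of RGB tuples."""
    return [[to_gray(p, method) for p in row] for row in rgb_rows]
```

For example, `gray_image([[(10, 20, 30)]], "mean")` yields a one-pixel image with gray value 20.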

Binarization: an image contains target objects, background and noise. To extract the target objects directly from a multi-valued digital image, the most common approach is to set a threshold T and use it to divide the image data into two parts: the group of pixels greater than T and the group of pixels less than T. This is the most specialized form of grayscale transformation and is called binarization of the image. A binarized image contains no gray, only pure white and pure black. The most important part of binarization is the choice of threshold, which is generally either fixed or adaptive. Commonly used binarization methods include the bimodal method, the P-parameter method, the iterative method and the OTSU method.
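As an illustration of threshold selection, the sketch below picks T with the OTSU method (maximizing the between-class variance of the two pixel groups) and then binarizes an image. The function names and the convention that pixels greater than T become white (255) are assumptions for this example; the source does not fix them.

```python
def otsu_threshold(gray_values):
    """Return the threshold T (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(gray_rows, threshold):
    """Map pixels greater than the threshold to 255 and all others to 0."""
    return [[255 if v > threshold else 0 for v in row] for row in gray_rows]
```

A fixed threshold corresponds to calling `binarize` with a constant, while an adaptive scheme would compute a threshold such as `otsu_threshold` per image or per region.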

Image noise reduction: in practice, digital images are often affected during digitization and transmission by noise from the imaging device and the external environment; such images are called noisy images. The process of reducing noise in a digital image is called image noise reduction. Noise in an image has many sources, arising in image acquisition, transmission, compression and other stages. The types of noise also differ, for example salt-and-pepper noise and Gaussian noise, and different noise types call for different processing algorithms. In the image obtained in the previous step, many scattered small dots can be seen. These are noise in the image, and they can greatly interfere with the program's segmentation and recognition of the image, so noise reduction is needed. Noise reduction is very important at this stage, and the quality of the noise reduction algorithm has a large effect on feature extraction.
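The source does not name a specific noise reduction algorithm. As one common choice for salt-and-pepper noise, a 3×3 median filter can be sketched as follows; the function name and the handling of border pixels (using only the neighbors that exist) are assumptions of this example.

```python
def median_filter3(img):
    """Apply a 3x3 median filter to a grayscale image given as a list of rows."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            # Collect the neighborhood, clipped at the image border.
            window = [
                img[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            window.sort()
            out[y][x] = window[len(window) // 2]
    return out
```

An isolated dark dot on a white background is removed because the median of its neighborhood is still white, whereas a mean filter would only smear it.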

Skew correction: a user cannot hold the camera perfectly level when photographing a document, so the image needs to be rotated programmatically to find the position that is most likely level; the segmented image then has the best chance of giving good results. The most common method of skew correction is the Hough transform. The principle is to dilate the image so that broken text merges into straight lines, which makes line detection easier. After the angle of the lines is computed, a rotation algorithm can be used to correct the skewed image to a horizontal position.
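The angle-finding part of the Hough transform can be sketched as below. This is a simplified illustration under several assumptions: the input is a non-empty list of foreground pixel coordinates (for example, from the dilated text lines), only integer skew angles within ±`max_skew` degrees are tried, and the function name is hypothetical. The sign of the returned angle depends on whether the y axis of the image points up or down.

```python
import math


def estimate_skew_degrees(points, max_skew=15):
    """Estimate the skew of a near-horizontal line with a Hough-style vote.

    For each candidate angle, every point votes for the quantized distance
    rho of the line through it; the angle whose best rho bin collects the
    most votes is returned.
    """
    best_angle, best_votes = 0, -1
    for angle in range(-max_skew, max_skew + 1):
        # A line skewed by `angle` has its normal at 90 + angle degrees.
        theta = math.radians(90 + angle)
        c, s = math.cos(theta), math.sin(theta)
        votes = {}
        for x, y in points:
            rho = round(x * c + y * s)
            votes[rho] = votes.get(rho, 0) + 1
        top = max(votes.values())
        if top > best_votes:
            best_angle, best_votes = angle, top
    return best_angle
```

Once the angle is known, rotating the image by the opposite angle brings the text lines back to horizontal.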

Step S102: perform character recognition on the target image obtained by the image processing using optical character recognition (OCR) technology to obtain a first recognized text;

In the embodiments of this application, feature extraction and dimensionality reduction may be performed first during character recognition. Features are the key information used to identify text: each character can be distinguished from other characters by its features. For digits and English letters, feature extraction is relatively easy, since there are only 10 + 26 × 2 = 62 characters in total, which is a small character set. For Chinese characters, feature extraction is harder. First, Chinese characters form a large character set. Second, the most commonly used first-level characters in the national standard alone number 3,755. Finally, Chinese characters have complex structures and many similar-looking characters, so the feature dimensionality is larger. After deciding which features to use, feature dimensionality reduction may also be needed. If the feature dimensionality is too high, the efficiency of the classifier suffers greatly, so dimensionality reduction is often performed to improve recognition speed. This step is also important: it must reduce the feature dimensionality while ensuring that the reduced feature vectors still retain enough information to distinguish different characters.

Next, character feature extraction and dimensionality reduction are performed in combination with the R1_PCA dimensionality reduction technique.

In the embodiments of this application, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.

The R1_PCA uses the first power of the R1 norm as the distance metric of the loss function:

$$\min_{U}\ \|X - UV\|_{R_1} = \sum_{j=1}^{n} \|x_j - UU^{T}x_j\|_2 \quad \text{s.t. } U^{T}U = I$$

Because R1_PCA uses the first power of the R1 norm as the distance metric of the loss function, the influence of noise can be reduced. The R1_PCA algorithm first defines a rotationally invariant L1 norm, namely the R1 norm, which combines the insensitivity to outliers (robustness) of the L1 norm with the rotational invariance of the L2 norm.

Therefore, when the extracted features contain noise, using the R1_PCA algorithm for dimensionality reduction reduces the influence of noise, better preserves the information in the samples, and gives higher accuracy in terms of reconstruction error.

As shown in Fig. 2, suppose the samples are reduced from two dimensions to one. The black points in the figure are sample points and the red points are outliers, i.e. noise points. W1 is the projection axis obtained by both PCA and R1_PCA when there are no noise points, and W2 and W3 are the projection axes obtained by R1_PCA and PCA, respectively, when noise is present. Because PCA uses the square of the L2 norm as its distance metric, it is strongly affected by noise points. R1_PCA uses the R1 norm as its distance metric, so it is less affected, and it has rotational invariance.

When R1_PCA is used to reduce the dimensionality of the extracted character features, it is robust and the solution obtained remains rotationally invariant. Let the extracted character feature matrix be X = (x_ij) ∈ R^{m×n}. The R1 norm of X is defined as:

$$\|X\|_{R_1} = \sum_{j=1}^{n} \left( \sum_{i=1}^{m} x_{ij}^{2} \right)^{1/2}$$

Let X = (x₁, x₂, …, x_n), where x_j is the j-th column of X; then $\|X\|_{R_1} = \sum_{j=1}^{n} \|x_j\|_2$. Let ξ = (||x₁||₂, ||x₂||₂, …, ||x_n||₂); then $\|\xi\|_1 = \sum_{j=1}^{n} \|x_j\|_2$. Therefore

$$\|X\|_{R_1} = \|\xi\|_1$$

where X = (x_ij) ∈ R^{m×n} denotes the extracted character feature matrix.
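A small numerical check can make the definition concrete. The sketch below computes the R1 norm as the sum of the L2 norms of the columns of X and compares it with the entrywise L1 norm under a rotation of the feature space. The function names are illustrative, and the example is restricted to two-row matrices so that a plain 2D rotation can be used.

```python
import math


def r1_norm(X):
    """Sum of the L2 norms of the columns of X (X is a list of m rows)."""
    m, n = len(X), len(X[0])
    return sum(math.sqrt(sum(X[i][j] ** 2 for i in range(m))) for j in range(n))


def l1_norm(X):
    """Entrywise L1 norm: sum of absolute values of all entries."""
    return sum(abs(v) for row in X for v in row)


def rotate2d(X, angle):
    """Rotate every column of a 2 x n matrix by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    n = len(X[0])
    return [
        [c * X[0][j] - s * X[1][j] for j in range(n)],
        [s * X[0][j] + c * X[1][j] for j in range(n)],
    ]


X = [[3.0, 0.0],
     [4.0, 1.0]]       # columns (3, 4) and (0, 1)
R = rotate2d(X, 0.7)   # same samples in a rotated coordinate system
```

Here `r1_norm(X)` is 5 + 1 = 6 and is unchanged by the rotation up to rounding, while `l1_norm` changes from 8 to a different value, which is the rotational invariance the text attributes to the R1 norm.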

The R1 norm can be verified to satisfy the three properties of a norm as follows:

Suppose A and B are two arbitrary matrices in R^{m×n} and α ∈ R. By the definition of the R1 norm:

1. $\|A\|_{R_1} \ge 0$, and $\|A\|_{R_1} = 0$ if and only if $A = 0$;

2. $\|\alpha A\|_{R_1} = |\alpha|\,\|A\|_{R_1}$;

3. $\|A + B\|_{R_1} \le \|A\|_{R_1} + \|B\|_{R_1}$.

It follows that the R1 norm satisfies the basic properties of a norm, so the R1 norm is indeed a norm.
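The third property (the triangle inequality) is the least obvious of the three. A short derivation, written here under the assumption that the R1 norm is the sum of the L2 norms of the columns and with a_j and b_j denoting the j-th columns of A and B, is:

```latex
\begin{aligned}
\|A + B\|_{R_1}
  &= \sum_{j=1}^{n} \|a_j + b_j\|_2 \\
  &\le \sum_{j=1}^{n} \left( \|a_j\|_2 + \|b_j\|_2 \right)
     && \text{(triangle inequality of the L2 norm, column by column)} \\
  &= \|A\|_{R_1} + \|B\|_{R_1}.
\end{aligned}
```

The first two properties follow in the same column-by-column way from the corresponding properties of the L2 norm.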

R1-PCA is defined as follows:

$$\min_{U}\ \|X - UU^{T}X\|_{R_1} = \sum_{j=1}^{n} \|x_j - UU^{T}x_j\|_2 \quad \text{s.t. } U^{T}U = I \qquad (1)$$

where X denotes the extracted character feature matrix, with columns x_j, and U denotes the projection axes.

It can be seen that the objective function is rotationally invariant. At the same time, the robustness advantage of L1-PCA is retained, because the summation in formula (1) measures the projection error in the manner of an L1 norm, namely with the R1 norm.

Construct the Lagrangian function to solve (1):

$$L(U, \Lambda) = \sum_{j=1}^{n} \|x_j - UU^{T}x_j\|_2 + \mathrm{Tr}\left(\Lambda\,(U^{T}U - I)\right)$$

where U denotes the projection axes and Λ denotes the Lagrange multiplier matrix.

Since the matrix U^T U is symmetric, Λ ∈ R^{d×d} is a symmetric matrix.

Taking the partial derivative of the Lagrangian L(U, Λ) with respect to U and setting it to 0 shows that the optimal solution of (1) satisfies the KKT equation (with constant factors absorbed into Λ):

$$C_w U = U\Lambda$$

where C_w is the weighted covariance matrix of R1-PCA:

$$C_w = \sum_{j=1}^{n} w_j\, x_j x_j^{T}, \qquad w_j = \frac{1}{\|x_j - UU^{T}x_j\|_2}$$

It follows that formula (1) reaches its minimum if and only if U is any orthonormal basis of Span(ξ₁, …, ξ_d), where ξ_i is the orthonormal eigenvector of C_w corresponding to its i-th largest eigenvalue.

The weighted covariance matrix C_w depends on U, which differs from L1-PCA, and (1) has no closed-form solution. Direct eigenvalue decomposition of C_w is therefore not feasible. Chris Ding proposed an orthogonal iteration algorithm, which proceeds as follows:

1) Initialize U ∈ R^{m×d} and compute the residuals $s_j = \|x_j - UU^{T}x_j\|_2$;

2) Update U by the following formulas, with C_w recomputed from the current residuals:

$$M = C_w U, \qquad U = \mathrm{orth}(M)$$

3) Repeat step 2) until convergence, at which point the iteration ends.
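The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: samples are passed as a list of the columns of X and are assumed to be centered, U is stored as a list of d axis vectors, Gram-Schmidt is used for the orthogonalization step, a fixed iteration count replaces a convergence test, and a small `eps` floor on the residuals (not mentioned in the source) avoids division by zero. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def residual(x, U):
    """s = ||x - U U^T x||_2 for one sample x and orthonormal axes U."""
    coeffs = [dot(u, x) for u in U]
    recon = [sum(c * u[k] for c, u in zip(coeffs, U)) for k in range(len(x))]
    return math.sqrt(sum((xk - rk) ** 2 for xk, rk in zip(x, recon)))


def orthonormalize(vectors):
    """Gram-Schmidt; assumes the vectors are linearly independent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(b, w)
            w = [wk - c * bk for wk, bk in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        basis.append([wk / norm for wk in w])
    return basis


def r1_pca(samples, U0, iterations=100, eps=1e-12):
    """Orthogonal iteration: M = C_w U, then U = orth(M), repeated."""
    U = orthonormalize(U0)
    for _ in range(iterations):
        # Step 1 / start of step 2: residuals and weights w_j = 1 / s_j.
        weights = [1.0 / max(residual(x, U), eps) for x in samples]
        # M = C_w U, computed column by column without forming C_w.
        M = []
        for u in U:
            col = [0.0] * len(u)
            for w, x in zip(weights, samples):
                c = w * dot(x, u)
                col = [ck + c * xk for ck, xk in zip(col, x)]
            M.append(col)
        U = orthonormalize(M)
    return U


def r1_objective(samples, U):
    """Value of objective (1): sum of residual norms."""
    return sum(residual(x, U) for x in samples)


# Six inliers close to the horizontal axis plus two outliers on the vertical axis.
samples = [(-3, 0.1), (-2, -0.1), (-1, 0.1), (1, -0.1), (2, 0.1), (3, -0.1),
           (0, 4), (0, -4)]
U = r1_pca(samples, [[1.0, 0.0]])
```

In this toy data set the vertical variance (about 32) exceeds the horizontal variance (28), so ordinary PCA would be pulled toward the outliers, while the R1 objective is lower along the horizontal axis. The iteration is local, however, so the result depends on the initial axes; here it is started from the horizontal axis and stays near it.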

The above summarizes why R1_PCA, which is robust and reduces interference from outliers, is a reasonable choice as the dimensionality reduction method for character features, and how R1_PCA is solved.

After character feature extraction and dimensionality reduction with R1_PCA, the classifier is designed and trained. For a character image, features are extracted and passed to the classifier, which classifies them and reports which character the features should be recognized as. Designing the classifier is the task at hand. Common classifier design methods include template matching, discriminant functions, neural network classification, and rule-based inference; they are not described in detail here. Before actual recognition, the classifier usually also has to be trained, which is a supervised learning process. There are many mature classifiers, such as SVM and CNN.

Finally, post-processing is performed, that is, the classification results of the classifier are optimized, which generally involves natural language understanding. The first task is handling similar-looking characters. For example, the character for "divide" closely resembles another character, but when the word "score" is encountered it should not be recognized as the look-alike string, because only "score" is a normal word. A language model is needed to make this correction. The second task is handling page layout. For example, some books are typeset in two columns, left and right; text on the same line in the left and right columns does not belong to the same sentence and has no grammatical connection. If the page is segmented by lines, the end of a left-column line is joined to the beginning of the right-column line, which is undesirable, so such cases require special handling.
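The correction of similar-looking characters can be illustrated with a very small stand-in for a language model: a word list plus a table of look-alike characters. Everything in this sketch is hypothetical (the function name, the lexicon, and the confusion table), and Latin look-alikes are used instead of Chinese characters purely for readability. A real system would use a statistical or neural language model rather than exact lexicon lookup.

```python
def correct_word(word, lexicon, confusions):
    """Return `word` if it is in the lexicon; otherwise try replacing one
    character with each of its look-alikes and return the first candidate
    that is a known word. Fall back to the original word."""
    if word in lexicon:
        return word
    for i, ch in enumerate(word):
        for alt in confusions.get(ch, ()):
            candidate = word[:i] + alt + word[i + 1:]
            if candidate in lexicon:
                return candidate
    return word


lexicon = {"score", "class"}
confusions = {"0": ["o"], "1": ["l"], "5": ["s"]}
```

With these tables, `correct_word("sc0re", lexicon, confusions)` returns `"score"`, while a string with no plausible correction is left unchanged.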

In another embodiment of this application, the method further comprises:

obtaining a PDF file of the target document;

recognizing a second recognized text in the PDF file;

In another embodiment of this application, after the differing characters between the first recognized text and the second recognized text are determined, the method further comprises:

marking the differing characters in the first recognized text and the second recognized text (see Fig. 3);
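Determining and marking the differing characters between two recognized texts can be sketched with the standard-library `difflib` module. The source does not say how the comparison is performed, so the use of `difflib.SequenceMatcher`, the function names, and the square-bracket marking are assumptions of this example.

```python
import difflib


def difference_characters(first_text, second_text):
    """Return (tag, segment_in_first, segment_in_second) for every span
    where the two texts differ."""
    matcher = difflib.SequenceMatcher(None, first_text, second_text,
                                      autojunk=False)
    return [
        (tag, first_text[i1:i2], second_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def mark_differences(first_text, second_text):
    """Return the first text with its differing spans wrapped in [ ]."""
    matcher = difflib.SequenceMatcher(None, first_text, second_text,
                                      autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        segment = first_text[i1:i2]
        parts.append(segment if tag == "equal" else "[" + segment + "]")
    return "".join(parts)
```

For instance, comparing an OCR result `"total 1O0 yuan"` (with a letter O) against PDF text `"total 100 yuan"` reports one replaced character, and `mark_differences` highlights it as `"total 1[O]0 yuan"`. Replacing the differing characters of one text with those of the other then amounts to substituting the corresponding segments.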

In another embodiment of this application, as shown in Fig. 4, a character recognition apparatus is also provided, comprising:

a first obtaining module 11, configured to obtain a scanned file of a target document and to perform image processing on the scanned file;

a first recognition module 12, configured to perform character recognition on the target image obtained by the image processing using optical character recognition (OCR) technology to obtain a first recognized text, wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.

In another embodiment of this application, the R1_PCA uses the first power of the R1 norm as the distance metric of the loss function:

$$\min_{U}\ \|X - UV\|_{R_1} = \sum_{j=1}^{n} \|x_j - UU^{T}x_j\|_2 \quad \text{s.t. } U^{T}U = I$$

In another embodiment of this application, the apparatus further comprises:

In another embodiment of this application, a character recognition device is also provided, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor implements the steps of the method described in the above method embodiments when executing the computer program.

In another embodiment of this application, a computer-readable medium having non-volatile program code executable by a processor is also provided, wherein the program code causes the processor to perform the method described in the foregoing method embodiments.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

The above are only specific embodiments of the invention, provided to enable those skilled in the art to understand or implement the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A character recognition method, characterized by comprising:

performing character recognition on the target image obtained by image processing using optical character recognition (OCR) technology to obtain a first recognized text, wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.

2. The character recognition method according to claim 1, characterized in that the R1_PCA uses the first power of the R1 norm as the distance metric of the loss function:

$$\min_{U}\ \|X - UV\|_{R_1} = \sum_{j=1}^{n} \|x_j - UU^{T}x_j\|_2 \quad \text{s.t. } U^{T}U = I$$

where X ∈ R^{m×n} denotes the extracted character feature matrix with columns x_1, …, x_n, U ∈ R^{m×d} denotes the projection axes, and V = U^T X denotes the character feature matrix after dimensionality reduction.

3. The character recognition method according to claim 1, characterized in that the method further comprises:

obtaining a PDF file of the target document;

recognizing a second recognized text in the PDF file;

comparing the second recognized text with the first recognized text to determine the differing characters between the first recognized text and the second recognized text.

4. The character recognition method according to claim 3, characterized in that the method further comprises:

and/or replacing the differing characters in the second recognized text with the differing characters in the first recognized text;

and/or replacing the differing characters in the first recognized text with the differing characters in the second recognized text.

5. A character recognition apparatus, characterized by comprising:

a first recognition module, configured to perform character recognition on the target image obtained by image processing using optical character recognition (OCR) technology to obtain a first recognized text, wherein, when character recognition is performed using the OCR technology, R1_PCA is used to reduce the dimensionality of the character features in the target image.

6. The character recognition apparatus according to claim 5, characterized in that the R1_PCA uses the first power of the R1 norm as the distance metric of the loss function:

$$\min_{U}\ \|X - UV\|_{R_1} = \sum_{j=1}^{n} \|x_j - UU^{T}x_j\|_2 \quad \text{s.t. } U^{T}U = I$$

7. The character recognition apparatus according to claim 5, characterized in that the apparatus further comprises:

a comparison module, configured to compare the second recognized text with the first recognized text and to determine the differing characters between the first recognized text and the second recognized text.

8. The character recognition apparatus according to claim 7, characterized in that the apparatus further comprises:

and/or a first replacement module, configured to replace the differing characters in the second recognized text with the differing characters in the first recognized text;

and/or a second replacement module, configured to replace the differing characters in the first recognized text with the differing characters in the second recognized text.

9. A character recognition device, comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 4 when executing the computer program.

10. A computer-readable medium having non-volatile program code executable by a processor, characterized in that the program code causes the processor to perform the method of any one of claims 1 to 4.