CN102722736A - Method for splitting and identifying character strings at complex interference - Google Patents
Method for splitting and identifying character strings at complex interference
- Publication number
- CN102722736A CN102722736A CN2012101932466A CN201210193246A CN102722736A CN 102722736 A CN102722736 A CN 102722736A CN 2012101932466 A CN2012101932466 A CN 2012101932466A CN 201210193246 A CN201210193246 A CN 201210193246A CN 102722736 A CN102722736 A CN 102722736A
- Authority
- CN
- China
- Prior art keywords
- bag
- character
- svm classifier
- characteristic
- classifier device
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method for segmenting and recognizing character strings under complex interference, characterized by a learning phase and a recognition phase. In the learning phase, an image containing m characters is cut into m pictures to form bags for multiple-instance learning; pictures of the same character form one class, and the bags are classified and stored in a library. The integral image of each bag is computed and its haar-like features are extracted as the bag's instances; the key instances of each class are found with the diverse density algorithm, and finally the key instances are learned using the classification capability of an SVM (Support Vector Machine). In the recognition phase, the learning result is used to predict the class of a new bag and thereby recognize the character string. The method automatically recognizes character strings under complex interference with high recognition speed and efficiency.
Description
Technical field
The present invention relates to the field of image processing, and specifically to a technique for segmenting and recognizing character strings under complex interference.
Background art
Optical Character Recognition (OCR) has, after years of development, made great progress and is now widely applied to handwriting input, automatic license-plate recognition, automatic scanning and recognition of text, and other fields. However, existing OCR techniques still struggle to robustly segment and recognize character strings under complex interference. For precisely this reason, character strings subjected to deliberate interference are commonly used on networks as verification codes, to determine whether an operation was performed by a human or automatically by a computer.
At present, string recognition methods fall into two broad categories. The first is based on Euclidean-space distance, e.g. template matching, PCA, 2D-PCA, and Hu invariant moments. These methods are simple and easy to implement and recognize regularly shaped characters well, but their recognition performance on strings under complex interference is very poor.
Character strings under complex interference generally have the following characteristics:
(1) each character class appears in multiple fonts, deliberately distorted or rotated by some angle;
(2) characters stick together and are hard to separate;
(3) the interference is not clearly distinguishable from the features of the characters themselves.
Although increasing the number of learning templates improves the recognition rate to some extent, it adds excessive time overhead.
The second category comprises recognition methods based on supervised machine learning, e.g. neural networks, SVM, and the AdaBoost algorithm. These methods can learn the features of samples automatically and offer high recognition rates and fast recognition; however, they require unambiguous samples, and on noisy samples they cannot achieve good learning results, so their recognition rate on strings under complex interference is low.
To obtain interference-free character samples, both categories of methods require interference to be removed manually and a large number of samples to be produced by hand, which is both time-consuming and laborious.
Summary of the invention
To avoid the above shortcomings of the prior art, the present invention proposes a method for segmenting and recognizing character strings under complex interference that obtains samples automatically, guarantees the recognition rate without increasing time overhead, and achieves good learning results and recognition rates even on noisy samples.
To solve its technical problem, the present invention adopts the following technical scheme:
The method for segmenting and recognizing character strings under complex interference of the present invention is characterized in that it proceeds as follows:
I. Learning phase: a multiple-instance machine learning method is used to learn character strings under complex interference as follows;
Cut an image containing m noisy characters into m sub-pictures, each containing one and only one complete character; take the m sub-pictures as m bags for multiple-instance learning and store each of the m bags in the library. Storing means that pictures of the same character are placed in the same folder as one class, yielding n folders, one per class, where said n is not greater than m;
If said image is not a grayscale image, first convert each bag in the library to grayscale and then compute the bag's integral image by formula (1); if said image is a grayscale image, compute the integral image ii of the bag directly by formula (1):

ii(x, y) = Σ_{i≤x, j≤y} img(i, j)   (1)

In formula (1), ii(x, y) denotes the sum of all pixels of the image whose abscissa i ≤ x and ordinate j ≤ y;
Apply haar-like feature prototypes to the integral image of said bag to extract haar-like features as the bag's instances; each instance of said bag is represented by a vector, each component of which is the feature value extracted by the corresponding haar-like feature prototype;
Step 3: use the diverse density algorithm to find, for each class in the library, the u instances with the largest diverse density among the class's bags, as that class's key instances;
Step 4: take the key instances of each class as SVM classifier samples; train n SVM classifiers according to the number of classes n, and arrange said n SVM classifiers into a binary decision tree, each SVM classifier being a node of said binary decision tree. Each said SVM classifier is obtained as follows:
Take the key instances of one class in the library as the positive samples for SVM classifier learning; arbitrarily choose u key instances from the key instances of all other classes in the library as the negative samples for SVM classifier learning; train on said positive and negative samples with the SVM algorithm to obtain the SVM classifier f(t) characterized by formula (2):

f(t) = sgn(<W*, t> + b*)   (2)

In formula (2), b* is a preset threshold, t is a sample to be classified, and W* is the weight vector;
II. Recognition phase: the SVM classifiers f(t) obtained by said multiple-instance machine learning method are used to recognize character strings under complex interference;
Take the smallest character size as the initial rectangular scanning feature window and scan each cut picture from left to right, then from top to bottom. Compute the rectangular feature values inside the scanning window to obtain the feature vector T; substitute said T into each node of said binary decision tree in top-down order and compute f(T) by formula (3):

f(T) = sgn(<W*, T> + b*)   (3)

When f(T) > 0, classification ends and the recognition result is output. If f(T) is still less than 0 after T has been substituted into the last node of said binary decision tree, enlarge said initial scanning window by a fixed factor and substitute T into the nodes from the top of the tree again; when f(T) > 0, classification ends and the recognition result is output. If said initial scanning window has been enlarged beyond the largest character size and still no f(T) > 0 is found, recognition has failed.
Compared with the prior art, the beneficial effects of the present invention are:
1. The present invention uses multiple-instance learning to obtain samples automatically from string pictures under complex interference, with no need to produce samples manually, improving work efficiency.
2. Multiple-instance learning yields good learning results on complex strings containing noise and interference, achieving a high recognition rate without increasing recognition time as the number of learning samples grows.
3. The integral-image computation extracts the haar-like features of an image quickly.
4. Haar-like features are suitable for feature extraction from strings under any complex interference, giving the method good generality.
5. Multi-scale scanning gives better robustness and recognition performance on strings whose character size varies.
Description of drawings
Fig. 1 is a schematic diagram of learning-sample acquisition in the present invention;
Fig. 2 is a schematic diagram of the method of constructing multiple SVM classifiers in the present invention;
Fig. 3 is a schematic diagram of the string recognition process of the present invention.
Embodiment
The segmentation and recognition of character strings under complex interference in this embodiment proceeds as follows:
I. Learning phase: a multiple-instance machine learning method is used to learn character strings under complex interference as follows;
Automatically cut an image containing m noisy characters into m sub-pictures, each containing one and only one complete character. Each sub-picture is one bag of the multiple-instance learning, so the m sub-pictures constitute the learning samples, and each bag is stored in the library. Storing means that pictures of the same character are placed in the same folder as one class, yielding n folders, one per class, where n is not greater than m.
In a concrete implementation, as shown in Fig. 1, an image containing four characters plus interference is cut into four character pictures, each of which contains exactly one complete character: character A, character A, character 4, and character Q. The four character pictures serve as the bags of the multiple-instance learning and are stored separately. For example, the two pictures containing character A are placed as one class in the folder named A, as shown in Fig. 1. No manual removal of interfering lines is required, which improves the efficiency of producing learning samples.
To speed up haar-like feature extraction, this embodiment first computes the integral image of each bag in the library. If an image is not a grayscale image, each bag in the library is first converted to grayscale and the bag's integral image is then computed by formula (1); if the image is a grayscale image, the integral image of the bag is computed directly by formula (1):

ii(x, y) = Σ_{i≤x, j≤y} img(i, j)   (1)

The integral image ii(x, y) at position (x, y) of the image denotes the sum of all pixels whose abscissa i ≤ x and ordinate j ≤ y. In this embodiment, the image coordinates (x, y) are defined in the rectangular coordinate system XOY whose origin O is the top-left vertex of the image, whose X axis is horizontal, and whose Y axis is vertical. The integral image ii(x, y) at position (x, y) of image img is therefore the sum of all pixel values above and to the left of (x, y); in formula (1), img(i, j) is any pixel above and to the left of (x, y). The haar-like feature prototypes are applied to the integral image ii of the bag to extract haar-like features as the bag's instances; each instance of the bag is represented by a vector, each component of which is the feature value extracted by the corresponding haar-like feature prototype. If the haar-like feature prototypes are insufficient to describe the character features, the prototypes are extended by rotation, and the extended prototypes are then applied to the bag's integral image to extract extended haar-like features as the bag's instances.
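The integral image of formula (1) and a feature read from it can be sketched as follows; this is a minimal illustration, not the patent's implementation, and the two-rectangle prototype and all function names are assumptions for the example:

```python
import numpy as np

def integral_image(img):
    """Integral image of formula (1):
    ii(x, y) = sum of img(i, j) over all i <= x and j <= y."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=np.int64), axis=0), axis=1)

def region_sum(ii, x0, y0, x1, y1):
    """Sum of the pixels in the inclusive rectangle (x0, y0)..(x1, y1),
    read off the integral image in constant time by inclusion-exclusion."""
    total = int(ii[y1, x1])
    if x0 > 0:
        total -= int(ii[y1, x0 - 1])
    if y0 > 0:
        total -= int(ii[y0 - 1, x1])
    if x0 > 0 and y0 > 0:
        total += int(ii[y0 - 1, x0 - 1])
    return total

def haar_two_rect_horizontal(ii, x, y, w, h):
    """One two-rectangle haar-like prototype (an assumed example):
    sum of the left half of the window minus sum of the right half."""
    left = region_sum(ii, x, y, x + w // 2 - 1, y + h - 1)
    right = region_sum(ii, x + w // 2, y, x + w - 1, y + h - 1)
    return left - right
```

Once the integral image is built, every rectangle sum, and hence every haar-like feature value, costs at most four array lookups, which is what makes step 2 fast.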
Step 3: use the diverse density algorithm to find, for each class in the library, the u instances with the largest diverse density among the class's bags, as that class's key instances. In this embodiment, the key instances of the class for character A are found in this way.
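Step 3 can be sketched with the classic noisy-OR diverse density model of Maron and Lozano-Pérez; the Gaussian-like instance model, the `scale` parameter, and the function names are assumptions, since the patent does not specify which diverse density variant it uses:

```python
import numpy as np

def diverse_density(t, positive_bags, negative_bags, scale=1.0):
    """Noisy-OR diverse density of a candidate point t: how likely t is
    to be the target concept given every bag, under a Gaussian-like
    instance model. Each bag is an array of instance vectors."""
    def p_positive(bag):
        d2 = np.sum((np.asarray(bag) - t) ** 2, axis=1)
        return 1.0 - np.prod(1.0 - np.exp(-scale * d2))
    dd = 1.0
    for bag in positive_bags:
        dd *= p_positive(bag)        # positive bags should contain t
    for bag in negative_bags:
        dd *= 1.0 - p_positive(bag)  # negative bags should not
    return dd

def key_instances(class_bags, other_bags, u):
    """Score every instance of the class's own bags and keep the u with
    the largest diverse density, as step 3 prescribes."""
    candidates = [inst for bag in class_bags for inst in np.asarray(bag)]
    scored = sorted(candidates,
                    key=lambda c: diverse_density(c, class_bags, other_bags),
                    reverse=True)
    return scored[:u]
```

Instances shared by the positive bags but absent from the negative bags score highest, so interference patterns that also appear in other classes are filtered out automatically.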
Step 4: take the key instances of each class as SVM classifier samples; train n SVM classifiers according to the number of classes n, and arrange the n SVM classifiers into a binary decision tree, each SVM classifier being a node of the tree. As shown in Fig. 2, SVM classifier 1 is the root node of the binary decision tree; its left child is the first character class, corresponding to classifier 1, and its right child is SVM classifier 2; the left child of SVM classifier 2 is its corresponding second character class; in general, the left child of the i-th SVM classifier is its corresponding i-th character class, and its right child is the (i+1)-th SVM classifier. In a concrete implementation, the number of classes n is determined by the kinds of characters to be recognized. Each SVM classifier is obtained as follows:
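Because every right child is simply the next classifier, the binary decision tree described above behaves as a decision list: formula (2) is evaluated at each node from the root down, and the first node that fires names the class. A sketch, assuming the weight vector W* and threshold b* of each node have already been trained (the class and function names are illustrative):

```python
import numpy as np

class SVMNode:
    """One node of the binary decision tree, implementing formula (2):
    f(t) = sgn(<W*, t> + b*)."""
    def __init__(self, label, W, b):
        self.label = label            # character class this node recognizes
        self.W = np.asarray(W, dtype=float)
        self.b = float(b)

    def f(self, t):
        return 1 if float(np.dot(self.W, t)) + self.b > 0 else -1

def classify(nodes, T):
    """Walk the tree from the root down: the first node whose f(T) > 0
    yields the recognized class; if no node fires at this window size,
    return None so the caller can enlarge the scanning window."""
    for node in nodes:
        if node.f(T) > 0:
            return node.label
    return None
```

With n classes, at most n dot products are needed per feature vector, which keeps recognition time flat as learning samples are added.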
Take the key instances of one class in the library as the positive samples for SVM classifier learning; arbitrarily choose u key instances from the key instances of all other classes in the library as the negative samples for SVM classifier learning; train on these positive and negative samples with the SVM algorithm to obtain the SVM classifier f(t) characterized by formula (2):

f(t) = sgn(<W*, t> + b*)   (2)

In formula (2), b* is a preset threshold, t is a sample to be classified, and W* is the weight vector. Each node in the binary decision tree is then an SVM classifier expressed by formula (2).
II. Recognition phase: the learning result of the multiple-instance machine learning method is used to recognize character strings under complex interference.
As shown in Fig. 3, after the recognition phase begins, the string picture to be recognized is first cut into 4 parts such that each part contains one and only one complete character; the cut pictures are then preprocessed, including grayscale conversion and computation of the bags' integral images. Next, the smallest character size is taken as the initial rectangular scanning feature window, and each cut picture is scanned from left to right, then from top to bottom. The rectangular feature values inside the scanning window are computed to obtain the feature vector T, which is substituted into each node of the binary decision tree in top-down order to compute f(T) by formula (3):

f(T) = sgn(<W*, T> + b*)   (3)

If f(T) > 0, the character is the one corresponding to this classifier and the recognition result is output; otherwise the character in the scanning window belongs to one of the remaining classes, so T is substituted into the next node to compute f(T), until f(T) > 0, at which point classification ends and the result is output. If f(T) is still less than 0 after T has been substituted into the last node of the binary decision tree, the initial scanning window is enlarged by a fixed factor and T is substituted into the nodes from the top of the tree again, until f(T) > 0, when classification ends and the result is output. If the scanning window has been enlarged beyond the largest character size and still no f(T) > 0 is found, recognition has failed; the character picture that failed recognition is stored in the library again, and a new round of learning begins from step 2.
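The multi-scale scanning loop of the recognition phase can be sketched as follows; the scanning stride, the default enlargement factor of 2, and the mean-intensity feature in the usage note are assumptions for illustration, as the patent leaves them unspecified:

```python
import numpy as np

def decide(nodes, T):
    """Top-down walk of the SVM decision tree (formula (3)): each node is
    a (label, W, b) triple; the first node with <W, T> + b > 0 wins."""
    for label, W, b in nodes:
        if float(np.dot(W, T)) + b > 0:
            return label
    return None

def multiscale_recognize(picture, nodes, min_size, max_size,
                         factor=2, features=None):
    """Scan a square window over the picture, smallest character size
    first, left to right then top to bottom; if no classifier fires at
    one size, enlarge the window by a fixed factor, and fail (None) once
    the window exceeds the largest character size."""
    h, w = picture.shape
    size = min_size
    while size <= max_size:
        step = max(1, size // 4)  # assumed scanning stride
        for y in range(0, h - size + 1, step):       # top to bottom
            for x in range(0, w - size + 1, step):   # left to right
                window = picture[y:y + size, x:x + size]
                T = features(window) if features else window.ravel()
                label = decide(nodes, T)
                if label is not None:
                    return label
        size = int(np.ceil(size * factor))
    return None  # window grew past the largest character: recognition fails
```

For example, with a single toy node `('X', [1.0], -0.5)` and a mean-intensity feature, a bright 4x4 patch in an 8x8 picture is found at the smallest window size, while an all-dark picture exhausts every scale and fails.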
Claims (1)
1. A method for segmenting and recognizing character strings under complex interference, characterized in that it proceeds as follows:
I. Learning phase: a multiple-instance machine learning method is used to learn character strings under complex interference as follows;
Step 1: obtain the bags for multiple-instance learning.
Cut an image containing m noisy characters into m sub-pictures, each containing one and only one complete character; take the m sub-pictures as m bags for multiple-instance learning and store each of the m bags in the library. Storing means that pictures of the same character are placed in the same folder as one class, yielding n folders, one per class, where said n is not greater than m;
Step 2: use haar-like feature prototypes to extract haar-like features as the bags' instances.
If said image is not a grayscale image, first convert each bag in the library to grayscale and then compute the bag's integral image by formula (1); if said image is a grayscale image, compute the integral image ii of the bag directly by formula (1):

ii(x, y) = Σ_{i≤x, j≤y} img(i, j)   (1)

In formula (1), ii(x, y) denotes the sum of all pixels of the image whose abscissa i ≤ x and ordinate j ≤ y.
Apply the haar-like feature prototypes to the integral image of said bag to extract haar-like features as the bag's instances; each instance of said bag is represented by a vector, each component of which is the feature value extracted by the corresponding haar-like feature prototype;
Step 3: use the diverse density algorithm to find, for each class in the library, the u instances with the largest diverse density among the class's bags, as that class's key instances;
Step 4: take the key instances of each class as SVM classifier samples; train n SVM classifiers according to the number of classes n, and arrange said n SVM classifiers into a binary decision tree, each SVM classifier being a node of said binary decision tree. Each said SVM classifier is obtained as follows:
Take the key instances of one class in the library as the positive samples for SVM classifier learning; arbitrarily choose u key instances from the key instances of all other classes in the library as the negative samples for SVM classifier learning; train on said positive and negative samples with the SVM algorithm to obtain the SVM classifier f(t) characterized by formula (2):

f(t) = sgn(<W*, t> + b*)   (2)

In formula (2), b* is a preset threshold, t is a sample to be classified, and W* is the weight vector;
II. Recognition phase: the SVM classifiers f(t) obtained by said multiple-instance machine learning method are used to recognize character strings under complex interference;
Take the smallest character size as the initial rectangular scanning feature window and scan each cut picture from left to right, then from top to bottom. Compute the rectangular feature values inside the scanning window to obtain the feature vector T; substitute said T into each node of said binary decision tree in top-down order and compute f(T) by formula (3):

f(T) = sgn(<W*, T> + b*)   (3)

When f(T) > 0, classification ends and the recognition result is output. If f(T) is still less than 0 after T has been substituted into the last node of said binary decision tree, enlarge said initial scanning window by a fixed factor and substitute T into the nodes from the top of the tree again; when f(T) > 0, classification ends and the recognition result is output. If said initial scanning window has been enlarged beyond the largest character size and still no f(T) > 0 is found, recognition has failed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012101932466A CN102722736A (en) | 2012-06-13 | 2012-06-13 | Method for splitting and identifying character strings at complex interference |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102722736A true CN102722736A (en) | 2012-10-10 |
Family
ID=46948485
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012101932466A Pending CN102722736A (en) | 2012-06-13 | 2012-06-13 | Method for splitting and identifying character strings at complex interference |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102722736A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103226711A (en) * | 2013-03-28 | 2013-07-31 | 四川长虹电器股份有限公司 | Quick Haar wavelet feature object detecting method |
CN103632380A (en) * | 2013-11-01 | 2014-03-12 | 华南理工大学 | On-line game playing card identification method based on key point decision trees |
CN104778457A (en) * | 2015-04-18 | 2015-07-15 | 吉林大学 | Video face identification algorithm on basis of multi-instance learning |
CN106445988A (en) * | 2016-06-01 | 2017-02-22 | 上海坤士合生信息科技有限公司 | Intelligent big data processing method and system |
CN107292302A (en) * | 2016-03-31 | 2017-10-24 | 高德信息技术有限公司 | Detect the method and system of point of interest in picture |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101110103A (en) * | 2006-07-20 | 2008-01-23 | 中国科学院自动化研究所 | Image registration self-verifying method based on learning |
JP2009048641A (en) * | 2007-08-20 | 2009-03-05 | Fujitsu Ltd | Character recognition method and character recognition device |
CN101937508A (en) * | 2010-09-30 | 2011-01-05 | 湖南大学 | License plate localization and identification method based on high-definition image |
CN102163287A (en) * | 2011-03-28 | 2011-08-24 | 北京邮电大学 | Method for recognizing characters of licence plate based on Haar-like feature and support vector machine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20121010 |