CN102722736A - Method for splitting and identifying character strings at complex interference - Google Patents

Method for splitting and identifying character strings at complex interference

Info

Publication number
CN102722736A
CN102722736A CN201210193246A CN2012101932466A
Authority
CN
China
Prior art keywords
bag
character
svm classifier
characteristic
classifier device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101932466A
Other languages
Chinese (zh)
Inventor
汪荣贵
戴经成
周良
李想
游生福
查炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN2012101932466A priority Critical patent/CN102722736A/en
Publication of CN102722736A publication Critical patent/CN102722736A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method for segmenting and recognizing character strings under complex interference, characterized by comprising a learning phase and a recognition phase. The learning phase comprises the following steps: an image containing m characters is cut into m pictures to form the bags for multi-instance learning; identical characters are treated as one class, and the classified bags are stored in a library; the integral image of each bag is computed and its Haar-like features are extracted as the instances of the bag; the key instances of each class are found with the diverse density algorithm; and finally the key instances are learned using the classification capability of an SVM (Support Vector Machine). The recognition phase comprises the following step: the class of a new bag is predicted from the learning result so as to recognize the character string. The method can automatically recognize character strings under complex interference with high recognition speed and efficiency.

Description

Method for segmenting and recognizing character strings under complex interference
Technical field
The present invention relates to the field of image processing, and in particular to a method for segmenting and recognizing character strings under complex interference.
Background technology
Optical character recognition (OCR) has made great progress over years of development and is now widely used in fields such as handwriting input, automatic license plate recognition, and the automatic scanning and recognition of text. However, existing OCR techniques still find it difficult to robustly segment and recognize character strings under complex interference. Precisely for this reason, character strings subjected to a certain amount of interference are commonly used on networks as verification codes to determine whether an operation is performed by a human or automatically by a computer.
At present, character string recognition methods fall into two broad categories. The first is based on Euclidean distance, for example template matching, PCA, 2D-PCA, and Hu invariant moments. These methods are simple and easy to implement and give good recognition results for characters of regular shape, but for character strings under complex interference their recognition performance is very poor.
Character strings under complex interference generally have the following characteristics:
(1) each character class appears in multiple fonts and is deliberately distorted or rotated by some angle;
(2) the characters are stuck together and are not easily separated;
(3) the interference is not obviously distinguishable from the features of the characters themselves.
Increasing the number of learning templates improves the recognition rate to some extent, but adds excessive time overhead.
The second category is based on supervised machine learning, for example neural networks, SVM, and the AdaBoost algorithm. These methods have the ability to learn, can extract sample features automatically, and offer a high recognition rate and fast recognition speed; however, they require unambiguous samples, and for noisy samples they do not learn well, so their recognition rate on character strings under complex interference is low.
To recognize interfered character strings algorithmically, both kinds of methods require manually removing the interference and producing a large number of samples, which is both time-consuming and laborious.
Summary of the invention
To avoid the above shortcomings of the prior art, the present invention proposes a method for segmenting and recognizing character strings under complex interference. The method obtains samples automatically, does not increase the time overhead while guaranteeing the recognition rate, and achieves good learning results and recognition rates even on noisy samples.
The present invention solves the technical problem by the following technical scheme:
The method for segmenting and recognizing character strings under complex interference according to the present invention is characterized in that it proceeds as follows:
I. Learning phase: character strings under complex interference are learned by multi-instance machine learning as follows.
Step 1: obtain the bags for multi-instance learning.
An image that contains m characters and interference is cut into m sub-pictures, each containing one and only one complete character. The m sub-pictures are the m bags for multi-instance learning, and the m bags are stored in a library. Storing means that sub-pictures of the same character are treated as one class and placed in the same folder, giving n folders corresponding to the number of classes, where n is not greater than m.
Step 2: use Haar-like feature prototypes to extract Haar-like features as the instances of each bag.
If the image is not a grayscale image, each bag in the library is first converted to grayscale and the integral image of the bag is then computed by formula (1); if the image is already grayscale, the integral image ii of the bag is computed directly by formula (1):
ii(x, y) = Σ_{i≤x, j≤y} img(i, j)    (1)
In formula (1), ii(x, y) denotes the sum of all pixels in the image whose abscissa i ≤ x and ordinate j ≤ y.
Haar-like feature prototypes are applied to the integral image of the bag to extract Haar-like features as the instances of the bag; each instance of the bag is represented by a vector whose components are the feature values extracted by the corresponding Haar-like feature prototypes.
Step 3: for each class in the library, use the diverse density algorithm to find the u instances with the largest diverse density among the bags of that class, and take them as the key instances t*_1, ..., t*_u of that class.
Step 4: use the key instances of each class as the samples of SVM classifiers. According to the number of classes n, n SVM classifiers are trained and organized into a binary decision tree, each SVM classifier being one node of the binary decision tree. Each SVM classifier is obtained as follows.
The key instances of a given class in the library are taken as the positive samples for the SVM classifier, and u key instances chosen arbitrarily from the key instances of the other classes in the library are taken as the negative samples. The SVM classifier f(t) characterized by formula (2) is obtained by training on the positive and negative samples with the SVM algorithm:
f(t) = sgn(<W*, t> + b*)    (2)
In formula (2), b* is the threshold, t is the sample to be classified, and W* is the weight vector.
II. Recognition phase: the SVM classifiers f(t) obtained by the multi-instance learning method are used to recognize character strings under complex interference.
Using the smallest character size as the initial rectangular scanning window, each sub-picture obtained after cutting is scanned from left to right and then from top to bottom. The rectangular feature values inside the scanning window are computed to obtain a feature vector T, which is substituted into the nodes of the binary decision tree in order from the top down, and f(T) is computed by formula (3):
f(T) = sgn(<W*, T> + b*)    (3)
Classification ends and the recognition result is output as soon as f(T) is greater than 0. If f(T) is still less than 0 after the feature vector T has been substituted into the last node of the binary decision tree, the initial scanning window is enlarged by a fixed factor and the nodes of the binary decision tree are evaluated again from the top; when f(T) > 0, classification ends and the recognition result is output. If f(T) > 0 has still not been found when the scanning window has grown larger than the largest character, recognition fails.
Compared with the prior art, the beneficial effects of the present invention are:
1. The present invention uses multi-instance learning to obtain samples automatically from character string pictures under complex interference; no samples need to be prepared manually, which improves work efficiency.
2. By using multi-instance learning, the present invention learns well from complex character strings containing noise and interference and can therefore reach a high recognition rate, and increasing the number of learning samples does not increase the recognition time.
3. The present invention uses the integral image, so the Haar-like features of an image can be obtained quickly.
4. The present invention extracts sample features with Haar-like features, which is applicable to feature extraction for character strings under any complex interference and therefore has good universality.
5. The present invention uses multi-scale scanning and therefore has better robustness and recognition performance for character strings whose character size varies.
Description of drawings
Fig. 1 is a schematic diagram of learning sample acquisition in the present invention;
Fig. 2 is a schematic diagram of the construction of the multiple SVM classifiers in the present invention;
Fig. 3 is a schematic diagram of the character string recognition process in the present invention.
Embodiment
In this embodiment, the method for segmenting and recognizing character strings under complex interference proceeds as follows:
I. Learning phase: character strings under complex interference are learned by multi-instance machine learning as follows.
Step 1: obtain the bags for multi-instance learning.
An image that contains m characters and interference is automatically cut into m sub-pictures, each containing one and only one complete character. Each sub-picture is one bag of the multi-instance learning, the bags together form the learning samples, and each bag is stored in the library. Storing means that sub-pictures of the same character are treated as one class and placed in the same folder, giving n folders corresponding to the number of classes, where n is not greater than m.
In a specific implementation, as shown in Fig. 1, an image that contains four characters and interference is cut into four character pictures, each uniquely containing one complete character, namely character A, character A, character 4, and character Q. The four character pictures are used as the bags of the multi-instance learning and are stored separately. For example, the two pictures containing character A are put, as the same class, into the same folder named A, as shown in Fig. 1. There is no need to manually remove the interfering lines, which improves the efficiency of producing learning samples.
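As a minimal illustrative Python sketch of this step (assuming, purely for illustration, that the m characters occupy equal-width slices of the source image and that the class labels of the training string are known; the folder layout simply mirrors the one-folder-per-class storage described above):

import os
from PIL import Image

def build_bags(image_path, labels, library_dir):
    # Cut an image containing m characters into m sub-pictures (bags)
    # and store each bag in the folder of its character class.
    img = Image.open(image_path)
    w, h = img.size
    m = len(labels)                         # number of characters in the string
    slice_w = w // m                        # assumption: equal-width slices
    for i, label in enumerate(labels):
        bag = img.crop((i * slice_w, 0, (i + 1) * slice_w, h))
        class_dir = os.path.join(library_dir, label)     # one folder per class
        os.makedirs(class_dir, exist_ok=True)
        bag.save(os.path.join(class_dir, "%s_%d.png" % (os.path.basename(image_path), i)))

# For the example of Fig. 1: build_bags("string.png", ["A", "A", "4", "Q"], "library")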
Step 2: use Haar-like feature prototypes to extract Haar-like features as the instances of each bag.
To speed up the extraction of Haar-like features, this embodiment first computes the integral image of each bag in the library. If the image is not a grayscale image, each bag in the library is first converted to grayscale and the integral image of the bag is then computed by formula (1); if the image is already grayscale, the integral image of the bag is computed directly by formula (1):
ii(x, y) = Σ_{i≤x, j≤y} img(i, j)    (1)
In formula (1), ii(x, y) denotes the value of the integral image at position (x, y), i.e., the sum of all pixels of the image whose abscissa i ≤ x and ordinate j ≤ y. In this embodiment, the image coordinates (x, y) are defined in a plane rectangular coordinate system XOY whose origin O is the top-left vertex of the image, whose X axis is the horizontal direction, and whose Y axis is the vertical direction. Thus ii(x, y) is the sum of all pixel values of the image img in the region above and to the left of position (x, y), and img(i, j) in formula (1) denotes any pixel in that upper-left region. With ii as the integral image of the bag, Haar-like feature prototypes are applied to the integral image of the bag to extract Haar-like features as the instances of the bag; each instance of the bag is represented by a vector whose components are the feature values extracted by the corresponding Haar-like feature prototypes. If the basic Haar-like feature prototypes are not sufficient to describe the character features, the prototypes are extended by rotation, and the extended Haar-like prototypes are applied to the integral image of the bag to extract the extended Haar-like features as the instances of the bag.
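A minimal Python sketch of formula (1) and of evaluating one two-rectangle Haar-like prototype on the integral image (the particular prototype layout, the window placement, and the use of NumPy are illustrative assumptions):

import numpy as np

def integral_image(gray):
    # Formula (1): ii(x, y) = sum of img(i, j) over all i <= x, j <= y.
    return gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    # Sum of pixels in the w-by-h rectangle whose top-left corner is (x, y),
    # obtained from four integral-image lookups.
    A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
    B = ii[y - 1, x + w - 1] if y > 0 else 0
    C = ii[y + h - 1, x - 1] if x > 0 else 0
    D = ii[y + h - 1, x + w - 1]
    return D - B - C + A

def haar_two_rect_horizontal(ii, x, y, w, h):
    # One Haar-like prototype: left half of the window minus right half.
    return rect_sum(ii, x, y, w // 2, h) - rect_sum(ii, x + w // 2, y, w - w // 2, h)

An instance vector of a bag is then simply the tuple of such feature values, one component per prototype, evaluated on the bag's integral image.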
Step 3: for each class in the library, use the diverse density algorithm to find the u instances with the largest diverse density among the bags of that class, and take them as the key instances t*_1, ..., t*_u of that class. In this embodiment, the key instances of the class of character A are:
t*_1 = (0, 2, 218, 212, 34, 231, 24, 32, 12, 13, 12, 45, 15),
t*_2 = (12, 41, 243, 221, 19, 251, 13, 28, 46, 32, 20, 21, 22),
t*_3 = (4, 21, 223, 233, 16, 242, 18, 27, 31, 22, 24, 35, 31),
t*_4 = (13, 23, 225, 241, 8, 229, 17, 16, 24, 10, 16, 28, 16),
t*_5 = (21, 0, 241, 220, 4, 227, 16, 10, 9, 3, 18, 40, 29),
where the number of key instances is u = 5.
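A minimal Python sketch of key-instance selection by diverse density, using the common noisy-OR formulation with a Gaussian-like instance-to-concept likelihood (the candidate search over instances of the positive bags and the scale parameter sigma are illustrative assumptions; the patent only requires keeping the u instances of largest diverse density):

import numpy as np

def diverse_density(t, positive_bags, negative_bags, sigma=1.0):
    # Noisy-OR diverse density of a candidate concept t.
    # Each bag is an array of shape (n_instances, n_features).
    def pr_bag(bag):
        d2 = ((bag - t) ** 2).sum(axis=1)
        return 1.0 - np.prod(1.0 - np.exp(-d2 / sigma ** 2))
    dd = 1.0
    for bag in positive_bags:
        dd *= pr_bag(bag)
    for bag in negative_bags:
        dd *= 1.0 - pr_bag(bag)
    return dd

def key_instances(positive_bags, negative_bags, u=5):
    # Use every instance of the positive bags as a candidate concept and
    # keep the u candidates with the largest diverse density.
    candidates = np.vstack(positive_bags)
    scores = np.array([diverse_density(t, positive_bags, negative_bags) for t in candidates])
    return candidates[np.argsort(scores)[::-1][:u]]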
Step 4: use the key instances of each class as the samples of SVM classifiers. According to the number of classes n, n SVM classifiers are trained and organized into a binary decision tree, each SVM classifier being one node of the binary decision tree. As shown in Fig. 2, SVM classifier 1 is the root node of the binary decision tree; its left child is the first character class, which corresponds to classifier 1, and its right child is SVM classifier 2, whose left child is its corresponding second character class. In general, the left child of the i-th SVM classifier is its corresponding i-th character class, and its right child is the (i+1)-th SVM classifier. In a specific implementation, the number of classes n is determined by the kinds of characters that need to be recognized. Each SVM classifier is obtained as follows.
The key instances of a given class in the library are taken as the positive samples for the SVM classifier, and u key instances chosen arbitrarily from the key instances of the other classes in the library are taken as the negative samples. The SVM classifier f(t) characterized by formula (2) is obtained by training on the positive and negative samples with the SVM algorithm:
f(t) = sgn(<W*, t> + b*)    (2)
In formula (2), b* is the threshold, t is the sample to be classified, and W* is the weight vector. Each node of the binary decision tree is then an SVM classifier expressed by formula (2).
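A minimal sketch of this training step, assuming scikit-learn's linear SVC stands in for the SVM training of formula (2); the chain of classifiers returned here plays the role of the right-branch nodes of the binary decision tree of Fig. 2:

import numpy as np
from sklearn.svm import SVC

def train_decision_tree(key_instances_by_class, u=5, seed=0):
    # key_instances_by_class: dict {class_label: array of shape (u, n_features)}.
    # Returns a list of (class_label, classifier) pairs ordered from root to leaf.
    rng = np.random.default_rng(seed)
    labels = list(key_instances_by_class)
    nodes = []
    for label in labels:
        pos = key_instances_by_class[label]
        others = np.vstack([key_instances_by_class[l] for l in labels if l != label])
        neg = others[rng.choice(len(others), size=u, replace=False)]
        X = np.vstack([pos, neg])
        y = np.r_[np.ones(len(pos)), -np.ones(len(neg))]
        clf = SVC(kernel="linear")          # decision function <W*, t> + b*
        clf.fit(X, y)
        nodes.append((label, clf))
    return nodes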
II. Recognition phase: the learning result of the multi-instance learning method is used to recognize character strings under complex interference.
As shown in Fig. 3, when the recognition phase begins, the character string picture to be recognized is first cut into 4 sub-pictures such that each sub-picture contains one and only one complete character. Each sub-picture is then preprocessed, which includes grayscale conversion and computation of the integral image of the bag. Next, using the smallest character size as the initial rectangular scanning window, each sub-picture obtained after cutting is scanned from left to right and then from top to bottom. The rectangular feature values inside the scanning window are computed to obtain a feature vector T, which is substituted into the nodes of the binary decision tree in order from the top down, and f(T) is computed by formula (3):
f(T) = sgn(<W*, T> + b*)    (3)
If f(T) > 0, the character is the one corresponding to this classifier and this recognition result is output; otherwise the character in the scanning window belongs to one of the remaining classes, so the feature vector is substituted into the next node and f(T) is computed again, until f(T) is greater than 0, at which point classification ends and the recognition result is output. If f(T) is still less than 0 after the feature vector T has been substituted into the last node of the binary decision tree, the initial scanning window is enlarged by a fixed factor and the nodes of the binary decision tree are evaluated again from the top; when f(T) > 0, classification ends and the recognition result is output. If f(T) > 0 has still not been found when the scanning window has grown larger than the largest character, recognition fails; the character picture that failed to be recognized is stored in the library again, and a new round of learning begins from step 2.
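A minimal Python sketch of this multi-scale scanning recognition (the window step, the size bounds, the 1.25 enlargement factor, and the extract_features helper that evaluates the Haar-like prototypes inside the window are illustrative assumptions; nodes is the root-to-leaf classifier list from the training sketch above, and integral_image is the function from the feature-extraction sketch):

def recognize_character(gray, nodes, extract_features,
                        min_size=12, max_size=48, scale=1.25):
    # Slide a square window over one sub-picture, growing it by a fixed factor
    # until some node of the decision tree answers f(T) > 0; return that class.
    ii = integral_image(gray)
    h, w = gray.shape
    size = min_size
    while size <= max_size:
        step = max(1, size // 4)
        for y in range(0, h - size + 1, step):         # rows: top to bottom
            for x in range(0, w - size + 1, step):     # within a row: left to right
                T = extract_features(ii, x, y, size, size)
                for label, clf in nodes:                # walk the decision tree
                    if clf.decision_function([T])[0] > 0:    # f(T) > 0
                        return label
        size = int(round(size * scale))                 # enlarge the scanning window
    return None                                         # recognition failed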

Claims (1)

1. A method for segmenting and recognizing character strings under complex interference, characterized in that it proceeds as follows:
I. Learning phase: character strings under complex interference are learned by multi-instance machine learning as follows.
Step 1: obtain the bags for multi-instance learning.
An image that contains m characters and interference is cut into m sub-pictures, each containing one and only one complete character. The m sub-pictures are the m bags for multi-instance learning, and the m bags are stored in a library. Storing means that sub-pictures of the same character are treated as one class and placed in the same folder, giving n folders corresponding to the number of classes, where n is not greater than m.
Step 2: use Haar-like feature prototypes to extract Haar-like features as the instances of each bag.
If the image is not a grayscale image, each bag in the library is first converted to grayscale and the integral image of the bag is then computed by formula (1); if the image is already grayscale, the integral image ii of the bag is computed directly by formula (1):
ii(x, y) = Σ_{i≤x, j≤y} img(i, j)    (1)
In formula (1), ii(x, y) denotes the sum of all pixels in the image whose abscissa i ≤ x and ordinate j ≤ y.
Haar-like feature prototypes are applied to the integral image of the bag to extract Haar-like features as the instances of the bag; each instance of the bag is represented by a vector whose components are the feature values extracted by the corresponding Haar-like feature prototypes.
Step 3: for each class in the library, use the diverse density algorithm to find the u instances with the largest diverse density among the bags of that class, and take them as the key instances t*_1, ..., t*_u of that class.
Step 4: use the key instances of each class as the samples of SVM classifiers. According to the number of classes n, n SVM classifiers are trained and organized into a binary decision tree, each SVM classifier being one node of the binary decision tree. Each SVM classifier is obtained as follows:
the key instances of a given class in the library are taken as the positive samples for the SVM classifier, and u key instances chosen arbitrarily from the key instances of the other classes in the library are taken as the negative samples; the SVM classifier f(t) characterized by formula (2) is obtained by training on the positive and negative samples with the SVM algorithm:
f(t) = sgn(<W*, t> + b*)    (2)
In formula (2), b* is the threshold, t is the sample to be classified, and W* is the weight vector.
II. Recognition phase: the SVM classifiers f(t) obtained by the multi-instance learning method are used to recognize character strings under complex interference.
Using the smallest character size as the initial rectangular scanning window, each sub-picture obtained after cutting is scanned from left to right and then from top to bottom. The rectangular feature values inside the scanning window are computed to obtain a feature vector T, which is substituted into the nodes of the binary decision tree in order from the top down, and f(T) is computed by formula (3):
f(T) = sgn(<W*, T> + b*)    (3)
Classification ends and the recognition result is output as soon as f(T) is greater than 0. If f(T) is still less than 0 after the feature vector T has been substituted into the last node of the binary decision tree, the initial scanning window is enlarged by a fixed factor and the nodes of the binary decision tree are evaluated again from the top; when f(T) > 0, classification ends and the recognition result is output. If f(T) > 0 has still not been found when the scanning window has grown larger than the largest character, recognition fails.
CN2012101932466A 2012-06-13 2012-06-13 Method for splitting and identifying character strings at complex interference Pending CN102722736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101932466A CN102722736A (en) 2012-06-13 2012-06-13 Method for splitting and identifying character strings at complex interference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012101932466A CN102722736A (en) 2012-06-13 2012-06-13 Method for splitting and identifying character strings at complex interference

Publications (1)

Publication Number Publication Date
CN102722736A true CN102722736A (en) 2012-10-10

Family

ID=46948485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101932466A Pending CN102722736A (en) 2012-06-13 2012-06-13 Method for splitting and identifying character strings at complex interference

Country Status (1)

Country Link
CN (1) CN102722736A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101110103A (en) * 2006-07-20 2008-01-23 中国科学院自动化研究所 Image registration self-verifying method based on learning
JP2009048641A (en) * 2007-08-20 2009-03-05 Fujitsu Ltd Character recognition method and character recognition device
CN101937508A (en) * 2010-09-30 2011-01-05 湖南大学 License plate localization and identification method based on high-definition image
CN102163287A (en) * 2011-03-28 2011-08-24 北京邮电大学 Method for recognizing characters of licence plate based on Haar-like feature and support vector machine

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226711A (en) * 2013-03-28 2013-07-31 四川长虹电器股份有限公司 Quick Haar wavelet feature object detecting method
CN103632380A (en) * 2013-11-01 2014-03-12 华南理工大学 On-line game playing card identification method based on key point decision trees
CN103632380B (en) * 2013-11-01 2017-04-12 华南理工大学 On-line game playing card identification method based on key point decision trees
CN104778457A (en) * 2015-04-18 2015-07-15 吉林大学 Video face identification algorithm on basis of multi-instance learning
CN104778457B (en) * 2015-04-18 2017-12-01 吉林大学 Video face identification method based on multi-instance learning
CN107292302A (en) * 2016-03-31 2017-10-24 高德信息技术有限公司 Detect the method and system of point of interest in picture
CN106445988A (en) * 2016-06-01 2017-02-22 上海坤士合生信息科技有限公司 Intelligent big data processing method and system

Similar Documents

Publication Publication Date Title
Quan et al. Lacunarity analysis on image patterns for texture classification
JP5041229B2 (en) Learning device and method, recognition device and method, and program
CN108108731B (en) Text detection method and device based on synthetic data
CN102722736A (en) Method for splitting and identifying character strings at complex interference
CN102147858B (en) License plate character identification method
CN105469047A (en) Chinese detection method based on unsupervised learning and deep learning network and system thereof
CN105488536A (en) Agricultural pest image recognition method based on multi-feature deep learning technology
CN103942550A (en) Scene text recognition method based on sparse coding characteristics
CN104766098A (en) Construction method for classifier
Yu et al. Vehicle logo recognition based on bag-of-words
CN105574063A (en) Image retrieval method based on visual saliency
CN102831244B (en) A kind of classification retrieving method of house property file and picture
CN107730553B (en) Weak supervision object detection method based on false-true value search method
CN104834891A (en) Method and system for filtering Chinese character image type spam
Mirrashed et al. Domain adaptive classification
Mozaffari et al. Farsi/Arabic handwritten from machine-printed words discrimination
CN104598881A (en) Feature compression and feature selection based skew scene character recognition method
CN105718934A (en) Method for pest image feature learning and identification based on low-rank sparse coding technology
JP2016151805A (en) Object detection apparatus, object detection method, and program
Sarkar et al. Suppression of non-text components in handwritten document images
Gattal et al. Segmentation and recognition strategy of handwritten connected digits based on the oriented sliding window
CN107688744A (en) Malicious file sorting technique and device based on Image Feature Matching
CN111898570A (en) Method for recognizing text in image based on bidirectional feature pyramid network
Nongmeikapam et al. Exploring an efficient handwritten Manipuri meetei-mayek character recognition using gradient feature extractor and cosine distance based multiclass k-nearest neighbor classifier
CN104008095A (en) Object recognition method based on semantic feature extraction and matching

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20121010