CN102855479A - Printed Chinese character recognition system - Google Patents
- Publication number
- CN102855479A CN102855479A CN2012102574590A CN201210257459A CN102855479A CN 102855479 A CN102855479 A CN 102855479A CN 2012102574590 A CN2012102574590 A CN 2012102574590A CN 201210257459 A CN201210257459 A CN 201210257459A CN 102855479 A CN102855479 A CN 102855479A
- Authority
- CN
- China
- Prior art keywords
- chinese characters
- sample
- sigma
- image
- gray
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to a printed Chinese character recognition system which adopts new algorithms from the field of artificial intelligence to effectively improve the quality of Chinese character recognition. The system comprises: a scanning input, which converts the Chinese characters printed on paper into electric signals through photoelectric conversion equipment, forms digital signals with multiple gray levels, and inputs them into a computer for processing; a fuzzy enhancement and cluster segmentation module, which performs fuzzy enhancement, smoothing, and optimal segmentation of the multiple gray levels; an image data binarization module, which binarizes the smoothed, enhanced and cluster-segmented image with a global threshold selection method by first computing the gray-level histogram of the image, then splitting the histogram into two groups at a candidate threshold, and selecting the threshold at which the variance between the two groups is largest; and a Chinese character recognition module, which uses a parallel neural network method to roughly classify samples with a control network (CN) and finely classify each rough class with a recognition network (RN), thereby recognizing the Chinese characters.
Description
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a printed Chinese character recognition system adopting a new algorithm in the field of artificial intelligence.
Background
The study of automatic abstracting is an important area of natural language processing. However, all current automatic abstracting models take the machine-internal representation of characters as system input, which leaves a considerable gap to practical use, because a large amount of literature still exists only in traditional printed form on paper. Therefore, a parallel neural network method for recognizing printed Chinese characters is proposed and verified by simulation. In a test on 2500 Chinese characters, the recognition rate is 97%, the false-recognition rate is 1%, and the rejection rate is 2%. Although the system is designed for printed Chinese characters, the underlying principles and methods apply equally to the recognition of printed English and to symbols such as handwritten digits, English letters, and Chinese characters.
The printed Chinese character recognition system mainly comprises the following steps: scanner input → fuzzy enhancement and cluster segmentation → image data binarization → Chinese character matching by a parallel neural network.
Disclosure of Invention
The invention aims to provide a printed Chinese character recognition system that adopts new algorithms from the field of artificial intelligence to effectively improve the quality of Chinese character recognition.
The technical scheme for realizing the purpose of the invention is as follows: a printed Chinese character recognition system which adopts new algorithms in the field of artificial intelligence and effectively improves the quality of Chinese character recognition, comprising:
scanning input, converting the Chinese characters printed on the paper surface into electric signals through a photoelectric conversion device to form digital signals with multiple gray levels, and inputting the digital signals into a computer for processing;
the fuzzy enhancement and clustering segmentation module comprises fuzzy enhancement, smoothing and multi-gray-level optimal segmentation;
the image data binarization module, which binarizes the smoothed, enhanced and multi-gray-level cluster-segmented image using a global threshold selection method: the gray-level histogram of the image is first computed, the histogram is then split into two groups at a candidate threshold, and the threshold at which the variance between the two groups is maximal is selected;
the Chinese character recognition module, which adopts a parallel neural network method, roughly classifies samples through a control network CN and finely classifies each rough class through a recognition network RN, so that the Chinese characters are recognized.
As a further improvement of the present invention, the fuzzy enhancement and smoothing module adopts the model proposed by S.K. Pal et al.: from input to output, a fuzzy contrast enhancement operator is applied twice, with one smoothing operation in between to prepare for the second enhancement.
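For illustration, the following is a minimal Python sketch of such an enhance–smooth–enhance pass. It uses a common Pal–King-style membership function; the parameters fd and fe, the 3×3 mean smoothing, and the function names are assumptions of the sketch, since the patent does not give the exact operator.

```python
import numpy as np

def pal_king_enhance(img, fd=50.0, fe=2.0):
    """One fuzzy contrast-enhancement pass in the spirit of Pal and King.
    fd and fe are illustrative fuzzifier defaults, not values from the patent."""
    x = img.astype(np.float64)
    xmax = x.max()
    mu = (1.0 + (xmax - x) / fd) ** (-fe)                 # fuzzification
    mu = np.where(mu <= 0.5, 2.0 * mu ** 2,               # intensification (contrast) operator
                  1.0 - 2.0 * (1.0 - mu) ** 2)
    x = xmax - fd * (mu ** (-1.0 / fe) - 1.0)             # defuzzification back to gray levels
    return np.clip(x, 0.0, xmax)

def box_smooth(img):
    """3x3 mean filter used as the single smoothing pass between enhancements."""
    pad = np.pad(img.astype(np.float64), 1, mode="edge")
    h, w = img.shape
    return sum(pad[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

# enhance -> smooth -> enhance, matching the module description above
# enhanced = pal_king_enhance(box_smooth(pal_king_enhance(gray_img)))
```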
As a further improvement of the invention, the cluster segmentation performs the following steps:
(1) define the distance between samples, choose three values K, C and R (K < S, where S is the number of samples), and take K samples as aggregation points;
(2) compute the pairwise distances between the K aggregation points; if the minimum distance is less than C, merge the corresponding two points and take their arithmetic mean as the new aggregation point; repeat until every pairwise distance is greater than or equal to C;
(3) examine the remaining (S − K) samples one at a time, computing the distance from each sample to every aggregation point; if the minimum distance is greater than R, the sample becomes a new aggregation point; if the minimum distance is less than R, the sample falls into the class of the nearest aggregation point, the center of gravity of that class is recomputed and taken as the new aggregation point, and if the distances between aggregation points all exceed C the next sample is examined, otherwise step (2) is applied to merge aggregation points before the next sample is examined, until all samples have been classified;
(4) examine all samples one by one and cluster them according to step (3); if the final classification is the same as before, the centers of gravity are not recomputed, otherwise they are recomputed; if the classification is unchanged after a full pass, the clustering is finished, otherwise step (4) is repeated until the classification is identical to that of the previous pass;
(5) the above clustering process terminates when the number of classes reaches 3, generating the class numbers m_1, m_2, …, m_e (m_e = 3), and the corresponding objective functionals can be determined according to equation (1);
(6) the partition corresponding to J = min{ J_1(m_1, C), J_2(m_2, C), …, J_e(m_e, C) } is the best;
(7) set the thresholds of all levels: suppose class C_i has u luminances r_i1, r_i2, …, r_iu with corresponding densities P_i1, P_i2, …, P_iu; then the threshold r_i* is given by
where m* is the number of classes in the optimal partition.
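For illustration, the following is a minimal Python sketch of the aggregation-point clustering of steps (1)–(4), under the assumption that the samples are one-dimensional gray values and that distances are absolute differences; the patent does not fix these details, and the function and parameter names are chosen for the sketch only.

```python
import numpy as np

def aggregation_cluster(samples, K, C, R, max_passes=20):
    """Sketch of the aggregation-point clustering of steps (1)-(4).
    samples: 1-D array of gray values; K, C, R are the parameters of step (1), K < len(samples)."""
    samples = np.asarray(samples, dtype=float)
    centres = list(samples[:K])                       # step (1): K samples as aggregation points

    def merge_close(cs):
        # step (2): merge the closest pair while its distance is below C
        cs = list(cs)
        while len(cs) > 1:
            d, i, j = min((abs(a - b), i, j)
                          for i, a in enumerate(cs)
                          for j, b in enumerate(cs) if i < j)
            if d >= C:
                break
            cs[i] = 0.5 * (cs[i] + cs[j])             # arithmetic mean becomes the new point
            cs.pop(j)
        return cs

    centres = merge_close(centres)
    labels = np.full(len(samples), -1)

    for _ in range(max_passes):                       # steps (3)-(4): sweep until stable
        new_labels = np.empty(len(samples), dtype=int)
        for idx, s in enumerate(samples):
            dists = np.abs(np.asarray(centres) - s)
            nearest = int(dists.argmin())
            if dists[nearest] > R:                    # far from every centre: open a new class
                centres.append(s)
                nearest = len(centres) - 1
            new_labels[idx] = nearest
        # recompute the centres of gravity and merge any that fall within C
        centres = [samples[new_labels == i].mean()
                   for i in range(len(centres)) if np.any(new_labels == i)]
        centres = merge_close(centres)
        if np.array_equal(new_labels, labels):        # classification unchanged: finished
            break
        labels = new_labels

    # final assignment of every sample to the final centres
    labels = np.array([int(np.abs(np.asarray(centres) - s).argmin()) for s in samples])
    return np.asarray(centres), labels
```

A call such as `centres, labels = aggregation_cluster(gray_values, K=8, C=10, R=30)` would return the final aggregation points and a class label for every sample; the parameter values here are arbitrary.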
As a further improvement of the invention, the image data binarization performs the following steps:
(1) count the total number of pixels N in the image and the probability of each gray level, p(i) = N(i)/N;
(2) divide the gray levels 1, …, M of the image into two groups, C_0 = {1, …, k} and C_1 = {k+1, …, M}, by a gray level k; the probabilities of the two groups are w_0 = Σ_{i=1..k} p(i) and w_1 = 1 − w_0 respectively;
the average value of C_0 is μ_0 = μ(k)/w_0;
the average value of C_1 is μ_1 = (μ − μ(k))/(1 − w_0);
where μ = Σ_i i·p(i) is the average gray level of the whole image and μ(k) = Σ_{i=1..k} i·p(i) is the average gray level up to threshold k; the variance between the two groups is σ²(k) = w_0·w_1·(μ_1 − μ_0)²;
(3) vary k from 1 to M; the value k* at which σ²(k) is maximal is the sought threshold.
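For illustration, a minimal Python sketch of this maximum-variance (Otsu-type) threshold selection follows; the 256-level default, 0-based level indexing, and function name are assumptions of the sketch.

```python
import numpy as np

def max_variance_threshold(gray_img, levels=256):
    """Maximum between-class variance threshold, following steps (1)-(3).
    Returns the gray level k* that maximises sigma^2(k)."""
    hist, _ = np.histogram(gray_img, bins=levels, range=(0, levels))
    p = hist / hist.sum()                       # step (1): p(i) = N(i)/N
    omega = np.cumsum(p)                        # w0(k): probability of group C0
    mu_k = np.cumsum(p * np.arange(levels))     # mu(k): first-order cumulative moment
    mu_total = mu_k[-1]                         # mu: mean gray level of the whole image
    # steps (2)-(3): between-class variance for every candidate threshold k
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma2 = (mu_total * omega - mu_k) ** 2 / (omega * (1.0 - omega))
    sigma2 = np.nan_to_num(sigma2)
    return int(np.argmax(sigma2))               # k*: the sought threshold

# usage sketch: binary = (img > max_variance_threshold(img)).astype(np.uint8)
```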
As a further improvement of the invention, a new additional learning algorithm is adopted which simulates the human process of learning characters: part of the Chinese characters are learned first, the remaining Chinese characters are then learned gradually, and learning of all Chinese characters is finally achieved.
As a further refinement of the invention, the additional learning algorithm performs the following steps:
(1) let the original PNN recognize all samples in V, and let V_0 be the set of correctly recognized samples;
(2) split the samples in V − V_0 as follows: V_T is the subset of V − V_0 whose characters belong to the Chinese characters the PNN has already learned; V_F is the subset of V − V_0 whose characters do not belong to the learned Chinese characters;
(3) for V_T: if the corresponding recognition network RN_i already identifies a sample correctly, RN_i need not be retrained; otherwise the sample is added to RN_i's original sample set and RN_i is retrained;
(4) for V_F: train new recognition networks RN_1', RN_2', …, RN_q', with the corresponding partitioned Chinese character subsets U_1', U_2', …, U_q';
(5) retrain the control network CN with the U ∪ V sample set.
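For illustration, the control flow of steps (1)–(5) can be sketched as follows in Python. The interface of the `pnn` object (recognise, learned_chars, rn_for, partition_new_chars, add_rn, retrain_cn, all_samples) and the sample attributes (image, label) are entirely hypothetical; they serve only to make the sequence of steps concrete.

```python
def additional_learning(pnn, V):
    """Sketch of the additional (incremental) learning steps (1)-(5).
    All attribute and method names on `pnn` and on the samples are assumed."""
    # (1) let the existing PNN recognise every sample in V
    V0 = [s for s in V if pnn.recognise(s.image) == s.label]
    rest = [s for s in V if s not in V0]

    # (2) split the mis-recognised samples by whether their character is already learned
    VT = [s for s in rest if s.label in pnn.learned_chars]
    VF = [s for s in rest if s.label not in pnn.learned_chars]

    # (3) for VT: retrain only the recognition networks that actually fail
    for s in VT:
        rn = pnn.rn_for(s.label)
        if rn.recognise(s.image) != s.label:
            rn.training_set.append(s)
            rn.retrain()

    # (4) for VF: partition the new characters and train new recognition networks
    new_subsets = pnn.partition_new_chars(VF)
    pnn.add_rn(new_subsets)

    # (5) retrain the control network on the enlarged sample set
    pnn.retrain_cn(list(pnn.all_samples) + list(V))
```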
The invention mainly comprises four steps: scanning input, fuzzy enhancement and cluster segmentation, image data binarization, and Chinese character matching by a parallel neural network. New algorithms from the field of artificial intelligence are adopted in view of the characteristics of Chinese character recognition, so that the system can recognize printed Chinese characters.
The invention has the beneficial effects that:
(1) the method adopts new algorithms in the field of artificial intelligence, thereby effectively improving the quality of Chinese character recognition;
(2) the designed network can store training mode vectors with strong correlation; the memory capacity of the network is not limited, and even all 2^N training mode vectors can be stored; meanwhile, the connection weights of the network take only the values 1, 0 or −1, so the network is easy to realize optically;
(3) the system adopts a maximum variance threshold setting method, and the method can obtain more satisfactory results regardless of whether the image histogram has obvious double peaks or not.
Drawings
FIG. 1 is a schematic structural diagram of a printed Chinese character recognition system in embodiment 1 of the present invention.
Detailed Description
The following further description is made in conjunction with the accompanying drawings and examples.
As shown in FIG. 1, a system for recognizing Chinese characters in printed form, which adopts a new algorithm in the field of artificial intelligence to effectively improve the quality of Chinese character recognition, comprises:
scanning input, which converts the Chinese characters printed on paper into electric signals through a photoelectric conversion device, forms digital signals with multiple gray levels, and inputs them into a computer for processing; a fuzzy enhancement and cluster segmentation module, which performs fuzzy enhancement, smoothing, and optimal segmentation of the multiple gray levels; an image data binarization module, which binarizes the smoothed, enhanced and multi-gray-level cluster-segmented image using a global threshold selection method by first computing the gray-level histogram of the image, then splitting the histogram into two groups at a candidate threshold, and selecting the threshold at which the variance between the two groups is maximal; and a Chinese character recognition module, which adopts a parallel neural network method, roughly classifies samples through a control network CN and finely classifies each rough class through a recognition network RN, thereby recognizing the Chinese characters.
The clustering segmentation performs the following steps:
(1) define the distance between samples, choose three values K, C and R (K < S, where S is the number of samples), and take K samples as aggregation points;
(2) compute the pairwise distances between the K aggregation points; if the minimum distance is less than C, merge the corresponding two points and take their arithmetic mean as the new aggregation point; repeat until every pairwise distance is greater than or equal to C;
(3) examine the remaining (S − K) samples one at a time, computing the distance from each sample to every aggregation point; if the minimum distance is greater than R, the sample becomes a new aggregation point; if the minimum distance is less than R, the sample falls into the class of the nearest aggregation point, the center of gravity of that class is recomputed and taken as the new aggregation point, and if the distances between aggregation points all exceed C the next sample is examined, otherwise step (2) is applied to merge aggregation points before the next sample is examined, until all samples have been classified;
(4) examine all samples one by one and cluster them according to step (3); if the final classification is the same as before, the centers of gravity are not recomputed, otherwise they are recomputed; if the classification is unchanged after a full pass, the clustering is finished, otherwise step (4) is repeated until the classification is identical to that of the previous pass;
(5) the above clustering process terminates when the number of classes reaches 3, generating the class numbers m_1, m_2, …, m_e (m_e = 3), and the corresponding objective functionals can be determined according to equation (1);
(6) the partition corresponding to J = min{ J_1(m_1, C), J_2(m_2, C), …, J_e(m_e, C) } is the best;
(7) set the thresholds of all levels: suppose class C_i has u luminances r_i1, r_i2, …, r_iu with corresponding densities P_i1, P_i2, …, P_iu; then the threshold r_i* is given by
where m* is the number of classes in the optimal partition.
The image data is binarized to perform the following steps:
(1) count the total number of pixels N in the image and the probability of each gray level, p(i) = N(i)/N;
(2) divide the gray levels 1, …, M of the image into two groups, C_0 = {1, …, k} and C_1 = {k+1, …, M}, by a gray level k; the probabilities of the two groups are w_0 = Σ_{i=1..k} p(i) and w_1 = 1 − w_0 respectively;
the average value of C_0 is μ_0 = μ(k)/w_0;
the average value of C_1 is μ_1 = (μ − μ(k))/(1 − w_0);
where μ = Σ_i i·p(i) is the average gray level of the whole image and μ(k) = Σ_{i=1..k} i·p(i) is the average gray level up to threshold k; the variance between the two groups is σ²(k) = w_0·w_1·(μ_1 − μ_0)²;
(3) vary k from 1 to M; the value k* at which σ²(k) is maximal is the sought threshold.
The method for recognizing printed Chinese characters adopts a parallel neural network (PNN) method specially proposed for the huge sample set of Chinese characters: a control network CN roughly classifies the samples, and recognition networks RN finely classify the rough classes, thereby recognizing the Chinese characters. Both CN and RN are Hopfield networks and use the "outer-product equal criterion" as the learning rule for associative memory instead of the common outer-product sum criterion or Hebb learning rule. The system also adopts a new additional learning algorithm that simulates the human process of learning characters: part of the Chinese characters are learned first, the remaining characters are then learned gradually, and learning of all Chinese characters is finally achieved.
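For illustration, the following is a minimal Python sketch of storage and recall in a Hopfield associative memory. It uses the ordinary outer-product (Hebb-style) rule with the weights clipped to {−1, 0, +1}, consistent with the property noted in the beneficial effects; it does not reproduce the "outer-product equal criterion" itself, which the patent does not specify. Under this scheme, the control network CN and each recognition network RN would each be one such associative memory.

```python
import numpy as np

def hopfield_store(patterns):
    """Outer-product storage for a Hopfield associative memory.
    patterns: array of shape (P, N) with entries in {-1, +1}."""
    W = patterns.T @ patterns            # sum of outer products x x^T over all patterns
    np.fill_diagonal(W, 0)               # no self-connections
    return np.sign(W).astype(int)        # connection weights take only -1, 0, +1

def hopfield_recall(W, probe, steps=20):
    """Synchronous recall: iterate s <- sign(W s) until the state stabilises."""
    s = probe.copy()
    for _ in range(steps):
        nxt = np.where(W @ s >= 0, 1, -1)
        if np.array_equal(nxt, s):
            break
        s = nxt
    return s
```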
Claims (8)
1. A system for identifying printed Chinese characters, the system comprising:
scanning input, converting the Chinese characters printed on the paper surface into electric signals through a photoelectric conversion device to form digital signals with multiple gray levels, and inputting the digital signals into a computer for processing;
the fuzzy enhancement and clustering segmentation module comprises fuzzy enhancement, smoothing and multi-gray-level optimal segmentation;
the image data binarization module, which binarizes the smoothed, enhanced and multi-gray-level cluster-segmented image using a global threshold selection method: the gray-level histogram of the image is first computed, the histogram is then split into two groups at a candidate threshold, and the threshold at which the variance between the two groups is maximal is selected;
the Chinese character recognition module, which adopts a parallel neural network method, roughly classifies samples through a control network CN and finely classifies each rough class through a recognition network RN, so that the Chinese characters are recognized.
2. The system for identifying printed Chinese characters of claim 1, wherein said fuzzy enhancement and smoothing module adopts the model proposed by S.K. Pal et al.: from input to output, a fuzzy contrast enhancement operator is applied twice for fuzzy enhancement, with one smoothing operation in between to prepare for the next enhancement.
3. The system for identifying printed Chinese characters of claim 1, wherein, for the histogram of the enhanced image with S brightness levels r_1, r_2, …, r_S and corresponding probability densities P_1, P_2, …, P_S, the system rewrites the objective function as:
4. The system for identifying printed Chinese characters as claimed in claim 1 or 3, wherein said cluster segmentation of the S brightness levels performs the following steps:
(1) define the distance between samples, choose three values K, C and R (K < S, where S is the number of samples), and take K samples as aggregation points;
(2) compute the pairwise distances between the K aggregation points; if the minimum distance is less than C, merge the corresponding two points and take their arithmetic mean as the new aggregation point; repeat until every pairwise distance is greater than or equal to C;
(3) examine the remaining (S − K) samples one at a time, computing the distance from each sample to every aggregation point; if the minimum distance is greater than R, the sample becomes a new aggregation point; if the minimum distance is less than R, the sample falls into the class of the nearest aggregation point, the center of gravity of that class is recomputed and taken as the new aggregation point, and if the distances between aggregation points all exceed C the next sample is examined, otherwise step (2) is applied to merge aggregation points before the next sample is examined, until all samples have been classified;
(4) examine all samples one by one and cluster them according to step (3); if the final classification is the same as before, the centers of gravity are not recomputed, otherwise they are recomputed; if the classification is unchanged after a full pass, the clustering is finished, otherwise step (4) is repeated until the classification is identical to that of the previous pass;
(5) the above clustering process terminates when the number of classes reaches 3, generating the class numbers m_1, m_2, …, m_e (m_e = 3), and the corresponding objective functionals can be determined according to equation (1);
(6) the partition corresponding to J = min{ J_1(m_1, C), J_2(m_2, C), …, J_e(m_e, C) } is the best;
(7) set the thresholds of all levels: suppose class C_i has u luminances r_i1, r_i2, …, r_iu with corresponding densities P_i1, P_i2, …, P_iu; then the threshold r_i* is given by
where m* is the number of classes in the optimal partition.
5. The system for identifying printed Chinese characters as claimed in claim 1, wherein the binarization of image data employs a maximum-variance threshold selection method: the gray-level histogram of the image is first computed, the histogram is then split into two groups at a candidate threshold, and the threshold at which the variance between the two groups is maximal is selected, by performing the following steps:
(1) count the total number of pixels N in the image and the probability of each gray level, p(i) = N(i)/N;
(2) divide the gray levels 1, …, M of the image into two groups, C_0 = {1, …, k} and C_1 = {k+1, …, M}, by a gray level k; the probabilities of the two groups are w_0 = Σ_{i=1..k} p(i) and w_1 = 1 − w_0 respectively;
the average value of C_0 is μ_0 = μ(k)/w_0;
the average value of C_1 is μ_1 = (μ − μ(k))/(1 − w_0);
where μ = Σ_i i·p(i) is the average gray level of the whole image and μ(k) = Σ_{i=1..k} i·p(i) is the average gray level up to threshold k; the variance between the two groups is σ²(k) = w_0·w_1·(μ_1 − μ_0)²;
(3) vary k from 1 to M; the value k* at which σ²(k) is maximal is the desired threshold.
6. The system for identifying printed Chinese characters as recited in claim 1, wherein said Chinese character recognition uses a parallel neural network method: samples are coarsely classified by a control network CN and each coarse class is finely classified by a recognition network RN, thereby recognizing the Chinese characters; both CN and RN use Hopfield networks.
7. The system for recognizing printed Chinese characters as claimed in claim 1 or 6, wherein a new additional learning algorithm is used which simulates the human process of learning characters: some Chinese characters are learned first, the remaining characters are then learned gradually, and learning of all Chinese characters is finally achieved.
8. The system for recognizing printed Chinese characters according to claim 1 or 7, wherein the additional learning algorithm performs the following steps:
(1) let the original PNN recognize all samples in V, and let V_0 be the set of correctly recognized samples;
(2) split the samples in V − V_0 as follows: V_T is the subset of V − V_0 whose characters belong to the Chinese characters the PNN has already learned; V_F is the subset of V − V_0 whose characters do not belong to the learned Chinese characters;
(3) for V_T: if the corresponding recognition network RN_i already identifies a sample correctly, RN_i need not be retrained; otherwise the sample is added to RN_i's original sample set and RN_i is retrained;
(4) for V_F: train new recognition networks RN_1', RN_2', …, RN_q', with the corresponding partitioned Chinese character subsets U_1', U_2', …, U_q';
(5) retrain the control network CN with the U ∪ V sample set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012102574590A CN102855479A (en) | 2012-07-24 | 2012-07-24 | Printed Chinese character recognition system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102855479A true CN102855479A (en) | 2013-01-02 |
Family
ID=47402056
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012102574590A Pending CN102855479A (en) | 2012-07-24 | 2012-07-24 | Printed Chinese character recognition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102855479A (en) |
Non-Patent Citations (4)
Title |
---|
何汉华 (He Hanhua): "Theory and Algorithms for Offline Handwritten Digit Recognition Based on Hopfield", Communications Technology (通信技术) *
张云杰等 (Zhang Yunjie et al.): "An Improved Fuzzy Enhancement Algorithm", Journal of Liaoning Technical University (Natural Science Edition) *
王先旺等 (Wang Xianwang et al.): "Application of Intelligent Neural Network System Principles to Printed Chinese Character Recognition", Journal of Sichuan University (Engineering Science Edition) *
王国胤等 (Wang Guoyin et al.): "A Parallel Neural Network Method for Chinese Character Recognition", Pattern Recognition and Artificial Intelligence *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105938558A (en) * | 2015-03-06 | 2016-09-14 | 松下知识产权经营株式会社 | Learning method |
CN105938558B (en) * | 2015-03-06 | 2021-02-09 | 松下知识产权经营株式会社 | Learning method |
CN110728526A (en) * | 2019-08-19 | 2020-01-24 | 阿里巴巴集团控股有限公司 | Address recognition method, apparatus and computer readable medium |
CN110728526B (en) * | 2019-08-19 | 2024-04-02 | 创新先进技术有限公司 | Address recognition method, device and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C05 | Deemed withdrawal (patent law before 1993) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20130102 |