AU2021107299A4 - A system for deep neural network based handwritten digit classification for low resource bengali script

Info

Publication number: AU2021107299A4
Application number: AU2021107299A
Authority: AU
Inventors: Amitava Choudhury; Abhijit Kumar
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-08-25
Filing date: 2021-08-25
Publication date: 2022-01-06
Anticipated expiration: 2029-08-25

Abstract

The system comprises an acquisition module for receiving images of handwritten Bangla numbers; a pre-processing module for random resizing, cropping and random vertical and horizontal flipping to the received images, wherein the images are thereby normalized for converting the images to have zero mean and unit variance; wherein interpolation is employed for predicting new pixel values of the resultant image; a convolutional Layer equipped with a filter for scanning a particular region of the image to produce a feature map, wherein parameters in the filter are learnable and these parameters are shared for the convolutional layer, that implies for a convolutional layer, the filter is having same weights reduces the number of parameters to optimize while making the convergence faster; and an activation function for computing weighted sum of the inputs along with the bias, wherein based on this weighted sum, it is decided if a node fires or not.

Description

A SYSTEM FOR DEEP NEURAL NETWORK BASED HANDWRITTEN DIGIT CLASSIFICATION FOR LOW RESOURCE BENGALI SCRIPT

FIELD OF THE INVENTION

The present disclosure relates to a system for deep neural network based handwritten digit classification for low resource Bengali script.

BACKGROUND OF THE INVENTION

Numeral systems represent numeric digits in a well-defined and understandable manner. In India, numerous regional numeral systems, along with the widely used Western numeral system, are used in day-to-day life. The use of handwritten language is more prevalent in India as compared to digitized scripts. Handwritten character recognition is gaining a lot of attention in the academic world. In India a large number of languages are spoken and written, and each of the regional numeral systems brings with it a different challenge as far as character recognition is concerned. By 2019, the projected number of internet users is 627 million out of a population of about 1.37 billion. This shows that only about 46% of the Indian population is connected to the web. The remaining portion of the population has no access to the Internet and relies on handwritten methods for official and personal work. This large population generates a huge volume of handwritten data and thereby creates a need for automated systems capable of recognizing and indexing the characters. The problem becomes more acute because most of the Indian languages are under-resourced.

Bangla is an Indo-Aryan script used in both the Bengali and Assamese languages. It is the official language of Bangladesh and is the second most widely spoken language of India. In India, the language is in common use especially in the states of West Bengal, Tripura, and Assam. Worldwide, it is spoken by about 260 million people and is the 6th most spoken language in the world. Bangla's cultural significance and importance are therefore undeniable.

In view of the foregoing discussion, it is clear that there is a need for a system for deep neural network based handwritten digit classification for low resource Bengali script.

SUMMARY OF THE INVENTION

The present disclosure seeks to provide a system for deep neural network based handwritten digit classification for robust recognition of Bengali numeric digits.

In an embodiment, a system for deep neural network based handwritten digit classification for low resource Bengali script is disclosed. The system includes an acquisition module for receiving images of handwritten Bangla numbers. The system further includes a pre-processing module for random resizing, cropping and random vertical and horizontal flipping of the received images, wherein the images are then normalized to have zero mean and unit variance. Interpolation is employed for predicting new pixel values of the resultant image. The system further includes a convolutional layer equipped with a filter for scanning a particular region of the image to produce a feature map, wherein the parameters in the filter are learnable and are shared across the convolutional layer. That is, within a convolutional layer the filter has the same weights at every position, which reduces the number of parameters to optimize while making convergence faster. The system further includes an activation function for computing the weighted sum of the inputs along with the bias, wherein based on this weighted sum, it is decided whether a node fires or not.

In an embodiment, a pre-processing technique that randomly varies the properties of the original image is used, which allows the system to predict characters in situations with low brightness, differing colors of the background/paper and the ink used, and differing saturation of the image.

In an embodiment, brightness is chosen from a uniform distribution lying between 1 and 3, contrast and saturation are each chosen from a uniform distribution between 1 and 3, and the hue of the image is allowed to change by a value drawn from the range -0.1 to 0.5.

In an embodiment, interpolation is achieved by predicting the value of a given pixel by looking at the pixels neighboring the pixel in question and using the values of the neighboring pixels to predict value of the current pixel.

In an embodiment, bicubic interpolation is employed to provide a sharper image with reduced interpolation artifacts such as aliasing and blurring.

In an embodiment, element-wise multiplication is performed between the pixels of the image within the receptive field of the filter, and the weights of the filter.

In an embodiment, a fast Fourier transform turns the convolution operation into element-wise multiplication, reducing computation.

In an embodiment, the first convolution layers extract simple features such as edges, and subsequent convolution layers learn more intricate and abstract patterns from the image.

In an embodiment, the network employs Rectified Linear Units as activation functions, together with pooling, a loss function, gradient descent, and regularization.

An object of the present disclosure is to develop a recognition system based on transfer learning for robust recognition of Bengali numeric digits.

To further clarify advantages and features of the present disclosure, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.

BRIEF DESCRIPTION OF FIGURES

These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:

Figure 1 illustrates a block diagram of a system for deep neural network based handwritten digit classification for low resource Bengali script in accordance with an embodiment of the present disclosure;

Figure 2 illustrates a working block diagram of ResNet-18 in accordance with an embodiment of the present disclosure; and

Figure 3 illustrates a graphical representation of training and validation accuracy and training and validation loss in accordance with an embodiment of the present disclosure.

Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not have necessarily been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help to improve understanding of aspects of the present disclosure. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having benefit of the description herein.

DETAILED DESCRIPTION

For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.

It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.

Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.

The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components preceded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting.

Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.

Referring to Figure 1, a block diagram of a system for deep neural network based handwritten digit classification for low resource Bengali script is illustrated in accordance with an embodiment of the present disclosure. The system 100 includes an acquisition module 102 for receiving images of handwritten Bangla numbers.

In an embodiment, a pre-processing module 104 is configured with the acquisition module 102 for random resizing, cropping and random vertical and horizontal flipping of the received images, wherein the images are then normalized to have zero mean and unit variance. Interpolation is employed for predicting new pixel values of the resultant image.

In an embodiment, a convolutional layer 106 is equipped with a filter for scanning a particular region of the image to produce a feature map, wherein the parameters in the filter are learnable and are shared across the convolutional layer 106. That is, within a convolutional layer 106 the filter has the same weights at every position, which reduces the number of parameters to optimize while making convergence faster.

In an embodiment, an activation function 108 is engaged with the convolutional layer 106 of a convolutional neural network for computing the weighted sum of the inputs along with the bias, wherein based on this weighted sum, it is decided whether a node fires or not.

The dataset used in this study consists of 6000 images of handwritten Bangla numbers. The images are all 32x32 pixels and each image is a unique handwritten variant of the required Bangla numeral. The dataset has been divided into training, validation and testing subsets to aid the training process of the neural network. The training split is used by the model to learn the features of an image and the corresponding classifications in a supervised learning paradigm. This split contains ten classes, one for each digit in the Bangla numeral system. Each class contains 420 unique handwritten characters for that numeral. The second split is the validation split, which is used to tune hyperparameters such as the learning rate, the number of epochs, and so forth. This split too contains ten classes, each having 162 images. Finally, a test split has been created to check the performance of the model. This subset has ten classes with 10 images in each class.

Pre-processing is an important aspect of any neural network-based classifier. The pre-processing techniques applied herein are discussed below:

Random resize, crop, and flip

Data augmentation is an important means of increasing the size of a dataset. It also proves useful in providing more vantage points for the model to learn from, which helps in generalizing the framework. It has been shown that small distortions that are visually hard to distinguish can lead to incorrect or no predictions. For all these reasons, random resizing, cropping and random vertical and horizontal flipping have been performed on the dataset. The augmentation is done by randomly cropping the original image. The crop size is set to between 0.08 and 1.0 times the size of the original image. A random aspect ratio is also chosen, between 3/4 and 4/3 of that of the original image. These settings have been verified to provide good results in image recognition tasks using CNNs. After these steps the image is resized back to the image size required by the CNN.
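The crop and flip steps above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: it assumes the 0.08 to 1.0 range refers to a fraction of the image area, samples the aspect ratio log-uniformly, and uses hypothetical helper names (sample_crop_box, random_flip) on images stored as nested Python lists.

```python
import math
import random


def sample_crop_box(height, width, rng, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3), attempts=10):
    """Sample (top, left, crop_h, crop_w) for one random crop."""
    area = height * width
    log_lo, log_hi = math.log(ratio[0]), math.log(ratio[1])
    for _ in range(attempts):
        # Assumption: the scale range is a fraction of the original image area.
        target_area = area * rng.uniform(scale[0], scale[1])
        # Log-uniform sampling treats ratios r and 1/r symmetrically.
        aspect = math.exp(rng.uniform(log_lo, log_hi))
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    # Fallback: keep the whole image if no valid box was found.
    return 0, 0, height, width


def random_flip(image, rng, p=0.5):
    """Flip a 2D nested-list image horizontally and vertically, each with probability p."""
    if rng.random() < p:
        image = [row[::-1] for row in image]  # horizontal flip
    if rng.random() < p:
        image = image[::-1]  # vertical flip
    return image


rng = random.Random(0)
top, left, crop_h, crop_w = sample_crop_box(32, 32, rng)
print("crop box:", top, left, crop_h, crop_w)
```

The cropped region would then be resized back to the network input size using interpolation, as described in the Interpolation section.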

Random changes in brightness, saturation, hue, and contrast

In order to create a framework capable of recognizing Bengali handwritten characters in a number of situations, a pre-processing technique that randomly varies the properties of the original image is used. This augmentation allows the model to predict characters in situations with low brightness, differing colors of the background/paper and the ink used, and differing saturation of the image. The brightness is chosen from a uniform distribution lying between 1 and 3. Similarly, contrast and saturation are each chosen from a uniform distribution between 1 and 3. The hue of the image is allowed to change by a value drawn from the range -0.1 to 0.5. This process makes the proposed model more robust.
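As a rough illustration of this kind of jitter, the sketch below draws one set of factors from the ranges stated above and applies brightness and contrast to a short list of grayscale intensities in [0, 1]. The function names and the specific brightness and contrast formulas (scaling, and blending away from the mean) are assumptions made for the example; saturation and hue require a color image and are only sampled here.

```python
import random


def sample_jitter(rng):
    """Draw one set of jitter factors from the ranges given in the text."""
    return {
        "brightness": rng.uniform(1.0, 3.0),
        "contrast": rng.uniform(1.0, 3.0),
        "saturation": rng.uniform(1.0, 3.0),
        "hue": rng.uniform(-0.1, 0.5),
    }


def clamp01(value):
    """Keep an intensity inside the valid [0, 1] range."""
    return max(0.0, min(1.0, value))


def adjust_brightness(pixels, factor):
    """Scale every intensity by factor, then clamp (assumed formula)."""
    return [clamp01(p * factor) for p in pixels]


def adjust_contrast(pixels, factor):
    """Push intensities away from their mean by factor, then clamp (assumed formula)."""
    mean = sum(pixels) / len(pixels)
    return [clamp01(mean + factor * (p - mean)) for p in pixels]


rng = random.Random(0)
factors = sample_jitter(rng)
pixels = [0.1, 0.2, 0.4]
jittered = adjust_contrast(adjust_brightness(pixels, factors["brightness"]), factors["contrast"])
print(factors, jittered)
```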

Normalize

Normalization is the process of centering an image. In an image, the ranges of the features in different color channels can be vastly different. This could cause some features to dominate over others based only on numerical magnitude rather than the importance of the feature. Normalization converts the images to have zero mean and unit variance. Normalization has proven to increase the accuracy of neural network-based classifiers such as CNNs. In this study, the mean and standard deviation are calculated for all images across all color channels. Then each channel's value is modified by using the following formula:

$$\widehat{\text{input}}_{ch} = \frac{\text{input}_{ch} - \text{mean}_{ch}}{\text{stddev}_{ch}} \qquad (1)$$
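A minimal sketch of per-channel normalization is given below, assuming each channel is available as a flat list of pixel values and that the population standard deviation is used; the helper names are illustrative only.

```python
import math


def channel_stats(channel_values):
    """Return the mean and (population) standard deviation of one color channel."""
    mean = sum(channel_values) / len(channel_values)
    variance = sum((v - mean) ** 2 for v in channel_values) / len(channel_values)
    return mean, math.sqrt(variance)


def normalize_channel(channel_values, mean, stddev):
    """Apply (input - mean) / stddev to every value of the channel."""
    return [(v - mean) / stddev for v in channel_values]


channel = [1.0, 2.0, 3.0, 4.0]
mean, stddev = channel_stats(channel)
normalized = normalize_channel(channel, mean, stddev)
print(normalized)
```

After this step the channel has zero mean and unit variance, up to floating-point rounding.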

Interpolation

Interpolation is the technique by which new data points are predicted within a known range of existing data points. When cropping and resizing images, interpolation aids in the prediction of new pixel values of the resultant image. This is achieved by predicting the value of a given pixel from the values of its neighboring pixels. In this study, bicubic interpolation has been used. This technique provides a sharper image with reduced interpolation artifacts such as aliasing and blurring. It produces sharper images when compared to linear and bilinear interpolation. Instead of considering 4 pixels in the nearest 2x2 matrix of neighboring pixels as in bilinear interpolation, bicubic interpolation uses 16 pixels in the nearest 4x4 matrix of neighboring pixels. This increases the number of calculations but provides smoother images.
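To make the 4x4 neighborhood concrete, the following sketch implements one common form of bicubic interpolation, the cubic convolution kernel with parameter a = -0.5. The kernel choice, the border clamping and the function names are assumptions for illustration; the disclosure does not specify which bicubic variant is used.

```python
import math


def cubic_kernel(t, a=-0.5):
    """Cubic convolution weight for a sample at distance t (a = -0.5 is an assumed choice)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t ** 3 - (a + 3) * t ** 2 + 1
    if t < 2:
        return a * t ** 3 - 5 * a * t ** 2 + 8 * a * t - 4 * a
    return 0.0


def cubic_interp_1d(samples, x):
    """Interpolate at real-valued position x from the 4 nearest samples."""
    base = math.floor(x)
    total = 0.0
    for k in range(base - 1, base + 3):
        idx = min(max(k, 0), len(samples) - 1)  # clamp indices at the borders
        total += samples[idx] * cubic_kernel(x - k)
    return total


def bicubic_interp(image, y, x):
    """Interpolate at (y, x) from the 4x4 = 16 nearest pixels of a nested-list image."""
    base = math.floor(y)
    total = 0.0
    for r in range(base - 1, base + 3):
        ridx = min(max(r, 0), len(image) - 1)
        # Interpolate along the row first, then weight the row by its vertical distance.
        total += cubic_interp_1d(image[ridx], x) * cubic_kernel(y - r)
    return total


image = [[float(r + c) for c in range(5)] for r in range(5)]
print(bicubic_interp(image, 1.5, 1.5))
```

Because the four weights along each axis sum to one, a linear ramp such as the test image above is reproduced without distortion away from the borders.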

Convolutional Neural Networks

Convolutional Neural Networks have proven to be quite robust in their feature extraction. This makes them useful in image recognition, natural language processing, etc.

Convolution

The convolutional layer 106 is the first layer in a Convolutional Neural Network and it takes the handwritten image as input. Receiving an entire image with all of its pixel values as input, as done in the fully connected layers of an Artificial Neural Network, not only drastically increases the computation overhead but also introduces irrelevant features into the learning. This negatively impacts the model's performance and generalization. Convolutional layers instead use a filter to scan a particular region of the image. This region is known as the receptive field. The parameters in the filter are learnable and are shared across the convolutional layer 106, which means that within a convolutional layer 106 the filter has the same weights at every position. This reduces the number of parameters to optimize while making convergence faster. Element-wise multiplication is performed between the pixels of the image within the receptive field of the filter and the weights of the filter. A feature map is produced as the output of the convolutional layer 106. A fast Fourier transform turns the convolution operation into element-wise multiplication in the frequency domain, reducing computation. The formula used for the convolution operation is

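$$\text{featuremap} = \text{input} * \text{kernel} = \mathcal{F}^{-1}\left(\sqrt{2\pi}\, \mathcal{F}[\text{input}]\, \mathcal{F}[\text{kernel}]\right) \qquad (2)$$

In equation (2), the convolution operation is denoted by $*$. $\mathcal{F}$ is the Fourier transform whereas $\mathcal{F}^{-1}$ is the inverse Fourier transform. $\sqrt{2\pi}$ is the normalization constant.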
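The relationship in equation (2) can be checked numerically on a small one-dimensional example. The sketch below is an assumption-laden illustration: it uses a naive discrete Fourier transform rather than a fast one, works with circular convolution on equal-length sequences, and uses the unnormalized forward transform with a 1/n inverse, under which the normalization constant becomes 1 (the constant depends on the transform convention).

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform; an FFT computes the same result faster."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_convolve(signal, kernel):
    """Direct circular convolution of two equal-length sequences."""
    n = len(signal)
    return [sum(signal[m] * kernel[(i - m) % n] for m in range(n)) for i in range(n)]


def convolve_via_dft(signal, kernel):
    """Convolution as element-wise multiplication in the frequency domain."""
    spectrum = [a * b for a, b in zip(dft(signal), dft(kernel))]
    return [v.real for v in dft(spectrum, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0, 0.0]  # kernel zero-padded to the signal length
print(circular_convolve(signal, kernel))
print(convolve_via_dft(signal, kernel))
```

Both routes give the same values up to rounding; the saving in practice comes from replacing the naive transform with a fast Fourier transform.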

Convolution layers extract features from the image. They start with learning simple features like edges, etc., and with subsequent convolution layers learn more intricate and abstract patterns from the image.

The size of the output produced as a result of the convolution operation is given by the formula (for each dimension)

$$\text{dimension}_{out} = \left\lfloor \frac{\text{dimension} + 2p - k}{s} \right\rfloor + 1 \qquad (3)$$

In equation (3), dimension is the length of the corresponding dimension of the image (height or width). p is the padding applied to the image. Padding is the process of adding zeroes along the height and width of the image. Without padding, the kernel lands on the corners much less frequently than on the pixels in the center, which skews the learning of the network. Furthermore, without padding the feature map shrinks after each convolution operation, which would hinder the stacking of many layers. For all of these reasons padding is performed on the images. k is the kernel size. s is the stride length, that is, the distance between successive kernel positions.
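The sliding-filter computation and the output-size formula can be sketched together as below. This is an illustrative toy, assuming a square kernel, zero padding and nested-list images; like most CNN libraries it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv_output_size(dimension, kernel, padding=0, stride=1):
    """Output length along one dimension: floor((dimension + 2p - k) / s) + 1."""
    return (dimension + 2 * padding - kernel) // stride + 1


def conv2d(image, kernel, padding=0, stride=1):
    """Slide a square kernel over a zero-padded image and return the feature map."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    # Zero padding: surround the image with a border of zeroes.
    padded = [[0.0] * (width + 2 * padding) for _ in range(height + 2 * padding)]
    for r in range(height):
        for c in range(width):
            padded[r + padding][c + padding] = image[r][c]
    out_h = conv_output_size(height, k, padding, stride)
    out_w = conv_output_size(width, k, padding, stride)
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Element-wise multiply the receptive field by the shared filter weights.
            for u in range(k):
                for v in range(k):
                    acc += padded[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d(image, kernel))
print(conv_output_size(32, 3, padding=1, stride=1))
```

With a 3x3 kernel, a padding of 1 and a stride of 1, a 32-pixel dimension stays at 32, which is why padding allows many layers to be stacked without the feature map shrinking.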

Activation function

An activation function 108 is applied to the weighted sum of the inputs along with the bias. Based on this weighted sum, it is decided whether a node fires or not. Activation functions 108 can be linear or non-linear. Non-linear activation functions 108 are used to allow for more complex learning by the network. Some activation functions 108 used in the network are:

Rectified Linear Units (ReLU)

The Rectified Linear Unit activation function clamps negative values to 0. That is, for values of x that are less than 0, the output is zero. For values of x greater than 0, the output is a linear function of x. ReLU is computationally cheaper than activation functions 108 such as tanh and sigmoid, which involve expensive operations like exponentiation. The formula behind rectified linear units is:

$$h^{(i)} = \max\left(w^{(i)T} x,\, 0\right) = \begin{cases} w^{(i)T} x, & \text{if } w^{(i)T} x > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

In equation (4), $h^{(i)}$ gives the activation of a hidden layer, $w^{(i)}$ is the weight matrix of the hidden layer, and x is the input.

ReLU faces an issue where, for negative inputs, the output is zero, so gradient-based optimization techniques will not update that neuron. In other words, gradients flow back through a unit during backpropagation only if its output in the forward pass was positive. To combat these issues with ReLU, leaky ReLUs have been proposed.
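A small sketch of ReLU and leaky ReLU applied to a weighted sum is shown below. The slope of 0.01 for the leaky variant and the helper names are assumptions chosen for illustration.

```python
def relu(z):
    """Rectified Linear Unit: negative inputs are clamped to zero."""
    return max(z, 0.0)


def leaky_relu(z, slope=0.01):
    """Leaky ReLU: negative inputs keep a small slope (0.01 is an assumed value)."""
    return z if z > 0 else slope * z


def hidden_activation(weights, x, bias=0.0, activation=relu):
    """Apply an activation to the weighted sum of the inputs plus the bias."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return activation(z)


weights = [1.0, -1.0]
x = [2.0, 5.0]
print(hidden_activation(weights, x))                          # weighted sum is -3, ReLU gives zero
print(hidden_activation(weights, x, activation=leaky_relu))   # small negative value instead of zero
```

The leaky variant keeps a non-zero output, and hence a non-zero gradient, for negative weighted sums, which is what allows such a neuron to keep being updated.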

Pooling

Pooling layers perform downsampling and help in dimensionality reduction, which aids in achieving translational invariance. This layer also helps in avoiding overfitting by making the network's learning more general. Like the convolution operation, pooling has hyperparameters such as filter size, stride, and padding. There are two types of pooling operations, Max pooling and Global Average pooling. In Max pooling a filter is applied on the feature map. The filter is then moved over the whole feature map in steps specified by the stride.

$$a = \max_{N \times N}\left(a\, u(n, n)\right) \qquad (7)$$

Equation (7) specifies the max pooling operation; it finds the maximum value encountered by the filter. Here, u(n,n) is the filter applied on the feature map.

The output dimensions are given by:

$$\text{dimension}_{out} = \left\lfloor \frac{\text{dimension} - k}{s} \right\rfloor + 1 \qquad (8)$$
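Max pooling and its output size can be sketched as follows, assuming a square window, no padding and a nested-list feature map; the function names are illustrative.

```python
def pool_output_size(dimension, kernel, stride):
    """Output length along one dimension: floor((dimension - k) / s) + 1."""
    return (dimension - kernel) // stride + 1


def max_pool2d(feature_map, kernel=2, stride=2):
    """Keep the maximum value seen by the window at each position."""
    out_h = pool_output_size(len(feature_map), kernel, stride)
    out_w = pool_output_size(len(feature_map[0]), kernel, stride)
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + u][j * stride + v]
                for u in range(kernel)
                for v in range(kernel)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(max_pool2d(feature_map))
```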

Loss Function

For a network to learn, it is important to first evaluate how distant from the actual value the predictions are. To do this in a quantitative manner, loss functions are used. Easily differentiable functions are chosen as loss functions to ease the task of back propagation. In this method, Cross-Entropy loss has been used as the loss function.

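$$\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}} \qquad (9)$$

In equation (9), W is the weight matrix, b is the bias, $x_i$ is the i-th training sample, $y_i$ is the class of the i-th training sample, N is the total number of samples, n is the number of classes, and $W_j$ and $W_{y_i}$ are the j-th and $y_i$-th columns of the weight matrix.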
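The cross-entropy loss can be sketched as below for a linear classifier. This is a minimal illustration: the weight matrix is stored as a list of columns (one per class), the names are hypothetical, and the maximum score is subtracted before exponentiation purely for numerical stability, which does not change the value of the loss.

```python
import math


def class_scores(weight_columns, biases, x):
    """Compute W_j^T x + b_j for every class j."""
    return [sum(w * xi for w, xi in zip(col, x)) + b for col, b in zip(weight_columns, biases)]


def cross_entropy_loss(weight_columns, biases, samples, labels):
    """Average negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for x, y in zip(samples, labels):
        scores = class_scores(weight_columns, biases, x)
        shift = max(scores)  # stabilizes the exponentials
        log_denominator = shift + math.log(sum(math.exp(s - shift) for s in scores))
        total += log_denominator - scores[y]
    return total / len(samples)


weight_columns = [[1.0], [-1.0]]  # two classes, one input feature
biases = [0.0, 0.0]
print(cross_entropy_loss(weight_columns, biases, [[2.0]], [0]))  # true class has the higher score
print(cross_entropy_loss(weight_columns, biases, [[2.0]], [1]))  # true class has the lower score
```

The loss is small when the true class receives the highest score and grows as the prediction moves away from the true label.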

Gradient Descent

Gradient Descent is performed on the learnable parameters of the network. In this operation, the parameters P are varied by a small change $\delta P \ll P$. The small variation is chosen in such a manner that the loss of the network reduces. In this method, Stochastic Gradient Descent (SGD) has been used. In SGD, the parameters are updated for each training example, which removes redundant computation and increases the speed of learning.

$$P = P - \eta \cdot \nabla_{P} J\left(P;\, x^{(i)};\, y^{(i)}\right) \qquad (10)$$

In equation (10), P are the parameters, $\eta$ is the learning rate, $J(P; x^{(i)}; y^{(i)})$ is the loss function, $x^{(i)}$ is the i-th training example, $y^{(i)}$ is the label of the i-th training example, and $\nabla_{P}$ is the gradient of the loss function with respect to the parameters.

SGD faces difficulty in finding the local minima of an error space characterized by difference in "steepness" across different dimensions. In such scenarios, SGD makes slower progress towards the minima and tends to oscillate. Momentum diminishes this oscillation and increases the speed of SGD in the required direction:

$$v_t = \gamma v_{t-1} + \eta \nabla_{P} J(P) \qquad (11)$$

$$P = P - v_t$$

In equation (11), $\gamma$ is the momentum term, which has been set to 0.9 in this method. The learning rate has been set to 0.001 in this study. The learning rate is made to decay after every 7 epochs by a factor of 0.1. Decaying the learning rate helps convergence to the local minima and leads to higher accuracy.
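The momentum update and the step-wise learning-rate decay can be sketched as follows, using the values stated above (momentum 0.9, initial learning rate 0.001, decay by a factor of 0.1 every 7 epochs). The toy loss J(P) = P^2 and the function names are assumptions used only to make the example runnable.

```python
def step_decay_lr(base_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate after decaying by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def momentum_step(params, velocity, grads, lr, momentum=0.9):
    """One update: v_t = momentum * v_{t-1} + lr * grad, then P = P - v_t."""
    new_velocity = [momentum * v + lr * g for v, g in zip(velocity, grads)]
    new_params = [p - v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


params, velocity = [1.0], [0.0]
for epoch in range(21):
    lr = step_decay_lr(0.001, epoch)
    grads = [2.0 * p for p in params]  # gradient of the toy loss J(P) = P^2
    params, velocity = momentum_step(params, velocity, grads, lr)
print(params, step_decay_lr(0.001, 20))
```

In a real training run the gradient would come from backpropagation over a training example or mini-batch, and one epoch would contain many such updates rather than the single update used here.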

Regularization

A major problem faced while training CNNs is overfitting. Overfitting leads to good performance on the training set but extremely poor performance on the validation set. The network, in this state, learns the training data too well and loses the capability to generalize. To combat this problem, regularization techniques such as L2, L1, Dropout, etc. are used. In this study Dropout has been used for regularization.

In Dropout, co-adaptations are reduced by randomly dropping some connections in a network. Because of this, there is no guarantee that a particular hidden neuron is available, so the network cannot rely on any single neuron.
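A minimal sketch of Dropout on a list of activations is given below. It assumes the commonly used "inverted" variant, in which surviving activations are rescaled during training so that nothing needs to change at inference time; the disclosure does not state which variant or drop probability is used, so both are assumptions here.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    # Assumption: inverted dropout, so survivors are scaled by 1 / keep.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(dropout(hidden, 0.5, rng))                   # some units are zeroed, the rest are scaled
print(dropout(hidden, 0.5, rng, training=False))   # unchanged at inference time
```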

Pre-trained Networks

AlexNet

AlexNet, trained on the ImageNet dataset, is the top performer of the ILSVRC 2010. It consists of eight layers in total, five of which are convolutional and three of which are fully connected. Finally, the Softmax function is used to output the class scores. The activation function used is the Rectified Linear Unit (ReLU). To prevent overfitting, data augmentation techniques and Dropout are used. The number of parameters is about 60 million. Owing to its smaller size and smaller number of parameters, it is easier to train in comparison to VGGNet. This lightness comes at the cost of accuracy.

Figure 2 illustrates a working block diagram of ResNet-18 in accordance with an embodiment of the present disclosure. ResNet seeks to solve the problem of loss in accuracy as the network becomes deeper. This problem of vanishing gradients and degradation of accuracy is dealt with by means of skip or shortcut connections in the ResNet model. (Diagrammatic representation of the residual block.) Instead of approximating a function directly, the layers try to approximate a residual function. Formally, if F(x) is the function that the layers are trying to approximate, and x is the input, the residual function is denoted by R(x) = F(x) - x, and the original function to approximate now becomes R(x) + x.
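The idea of a shortcut connection can be illustrated with a toy sketch in which the stacked layers are replaced by an arbitrary function. The name residual_fn is a hypothetical stand-in for those layers, and the vectors are plain Python lists; this is not the ResNet-18 implementation.

```python
def residual_block(x, residual_fn):
    """Return R(x) + x, where residual_fn plays the role of the stacked layers."""
    r = residual_fn(x)
    # The skip connection adds the unchanged input back to the learned residual.
    return [ri + xi for ri, xi in zip(r, x)]


def zero_residual(x):
    """A residual of zero turns the whole block into the identity mapping."""
    return [0.0 for _ in x]


def small_residual(x):
    """A hypothetical learned correction: a small fraction of the input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))
print(residual_block(x, small_residual))
```

When the best mapping is close to the identity, the layers only need to drive the residual towards zero, which is easier to learn than reproducing the input through several non-linear layers.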

Figure 3 illustrates a graphical representation of training and validation accuracy and training and validation loss in accordance with an embodiment of the present disclosure. The 18-layer variant of the residual network, ResNet-18, has been used. It contains eighteen layers, seventeen of which are convolutional layers, followed by one fully connected layer which produces the final output. A Batch Normalization layer is present after each convolutional layer. Batch Normalization is used for the normalization of the inputs inside the network. Every mini-batch is normalized to a unit standard deviation and a mean of zero. The images are resized to 24 x 24 pixels. Data augmentation techniques such as random cropping and random changes in brightness and saturation, along with several affine transformations, are applied as discussed in the earlier sections.

Two approaches are used for the recognition task. In the first approach, the pre-trained ResNet-18 model is fine-tuned on our dataset. In this approach all weights are updatable; it is the architecture of ResNet-18 that is used. The final fully connected layer is transformed to better match our dataset. A decaying learning rate is used to improve the performance. Regularization techniques such as Dropout are also used. This approach yielded an accuracy of 96%. The accuracy for each character is shown in Figure 3. The model performed exceptionally well with the digits 2, 3, 4, 5, 6, and 8, producing an accuracy of 100%. The model did not perform as well with the digits 9 and 1, perhaps due to further ambiguity in their structure.

In the second approach, the ResNet-18 model is used as a feature extractor and hence the weights of the underlying network are not allowed to change. Only the final fully connected layer is fine-tuned on the dataset. The same data augmentation techniques are applied as in the previous approach. A decaying learning rate is also used. This approach yielded an accuracy of 60%, which is considerably worse than the fine-tuned approach.

The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.

Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.

Claims

WE CLAIM:

1. A system for deep neural network based handwritten digit classification for low resource Bengali script, the system comprises:

an acquisition module for receiving images of handwritten Bangla numbers; a pre-processing module for random resizing, cropping and random vertical and horizontal flipping to the received images, wherein the images are thereby normalized for converting the images to have zero mean and unit variance; wherein interpolation is employed for predicting new pixel values of the resultant image; a convolutional Layer equipped with a filter for scanning a particular region of the image to produce a feature map, wherein parameters in the filter are learnable and these parameters are shared for the convolutional layer, that implies for a convolutional layer, the filter is having same weights reduces the number of parameters to optimize while making the convergence faster; and an activation function for computing weighted sum of the inputs along with the bias, wherein based on this weighted sum, it is decided if a node fires or not.

2. The system as claimed in claim 1, wherein a pre-processing technique that randomly varies the properties of the original image is used which allows to predict characters in situations with low brightness, differing colors of the background/paper and the ink used, and the saturation of the image.

3. The system as claimed in claim 1, wherein brightness is chosen from a uniform distribution lying between 1 and 3, contrast and saturation is chosen from the uniform distribution that is between 1 and 3 and hue of the image is allowed to be changed from the distribution -0.1 to 0.5.

4. The system as claimed in claim 1, wherein interpolation is achieved by predicting the value of a given pixel by looking at the pixels neighboring the pixel in question and using the values of the neighboring pixels to predict value of the current pixel.

5. The system as claimed in claim 1, wherein bicubic interpolation is employed to provide a sharper image with reduced interpolation artifacts like Aliasing, Blurring, etc.

6. The system as claimed in claim 1, wherein element-wise multiplication is performed between the pixels of the image within the receptive field of the filter, and the weights of the filter.

7. The system as claimed in claim 1, wherein Fast Fourier transform turns the convolution operation into element wise multiplication reducing computation.

8. The system as claimed in claim 1, wherein convolution layers extract simple features like edges, etc., and with subsequent convolution layers learn more intricate and abstract patterns from the image.

9. The system as claimed in claim 1, wherein the activation functions are Rectified Linear Units, pooling, loss function, gradient descent, and regularization.

Figure 1 (block diagram: Acquisition Module 102, Pre-Processing Module 104, Convolutional Layer 106, Activation Function 108)

Figure 2

Figure 3