CN113723421B - Zero-shot Chinese character recognition method based on matching category embedding - Google Patents

Zero-shot Chinese character recognition method based on matching category embedding

Info

Publication number
CN113723421B
CN113723421B (Application CN202111038228.6A)
Authority
CN
China
Prior art keywords
chinese character
embedding
category
embedded
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111038228.6A
Other languages
Chinese (zh)
Other versions
CN113723421A (en)
Inventor
黄宇浩
金连文
彭德智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202111038228.6A
Publication of CN113723421A
Application granted
Publication of CN113723421B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention relates to a zero-shot Chinese character recognition method based on matching category embedding, which comprises the following steps: extracting visual features from a Chinese character text image; performing category embedding of the Chinese character classes, in which each character is hierarchically decomposed into its components by a hierarchical-decomposition embedding algorithm and the corresponding embedding vector is computed; mapping the category embeddings into the visual space with a bidirectional embedding transfer module, so that the dimension of the category embedding equals the dimension of the visual space while the original information of the character classes is preserved; and matching the visual features of the text image against the category embedding information with a distance-based CTC decoder to output the final recognition result. By matching category embeddings, the invention achieves zero-shot Chinese character text recognition and is suitable both for long Chinese text recognition and for zero-shot Chinese character recognition.

Description

Zero-shot Chinese character recognition method based on matching category embedding
Technical Field
The invention relates to the technical field of pattern recognition and artificial intelligence, and in particular to a zero-shot Chinese character recognition method based on matching category embedding.
Background
Chinese characters are among the oldest writing systems in the world and remain carriers of Chinese history and culture. Research on Chinese character recognition enables the digitization of historical documents and is therefore of great value for the preservation of cultural heritage. The Chinese character set is enormous, however: beyond the roughly 4,000 characters in daily use, more than 85,000 character classes are recorded in historical documents and academic archives, most of them rare characters, complex characters, or variant characters whose samples are difficult to collect manually. Current Chinese text recognition models typically combine a convolutional neural network with CTC decoding or attention decoding and follow a data-driven scheme: a large amount of data is collected or synthesized for every character class to train the model, which works for common characters. For rare, complex, and variant characters, real samples are hard to obtain and difficult to synthesize, so collecting and labeling sufficient data is costly in both time and money.
To address these problems, a zero-shot Chinese character recognition method based on matching category embedding is adopted, which recognizes rare, complex, and variant characters by learning only the component features present in common Chinese character samples.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing a zero-shot Chinese character recognition method based on matching category embedding, which solves the problem of zero-shot Chinese character recognition and enables the recognition of rare, complex, and variant character samples.
In order to achieve the above object, the present invention provides the following solution:
A zero-shot Chinese character recognition method based on matching category embedding comprises the following steps:
extracting visual features from a Chinese character text image;
performing category embedding of the Chinese character classes: hierarchically decomposing each character into its components with a hierarchical-decomposition embedding algorithm and computing the corresponding embedding vector;
mapping the category embedding of the Chinese character classes into a visual space with a bidirectional embedding transfer module, so that the dimension of the category embedding equals the dimension of the visual space while the original information of the Chinese character classes is preserved;
and matching the visual features of the Chinese character text image against the category embedding information with a distance-based CTC decoder, and outputting the final recognition result of the Chinese character text image.
Preferably, a text encoder based on a convolutional neural network is used to extract the visual features of the Chinese character text image.
Preferably, extracting the visual features of the Chinese character text image with the convolutional-neural-network text encoder specifically includes:
using a ResNet18 model as the backbone network, removing its last fully connected layer, and replacing the final global average pooling layer with pooling over the feature-map height only, so that the output feature map has a height of 1 while its width is unchanged.
Preferably, a dropout strategy with the dropout probability set to 0.3 is applied at the output of the last convolutional layer of the backbone network to prevent the network from overfitting.
Preferably, the hierarchical-decomposition embedding algorithm specifically includes:
obtaining the components and structures of a Chinese character from its ideographic description sequence; then embedding the components and structures of the character according to the embedding function to obtain the corresponding category embedding, where the function is expressed as formula (1):
φ = Σ_{n_i∈R} v_{n_i}·y_{n_i} + λ·Σ_{n_j∈S} v_{n_j}·y_{n_j} (1)
where n_i denotes a component in the component set R, n_j denotes a structure in the structure set S, y_n is the one-hot encoding vector of a component or structure, λ is a hyperparameter set to 0.5, and v_n is the influence factor of a component or structure, computed by formula (2):
where α and β are hyperparameters set to 0.5 and 0.001 respectively, p_i denotes a node on the path from the root node to a leaf node, and l is the length of the path.
Preferably, the bidirectional embedding transfer module consists of a forward fully connected layer and a reverse fully connected layer, and the two fully connected layers share parameters.
Preferably, the forward fully connected layer maps the category embedding of the Chinese characters into the visual space so that the dimension of the category embedding equals the dimension of the visual features of the text image.
Preferably, the reverse fully connected layer is formed from the transpose of the parameter matrix of the forward fully connected layer; the category embedding can be reconstructed through the reverse fully connected layer, and a reconstruction loss function computes the mean square error between the reconstructed category embedding and the original category embedding, so that the category embedding mapped into the visual space retains its original information.
Preferably, the specific operations of the distance-based CTC decoder include:
computing the distance between the visual features and the Chinese character category embeddings with a cosine similarity function, expressed as:
d(V, Φ') = (V·Φ')/(‖V‖·‖Φ'‖)
where V denotes the visual features and Φ' denotes the mapped category embeddings; after the cosine similarity between the visual features and the category embeddings is computed, it is substituted into a distance-based CTC loss function, which serves as the optimization target of the network;
the distance-based CTC loss function is expressed as:
where l_i is a label and α is a learnable parameter that adjusts the magnitude of the cosine similarity.
The beneficial effects of the invention are as follows:
(1) The invention designs a zero-shot recognition model for Chinese character text, addressing the dependence of existing text recognition methods on large amounts of labeled training data and their difficulty in recognizing zero-shot data. The resulting text recognition model has better generalization ability, can recognize variant, rare, and complex characters, is simple and flexible to implement, and can be adapted to existing text recognition frameworks.
(2) Most existing zero-shot recognition methods focus only on the recognition of isolated Chinese characters; the invention addresses zero-shot recognition of Chinese text, which is more challenging and of greater practical value.
(3) By adopting the matching category embedding method and the distance-based CTC decoder, the model can both recognize zero-shot samples and process long texts, overcoming the long training time and unsuitability for long-text recognition of existing attention-based zero-shot recognition methods.
Drawings
To more clearly illustrate the embodiments of the invention and the technical solutions of the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the invention; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a text encoder of the present invention;
FIG. 3 is a schematic diagram of the hierarchical decomposition structure of a Chinese character according to the present invention;
FIG. 4 is a schematic diagram of the bidirectional embedding transfer module of the present invention;
fig. 5 is a schematic diagram of a distance-based CTC decoder of the present invention.
Detailed Description
The embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. The embodiments described are only some, not all, embodiments of the invention; all other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
To make the above objects, features, and advantages of the invention more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and the detailed description.
The zero-shot Chinese character recognition method based on matching category embedding of the invention, shown in FIG. 1, comprises the following steps:
s1, extracting visual features of a Chinese character text image, wherein a text encoder based on a convolutional neural network is adopted to extract the visual features of the Chinese character text image, and the method specifically comprises the following steps:
the ResNet18 model is used as a backbone network to extract visual features of the text image, as shown in FIG. 2, and the text encoder takes the text image as input and outputs a one-dimensional sequence of visual features. The ResNet18 model removes the last full connection layer of the network, so that the network is only used for extracting the features, and the height of the output feature map is subjected to average pooling, so that the output height of the feature map is 1, and the width of the feature map is kept unchanged, thereby obtaining a one-dimensional visual feature sequence. In addition, in order to prevent network overfitting, a dropout strategy is adopted at the output of the last convolutional layer to prevent network overfitting, and the probability of dropout is set to 0.3.
S2, performing category embedding of the Chinese character classes, hierarchically decomposing each character into its components with the hierarchical-decomposition embedding algorithm and computing the corresponding embedding vectors, which specifically includes:
The components and structures of a Chinese character are obtained from its ideographic description sequence, as shown in FIG. 3. The components and structures of the character are then embedded according to the function of the hierarchical-decomposition embedding algorithm to obtain the corresponding category embedding, where the function is expressed as:
φ = Σ_{n_i∈R} v_{n_i}·y_{n_i} + λ·Σ_{n_j∈S} v_{n_j}·y_{n_j}
where n_i denotes a component in the component set R, n_j denotes a structure in the structure set S, y_n is the one-hot encoding vector of a component or structure (all one-hot vectors have the same dimension), λ is a hyperparameter balancing the two terms and is set to 0.5, and v_n is the influence factor of a component or structure, computed by the following formula:
where α and β are hyperparameters set to 0.5 and 0.001 respectively, p_i denotes a node on the path from the root node to a leaf node, and l is the length of the path. In addition, the "blank" symbol used by CTC is represented by its own one-hot encoding vector. Because the hierarchical-decomposition embedding algorithm embeds component and structure information for every Chinese character class, it can represent both seen and unseen character classes.
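As a rough illustration of how such a hierarchical-decomposition embedding can be computed, the sketch below builds a category embedding from a toy decomposition tree. The component and structure vocabularies, the example character, and the depth-decay weight alpha**depth (used as a stand-in because formula (2) is not reproduced in this text) are all assumptions for illustration, not the patented formula.

```python
# Illustrative sketch of hierarchical-decomposition embedding (HDE).
# alpha**depth stands in for the influence factor v_n; vocabularies and tree are hypothetical.
import numpy as np

COMPONENTS = ["亻", "木", "口", "日", "寸"]       # component set R (toy example)
STRUCTURES = ["⿰", "⿱", "⿸"]                     # structure set S (toy example)
LAMBDA, ALPHA = 0.5, 0.5

def one_hot(symbol: str) -> np.ndarray:
    vocab = COMPONENTS + STRUCTURES + ["<blank>"]  # CTC blank also gets a one-hot slot
    vec = np.zeros(len(vocab))
    vec[vocab.index(symbol)] = 1.0
    return vec

def hde_embedding(tree, depth: int = 0) -> np.ndarray:
    """tree = (symbol, [children]); leaves are components, internal nodes are structures."""
    symbol, children = tree
    weight = ALPHA ** depth                        # stand-in for influence factor v_n
    term = weight * one_hot(symbol)
    if symbol in STRUCTURES:                       # structure terms are scaled by lambda
        term = LAMBDA * term
    for child in children:
        term = term + hde_embedding(child, depth + 1)
    return term

# "村" decomposes as ⿰(木, 寸): a left-right structure with two components.
phi = hde_embedding(("⿰", [("木", []), ("寸", [])]))
print(phi.round(3))
```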
S3, based on the Chinese character category embedding, mapping the category embedding into the visual space with a bidirectional embedding transfer module, which specifically includes:
The bidirectional embedding transfer module, shown in FIG. 4, consists of a forward fully connected layer and a reverse fully connected layer that share parameters. The forward fully connected layer maps the category embedding of the Chinese characters into the visual space so that the dimension of the category embedding equals the dimension of the visual features of the text image; the layer is a linear mapping, and the input category embedding passes through it to produce the projected embedding. Because no additional constraint is placed on the projected embedding, its original information would gradually be lost during training and the generalization ability of the network would weaken, so a reverse fully connected layer is added as an extra constraint to preserve that information. To simplify the computation, the transpose of the parameter matrix of the forward fully connected layer is used as the parameters of the reverse fully connected layer, which takes the projected embedding as input and outputs a reconstructed embedding. So that the network learns this property, a reconstruction loss function computes the mean square error between the reconstructed and original category embeddings, allowing the category embedding mapped into the visual space to retain its original information. The reconstruction loss is expressed as:
L_rec = (1/N)·Σ_i ‖φ̂_i − φ_i‖²
where φ̂_i is the reconstructed category embedding and φ_i is the original category embedding.
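A minimal PyTorch sketch of such a bidirectional embedding transfer module is given below, assuming a bias-free linear layer whose transposed weight matrix serves as the reverse mapping; the dimensions and names are illustrative.

```python
# Sketch of the bidirectional embedding transfer module: a forward linear map into the
# visual space, a reverse map built from the transposed weight, and an MSE reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalTransfer(nn.Module):
    def __init__(self, embed_dim: int, visual_dim: int):
        super().__init__()
        self.forward_fc = nn.Linear(embed_dim, visual_dim, bias=False)

    def forward(self, phi: torch.Tensor):
        projected = self.forward_fc(phi)                    # category embedding -> visual space
        reconstructed = projected @ self.forward_fc.weight  # reverse layer = transposed weights
        return projected, reconstructed

module = BidirectionalTransfer(embed_dim=9, visual_dim=512)
phi = torch.randn(4, 9)                                     # 4 category embeddings
projected, reconstructed = module(phi)
rec_loss = F.mse_loss(reconstructed, phi)                   # reconstruction loss (MSE)
```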
S4, matching the visual features of the Chinese character text image against the Chinese character category embedding with a distance-based CTC decoder and outputting the recognition result of the Chinese character text image, which specifically includes:
The distance-based CTC decoder, shown in FIG. 5, decodes the visual features by matching the most similar category embeddings. First, the distance between the visual features and the category embeddings is computed with a cosine similarity function, expressed as:
d(V, Φ') = (V·Φ')/(‖V‖·‖Φ'‖)
where V denotes the visual features and Φ' denotes the mapped category embeddings. Computing the cosine similarity between the visual features and every category embedding yields a cosine similarity matrix d(V, Φ'), in which d_ij is the cosine similarity between position i of the visual feature sequence and category embedding j. Taking the maximum of each row of this matrix gives a prediction for every position of the visual feature sequence. These per-position predictions are not aligned one-to-one with the ground-truth labels, however, so the CTC loss function is used to optimize this process and align the predictions with the labels. Substituting the cosine similarity matrix into the CTC loss function as the optimization target of the network yields the distance-based CTC loss, expressed as:
where l_i is a label and α is a learnable parameter that adjusts the magnitude of the cosine similarity. Finally, after repeated labels and blank labels are removed from the output sequence, the recognition result of the Chinese character text image is obtained.
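The decoding and loss computation can be sketched as follows, assuming PyTorch's built-in nn.CTCLoss and treating the α-scaled cosine similarities as class scores; the shapes, the blank index, and the initial value of α are illustrative assumptions.

```python
# Sketch of the distance-based CTC loss: cosine similarities between the visual feature
# sequence and the projected category embeddings, scaled by a learnable alpha, are used
# as class scores and fed to the standard CTC loss. Shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distance_ctc_loss(visual, projected_phi, targets, target_lengths, alpha):
    # visual: (B, T, C) feature sequence; projected_phi: (K, C) embeddings incl. index-0 blank
    sim = F.cosine_similarity(visual.unsqueeze(2),
                              projected_phi.unsqueeze(0).unsqueeze(0), dim=-1)   # (B, T, K)
    log_probs = F.log_softmax(alpha * sim, dim=-1)
    input_lengths = torch.full((visual.size(0),), visual.size(1), dtype=torch.long)
    return nn.CTCLoss(blank=0)(log_probs.permute(1, 0, 2), targets,
                               input_lengths, target_lengths)

alpha = nn.Parameter(torch.tensor(10.0))        # learnable similarity scale
visual = torch.randn(2, 8, 512)                 # from the text encoder
projected_phi = torch.randn(9, 512)             # from the transfer module
targets = torch.tensor([[3, 4, 5], [6, 7, 0]])  # padded label sequences
loss = distance_ctc_loss(visual, projected_phi, targets, torch.tensor([3, 2]), alpha)
loss.backward()
```

At inference time the same α-scaled similarities are simply argmaxed at each position, and repeated labels and blanks are collapsed to produce the final transcription, as described above.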
The categories contained in the training set are referred to as seen categories, and the categories not contained in the training set as unseen categories. The goal of zero-shot recognition is to enable the network to recognize unseen categories by learning the characteristics of the seen categories.
By matching category embeddings, the invention achieves zero-shot Chinese character text recognition and is suitable both for long Chinese text recognition and for zero-shot Chinese character recognition.
The above embodiments merely illustrate preferred embodiments of the present invention, and the scope of the invention is not limited to them; various modifications and improvements made by those skilled in the art without departing from the spirit of the invention fall within the scope of the invention as defined by the appended claims.

Claims (7)

1. A zero-shot Chinese character recognition method based on matching category embedding, characterized by comprising the following steps:
extracting visual features from a Chinese character text image;
performing category embedding of the Chinese character classes: hierarchically decomposing each character into its components with a hierarchical-decomposition embedding algorithm and computing the corresponding embedding vector;
the hierarchical-decomposition embedding algorithm specifically comprising:
obtaining the components and structures of a Chinese character from its ideographic description sequence; then embedding the components and structures of the character according to the embedding function to obtain the corresponding category embedding, where the function is expressed as formula (1):
φ = Σ_{n_i∈R} v_{n_i}·y_{n_i} + λ·Σ_{n_j∈S} v_{n_j}·y_{n_j} (1)
where n_i denotes a component in the component set R, n_j denotes a structure in the structure set S, y_n is the one-hot encoding vector of a component or structure, λ is a hyperparameter set to 0.5, and v_n is the influence factor of a component or structure, computed by formula (2):
where α and β are hyperparameters set to 0.5 and 0.001 respectively, p_i denotes a node on the path from the root node to a leaf node, and l is the length of the path;
mapping the category embedding of the Chinese character classes into a visual space with a bidirectional embedding transfer module, so that the dimension of the category embedding equals the dimension of the visual space while the original information of the Chinese character classes is preserved;
matching the visual features of the Chinese character text image against the category embedding information through a distance-based CTC decoder, and outputting the final recognition result of the Chinese character text image;
the specific operations of the distance-based CTC decoder comprising:
computing the distance between the visual features and the Chinese character category embeddings with a cosine similarity function, expressed as:
d(V, Φ') = (V·Φ')/(‖V‖·‖Φ'‖)
where V denotes the visual features and Φ' denotes the mapped category embeddings; after the cosine similarity between the visual features and the category embeddings is computed, it is substituted into a distance-based CTC loss function, which serves as the optimization target of the network;
the distance-based CTC loss function being expressed as:
where l_i is a label and α is a learnable parameter that adjusts the magnitude of the cosine similarity.
2. The zero-shot Chinese character recognition method based on matching category embedding according to claim 1, characterized in that a text encoder based on a convolutional neural network is used to extract the visual features of the Chinese character text image.
3. The zero-shot Chinese character recognition method based on matching category embedding according to claim 2, characterized in that extracting the visual features of the Chinese character text image with the convolutional-neural-network text encoder specifically comprises:
using a ResNet18 model as the backbone network, removing its last fully connected layer, and replacing the final global average pooling layer with pooling over the feature-map height only, so that the output feature map has a height of 1 while its width is unchanged.
4. The zero-shot Chinese character recognition method based on matching category embedding according to claim 3, characterized in that a dropout strategy with the dropout probability set to 0.3 is applied at the output of the last convolutional layer of the backbone network to prevent the network from overfitting.
5. The zero-shot Chinese character recognition method based on matching category embedding according to claim 1, characterized in that the bidirectional embedding transfer module consists of a forward fully connected layer and a reverse fully connected layer, and the two fully connected layers share parameters.
6. The zero-shot Chinese character recognition method based on matching category embedding according to claim 5, characterized in that the forward fully connected layer maps the category embedding of the Chinese characters into the visual space so that the dimension of the category embedding equals the dimension of the visual features of the text image.
7. The zero-shot Chinese character recognition method based on matching category embedding according to claim 5, characterized in that the reverse fully connected layer is formed from the transpose of the parameter matrix of the forward fully connected layer; the category embedding can be reconstructed through the reverse fully connected layer, and a reconstruction loss function computes the mean square error between the reconstructed category embedding and the original category embedding, so that the category embedding mapped into the visual space retains its original information.
CN202111038228.6A 2021-09-06 2021-09-06 Zero-shot Chinese character recognition method based on matching category embedding Active CN113723421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111038228.6A CN113723421B (en) Zero-shot Chinese character recognition method based on matching category embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111038228.6A CN113723421B (en) Zero-shot Chinese character recognition method based on matching category embedding

Publications (2)

Publication Number Publication Date
CN113723421A (en) 2021-11-30
CN113723421B 2023-10-17

Family

ID=78681933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111038228.6A Active CN113723421B (en) Zero-shot Chinese character recognition method based on matching category embedding

Country Status (1)

Country Link
CN (1) CN113723421B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570456A (en) * 2016-10-13 2017-04-19 华南理工大学 Handwritten Chinese character recognition method based on full-convolution recursive network
CN108399421A (en) * 2018-01-31 2018-08-14 南京邮电大学 A kind of zero sample classification method of depth of word-based insertion
CN112508108A (en) * 2020-12-10 2021-03-16 西北工业大学 Zero-sample Chinese character recognition method based on etymons
CN112633431A (en) * 2020-12-31 2021-04-09 西北民族大学 Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC

Also Published As

Publication number Publication date
CN113723421A (en) 2021-11-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant