CN114529917A

CN114529917A - Zero-sample Chinese single character recognition method, system, device and storage medium

Info

Publication number: CN114529917A
Application number: CN202210095194.2A
Authority: CN
Inventors: 黄宇浩; 毛慧芸; 周伟英
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2022-05-24
Anticipated expiration: 2042-01-26
Also published as: CN114529917B

Abstract

The invention discloses a zero-sample Chinese single-character recognition method, system, device and storage medium. The method comprises the following steps: extracting visual features of a Chinese single-character image; performing learnable category coding on Chinese single-character categories by decomposing the component structure of each character and calculating a learnable category code; mapping the category codes into the visual space and constraining their semantic consistency before and after mapping through a reconstruction loss function; and matching the category codes with the visual features of the image through a transformer-based decoder, obtaining the features related to the category codes from the visual features, and finally decoding and outputting the recognition result. Through the learnable category coding method, the invention realizes zero-sample Chinese single-character recognition and solves the problem that traditional Chinese single-character recognition methods depend on a large amount of labeled data. The invention can be widely applied in the technical field of pattern recognition and artificial intelligence.

Description

A zero-sample Chinese single-character recognition method, system, device and storage medium

Technical Field

The invention relates to the technical field of pattern recognition and artificial intelligence, and in particular to a zero-sample (zero-shot) Chinese single-character recognition method, system, device and storage medium.

Background

Chinese is one of the oldest writing systems in the world, with a history of several thousand years. Research on Chinese single-character recognition is of great value for the digital preservation of ancient books. Current Chinese single-character recognition methods rely mainly on data-driven deep learning, which requires a large number of labeled training samples. However, the number of Chinese character categories is huge: according to the GB18030-2005 standard there are 70,224 Chinese characters, and labeling sufficient data for every one of them is difficult, time-consuming and costly. In recent years, related work has tried to address this problem with zero-sample recognition methods based on component decoding or component coding. However, component-decoding methods require long decoding times and post-processing, while component-coding methods use manually designed codes that cannot be flexibly adjusted to different data.

Summary of the Invention

In order to solve, at least to some extent, one of the technical problems in the prior art, the purpose of the invention is to provide a zero-sample Chinese single-character recognition method, system, device and storage medium.

The technical scheme adopted by the invention is:

A zero-sample Chinese single-character recognition method, comprising the following steps:

extracting visual features of a Chinese single-character image;

performing learnable category coding on Chinese single-character categories: using a depth-first search algorithm to decompose the component structure of each Chinese character, and calculating a learnable category code;

mapping the category codes of the Chinese characters into the visual space: a mapping module based on a fully connected layer makes the dimension of the category codes equal to the dimension of the visual space, and a reconstruction loss function constrains the semantic consistency of the category codes before and after mapping;

matching the category codes of the Chinese characters with the visual features of the image through a transformer-based decoder, obtaining the features related to the category codes from the visual features of the image, and finally decoding and outputting the recognition result of the Chinese character.

Further, extracting the visual features of the Chinese single-character image comprises:

using an image encoder based on a densely connected convolutional neural network to extract the visual features of the Chinese single-character image.

Further, the image encoder uses the DenseNet121 model as the backbone network for extracting the visual features of the image;

the backbone network uses 8-fold downsampling and, so that the output visual features can be better matched with the category codes, omits the final activation layer and the global average pooling layer.

Further, performing learnable category coding on Chinese single-character categories, using a depth-first search algorithm to decompose the component structure of each Chinese character, and calculating the learnable category code comprises:

according to a Chinese ideographic description sequence (IDS) dictionary, obtaining the component sequence of the decomposed Chinese character through the depth-first search algorithm; the component sequence is represented as a tree data structure, from which the depth information and relative position information of each component are obtained, where the depth information is the depth of the component in the tree and the relative position information is the position of the component relative to its parent node;

calculating the learnable category code corresponding to each Chinese character, as expressed by formula (1),

where i denotes a component in the component sequence R, l_i denotes the depth information of that component, γ_i denotes its relative position information, α and β are learnable parameters, and y_i is the one-hot encoding of the component;

concatenating, along the feature dimension, the calculated learnable category code with the depth information and relative position information of each component to obtain the final learnable category code, as expressed by formula (2), in which the depth information and relative position information are normalized before the concatenation operation is applied.

Further, the mapping module based on a fully connected layer consists of a single fully connected layer, each output element of which is obtained from the input elements through a linear operation;

the fully connected layer maps the category codes of the Chinese characters into the visual space, so that the dimension of the category codes equals the dimension of the visual space.

Further, the reconstruction loss function calculates the mean squared error between the category codes before and after mapping, as expressed by formula (3),

where L_re is the reconstruction loss, φ(y_i) denotes the category code before mapping and is compared with the category code after mapping, b and w^T are the bias and the transposed weight of the fully connected layer, and N is the number of category codes.

Further, the specific operation of the transformer-based decoder comprises:

using a multi-head attention mechanism to match the category codes of the Chinese characters with the visual features of the image and to obtain the features related to the category codes from the visual features of the image, as expressed by formula (4):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where multi-head attention is implemented by MultiHead(Q, K, V) and attention is calculated by Attention(Q, K, V); Q denotes the category codes of the Chinese characters; K and V denote the visual features of the image; W_i^Q, W_i^K and W_i^V are learnable projection matrices; d_k denotes the dimension of Q, K and V; and W^O denotes the parameters of the multi-head attention;

after the features related to the category codes are obtained, decoding them with a feedforward neural network consisting of three fully connected layers, which finally outputs the recognition result of the Chinese character; in the training stage of the feedforward neural network, a cross-entropy loss function is used as the optimization objective of the network, expressed as:

$$L_{CE} = -\sum_{i=1}^{k} p_i \log q_i$$

where p_i is the label probability of category i, q_i is the predicted probability of category i, and k is the total number of categories.

Another technical scheme adopted by the invention is:

A zero-sample Chinese single-character recognition system, comprising:

a feature extraction module, used to extract visual features of a Chinese single-character image;

a category coding module, used to perform learnable category coding on Chinese single-character categories, using a depth-first search algorithm to decompose the component structure of each Chinese character and calculating a learnable category code;

an information mapping module, used to map the category codes of the Chinese characters into the visual space, where a mapping module based on a fully connected layer makes the dimension of the category codes equal to the dimension of the visual space and a reconstruction loss function constrains the semantic consistency of the category codes before and after mapping;

an information matching module, used to match the category codes of the Chinese characters with the visual features of the image through a transformer-based decoder, obtain the features related to the category codes from the visual features of the image, and finally decode and output the recognition result of the Chinese character.

A zero-sample Chinese single-character recognition device, comprising:

at least one processor;

at least one memory for storing at least one program;

when the at least one program is executed by the at least one processor, the at least one processor implements the method described above.

A computer-readable storage medium storing a processor-executable program which, when executed by a processor, performs the method described above.

The beneficial effects of the invention are as follows: through a learnable category coding method, the invention realizes zero-sample Chinese single-character recognition and solves the problem that existing Chinese single-character recognition methods depend on a large amount of labeled data, which takes time and money to annotate.

Description of the Drawings

To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the accompanying drawings are introduced below. It should be understood that the drawings described below only illustrate some embodiments of the technical solutions of the invention for convenience and clarity; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the steps of a zero-sample Chinese single-character recognition method based on learnable category coding in an embodiment of the invention;

FIG. 2 is a schematic diagram of the image encoder in an embodiment of the invention;

FIG. 3 is a schematic diagram of the learnable category coding in an embodiment of the invention;

FIG. 4 is a schematic diagram of the mapping module based on a fully connected layer in an embodiment of the invention;

FIG. 5 is a schematic diagram of the transformer-based decoder in an embodiment of the invention.

Detailed Description

Embodiments of the invention are described in detail below, examples of which are illustrated in the accompanying drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting it. The step numbers in the following embodiments are set only for convenience of description and do not limit the order of the steps; the execution order of the steps can be adapted according to the understanding of those skilled in the art.

In the description of the invention, it should be understood that orientation descriptions, such as the orientations or positional relationships indicated by up, down, front, rear, left and right, are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the invention.

In the description of the invention, "several" means one or more and "a plurality" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, while "above", "below", "within" and the like are understood as including it. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their order.

In the description of the invention, unless otherwise expressly limited, terms such as "set", "install" and "connect" are to be understood broadly, and those skilled in the art can reasonably determine their specific meaning in the invention in light of the specific content of the technical solution.

As shown in FIG. 1, this embodiment provides a zero-sample Chinese single-character recognition method based on learnable category coding. The learnable category code replaces manually designed codes, so that the coding can be flexibly adjusted to different data. The method includes the following steps:

S1. Extract the visual features of a Chinese single-character image using an image encoder based on a densely connected convolutional neural network, specifically:

The DenseNet121 model is used as the backbone network to extract the visual features of the Chinese single-character image. As shown in FIG. 2, the image encoder takes the three-channel RGB image of a single character as input and outputs a visual feature map downsampled by a factor of 8. The DenseNet121 model consists of dense blocks and transition modules. So that the output visual features can be better matched with the category codes, the backbone network omits the final activation layer and the global average pooling layer.
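As a rough illustration of the 8-fold downsampling, the following Python sketch computes the spatial size of the encoder output and the number of spatial positions it contains. The 64x64 input size and the helper names are assumptions made for illustration; the source does not state an input resolution, and treating each spatial position as one element the decoder can attend to is one natural reading rather than a stated detail.

```python
from typing import Tuple


def feature_map_shape(height: int, width: int, stride: int = 8) -> Tuple[int, int]:
    """Spatial size of the encoder output for a total downsampling stride."""
    if height % stride or width % stride:
        # Assumption: inputs are resized to a multiple of the stride beforehand.
        raise ValueError("input size is assumed to be a multiple of the stride")
    return height // stride, width // stride


def num_visual_positions(height: int, width: int, stride: int = 8) -> int:
    """Number of spatial positions in the downsampled feature map."""
    h, w = feature_map_shape(height, width, stride)
    return h * w


# A hypothetical 64x64 RGB character image gives an 8x8 map, i.e. 64 positions.
print(feature_map_shape(64, 64))     # (8, 8)
print(num_visual_positions(64, 64))  # 64
```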

S2. Perform learnable category coding on Chinese single-character categories: use a depth-first search algorithm to decompose the component structure of each Chinese character and calculate the learnable category code, specifically:

First, according to a Chinese ideographic description sequence (IDS) dictionary, the component sequence of the decomposed Chinese character is obtained through the depth-first search algorithm. The component sequence can be represented as a tree data structure, as shown in FIG. 3, so the depth information and relative position information of each component can also be obtained: the depth information is the depth of the component in the tree, and the relative position information is the position of the component relative to its parent node. The learnable category code corresponding to each Chinese character is then calculated, as expressed by formula (1),

where i denotes a component in the component sequence R, l_i denotes the depth information of that component, γ_i denotes its relative position information, α and β are learnable parameters whose initial values are set to 0.5 and 0.001 respectively, and y_i is the one-hot encoding of the component.
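The depth-first decomposition described in this step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the toy dictionary `TOY_IDS`, the function name `decompose`, the choice to record structure operators as tree nodes, and the use of a 0-based sibling index as the relative position are all assumptions.

```python
from typing import Dict, Iterator, List, Tuple

# Number of operands taken by each ideographic description character.
IDC_ARITY = {"⿰": 2, "⿱": 2, "⿲": 3, "⿳": 3, "⿴": 2, "⿵": 2,
             "⿶": 2, "⿷": 2, "⿸": 2, "⿹": 2, "⿺": 2, "⿻": 2}

# Hypothetical toy dictionary; a real IDS dictionary covers every character.
TOY_IDS: Dict[str, str] = {"湖": "⿰氵胡", "胡": "⿰古月", "古": "⿱十口"}

Node = Tuple[str, int, int]  # (symbol, depth in tree, position under parent)


def decompose(char: str, ids: Dict[str, str]) -> List[Node]:
    """Depth-first expansion of `char` into (symbol, depth, position) triples."""
    out: List[Node] = []

    def visit_symbol(sym: str, depth: int, pos: int) -> None:
        if sym in ids:
            # The symbol can be decomposed further: parse its own sequence.
            visit_tokens(iter(ids[sym]), depth, pos)
        else:
            out.append((sym, depth, pos))  # leaf component

    def visit_tokens(tokens: Iterator[str], depth: int, pos: int) -> None:
        sym = next(tokens)
        if sym in IDC_ARITY:
            out.append((sym, depth, pos))  # structure operator node
            for child_pos in range(IDC_ARITY[sym]):
                visit_tokens(tokens, depth + 1, child_pos)
        else:
            visit_symbol(sym, depth, pos)

    visit_symbol(char, 0, 0)
    return out


for node in decompose("湖", TOY_IDS):
    print(node)
```

For the character 湖 the sketch visits the nodes in the order ⿰, 氵, ⿰, ⿱, 十, 口, 月, with depths 0, 1, 1, 2, 3, 3, 2 and sibling positions 0, 0, 1, 0, 0, 1, 1.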

With the two learnable parameters α and β, the network can continuously adjust the category codes during training so that they match the visual features better. Finally, the learnable category code is concatenated along the feature dimension with the depth information and relative position information of each component to obtain the final learnable category code, as expressed by formula (2), in which the depth information and relative position information are normalized before the concatenation operation is applied. After the depth information and relative position information are concatenated, the category code represents not only which components a character contains but also the depth and relative position of each component, so the category code carries richer information and the recognition accuracy of the network improves.
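The concatenation step can be illustrated with a short Python sketch. Because the exact form of formula (2) is not available in this text, everything specific here is an assumption: the base code is taken as given (it would come from formula (1)), the divide-by-maximum normalization is only one possible scheme, and the layout of the concatenated vector and the names `normalize` and `final_code` are hypothetical.

```python
from typing import List


def normalize(values: List[int]) -> List[float]:
    """Scale values into [0, 1] by dividing by the maximum (assumed scheme)."""
    top = max(values, default=0)
    return [v / top if top else 0.0 for v in values]


def final_code(base_code: List[float], depths: List[int],
               positions: List[int]) -> List[float]:
    """Append normalized depth and position features to a base category code."""
    return list(base_code) + normalize(depths) + normalize(positions)


# Hypothetical character with three components at depths 1, 2, 2
# and sibling positions 0, 0, 1.
code = final_code([0.5, 0.0, 0.5], depths=[1, 2, 2], positions=[0, 0, 1])
print(code)  # [0.5, 0.0, 0.5, 0.5, 1.0, 1.0, 0.0, 0.0, 1.0]
```

The point of the sketch is only that the final code carries structural information alongside the component information, which is what the paragraph above credits for the richer representation.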

S3. Map the category codes of the Chinese characters into the visual space with a mapping module based on a fully connected layer, specifically:

As shown in FIG. 4, the mapping module consists of a single fully connected layer, each output element of which is obtained from the input elements through a linear operation. The mapping module converts the category codes to the same dimension as the visual features of the image and, at the same time, fuses the depth information and relative position information. To keep the category codes semantically consistent before and after the mapping module, a reconstruction loss function is used as a constraint: it calculates the mean squared error between the category codes before and after mapping, as expressed by formula (3),

where L_re is the reconstruction loss, φ(y_i) denotes the category code before mapping and is compared with the category code after mapping, and b and w^T are the bias and the transposed weight of the fully connected layer.
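The following Python sketch shows one possible reading of the mapping module and its reconstruction constraint. The exact form of formula (3) is not available in this text, so the tied-weight reconstruction used here (subtract the bias from the mapped code, then apply the transposed weight) is an assumption, as are the function names and the toy numbers.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # weight[j][i] connects input i to output j


def linear(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Fully connected layer: each output is a linear function of the inputs."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def reconstruction_loss(weight: Matrix, bias: Vector,
                        codes: List[Vector]) -> float:
    """Mean squared error between each code and its reconstruction (assumed form)."""
    w_t = transpose(weight)
    total = 0.0
    for x in codes:
        mapped = linear(weight, bias, x)                  # code in visual space
        centered = [m - b for m, b in zip(mapped, bias)]  # undo the bias
        recon = linear(w_t, [0.0] * len(x), centered)     # back to code space
        total += sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    return total / len(codes)


# Hypothetical layer mapping 2-dimensional codes into a 3-dimensional space.
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.1, 0.2, 0.3]
print(linear(weight, bias, [1.0, 2.0]))  # 3 output values from 2 inputs
print(reconstruction_loss(weight, bias, [[1.0, 2.0], [0.5, -1.0]]))
```

With this particular weight matrix the mapping loses no information, so the loss is zero up to rounding; a weight matrix that distorts the codes produces a positive loss, which is the signal a training procedure would minimize.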

S4. Use a transformer-based decoder to match the category codes of the Chinese characters with the visual features of the image, and decode and output the recognition result of the Chinese character, specifically:

As shown in FIG. 5, the transformer-based decoder uses a multi-head attention mechanism to match the category codes of the Chinese characters with the visual features of the image and to obtain the features related to the category codes from the visual features of the image, as expressed by formula (4):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where multi-head attention is implemented by MultiHead(Q, K, V) and attention is calculated by Attention(Q, K, V); Q denotes the category codes of the Chinese characters; K and V denote the visual features of the image; W_i^Q, W_i^K and W_i^V are learnable projection matrices; d_k denotes the dimension of Q, K and V; and head_i denotes the attention of one head. In the calculation of Attention(Q, K, V), the matrix Q representing the category codes is multiplied by the matrix K representing the image features, which amounts to matching the category codes against the visual features; the softmax function then yields the attention matrix; finally, the attention matrix is multiplied by V, which represents the visual features, to obtain the features related to the category codes.
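The matching performed by Attention(Q, K, V) can be sketched in plain Python. This is a single-head illustration with the learnable projections W_i^Q, W_i^K and W_i^V left out; the scaling by the square root of d_k follows the standard scaled dot-product formulation, and the toy vectors and function names are assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (attended features, attention matrix) for one head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # match category codes against visual features
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Two hypothetical category codes (Q) and three visual positions (K = V).
class_codes = [[1.0, 0.0], [0.0, 1.0]]
visual = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
features, weights = attention(class_codes, visual, visual)
print(weights[0])   # the first code attends mostly to the first position
print(features[0])  # features related to the first category code
```

Each row of the attention matrix sums to 1, so each output row is a weighted average of the visual features, weighted by how well they match the corresponding category code.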

After the features related to the category codes are obtained, they are decoded by a feedforward neural network consisting of three fully connected layers, together with residual connections and layer normalization, which finally outputs the recognition result of the Chinese character. In the network training stage, to fit the distribution of the network predictions to that of the true labels, a cross-entropy loss function is used as the optimization objective of the network, as shown in formula (5):

$$L_{CE} = -\sum_{i=1}^{K} p_i \log q_i$$

where p_i is the label probability of category i, q_i is the predicted probability of category i, and K is the total number of categories.
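As a small worked example of the cross-entropy objective, the following Python sketch evaluates the loss for one sample. The function name, the clipping constant `eps`, and the toy probabilities are assumptions made for illustration.

```python
import math
from typing import List


def cross_entropy(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """Cross-entropy between label probabilities p and predictions q."""
    # Clip predictions away from zero so that log() is always defined.
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


# One-hot label for the second of three hypothetical categories.
label = [0.0, 1.0, 0.0]
good = [0.1, 0.8, 0.1]  # confident and correct
bad = [0.7, 0.2, 0.1]   # confident and wrong
print(cross_entropy(label, good))
print(cross_entropy(label, bad))
```

With a one-hot label the loss reduces to the negative log of the probability assigned to the true category, so the second prediction is penalized more heavily than the first.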

Through learnable category coding, the invention realizes zero-sample Chinese single-character recognition, and the category codes can be adjusted adaptively to different data. In addition, the invention is simple and flexible to implement and can be ported to mainstream text recognition frameworks. In general, compared with the prior art, the method provided by the embodiments of the invention has at least the following beneficial effects:

(1) The invention designs a zero-sample recognition model for Chinese single characters, which solves the problem that existing Chinese single-character recognition methods depend on a large amount of labeled data that takes time and money to annotate, and gives the recognition model better generalization ability; the invention is also simple and flexible to implement and can be ported to mainstream text recognition frameworks.

(2) The invention focuses on zero-sample recognition of Chinese single characters. Compared with existing zero-sample Chinese single-character recognition methods, it uses a learnable category code in place of manually designed codes, so that the coding can be flexibly adjusted to different data.

(3) The invention uses a transformer-based decoder that decodes quickly and requires no post-processing, so it can be conveniently applied in practical scenarios.

This embodiment also provides a zero-sample Chinese single-character recognition system, including:

The zero-sample Chinese single-character recognition system of this embodiment can execute the zero-sample Chinese single-character recognition method provided by the method embodiment of the invention, can execute any combination of the implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.

This embodiment also provides a zero-sample Chinese single-character recognition device, including:

at least one processor;

when the at least one program is executed by the at least one processor, the at least one processor implements the method shown in FIG. 1.

The zero-sample Chinese single-character recognition device of this embodiment can execute the zero-sample Chinese single-character recognition method provided by the method embodiment of the invention, can execute any combination of the implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.

An embodiment of the application also discloses a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device can read the computer instructions from the computer-readable storage medium and execute them, causing the computer device to perform the method shown in FIG. 1.

This embodiment also provides a storage medium storing instructions or a program capable of executing the zero-sample Chinese single-character recognition method provided by the method embodiment of the invention; when the instructions or program are run, any combination of the implementation steps of the method embodiment can be executed, with the corresponding functions and beneficial effects of the method.

在一些可选择的实施例中，在方框图中提到的功能/操作可以不按照操作示图提到的顺序发生。例如，取决于所涉及的功能/操作，连续示出的两个方框实际上可以被大体上同时地执行或所述方框有时能以相反顺序被执行。此外，在本发明的流程图中所呈现和描述的实施例以示例的方式被提供，目的在于提供对技术更全面的理解。所公开的方法不限于本文所呈现的操作和逻辑流程。可选择的实施例是可预期的，其中各种操作的顺序被改变以及其中被描述为较大操作的一部分的子操作被独立地执行。In some alternative implementations, the functions/operations noted in the block diagrams may occur out of the order noted in the operational diagrams. For example, two blocks shown in succession may, in fact, be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/operations involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more comprehensive understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of the various operations are altered and in which sub-operations described as part of larger operations are performed independently.

此外，虽然在功能性模块的背景下描述了本发明，但应当理解的是，除非另有相反说明，所述的功能和/或特征中的一个或多个可以被集成在单个物理装置和/或软件模块中，或者一个或多个功能和/或特征可以在单独的物理装置或软件模块中被实现。还可以理解的是，有关每个模块的实际实现的详细讨论对于理解本发明是不必要的。更确切地说，考虑到在本文中公开的装置中各种功能模块的属性、功能和内部关系的情况下，在工程师的常规技术内将会了解该模块的实际实现。因此，本领域技术人员运用普通技术就能够在无需过度试验的情况下实现在权利要求书中所阐明的本发明。还可以理解的是，所公开的特定概念仅仅是说明性的，并不意在限制本发明的范围，本发明的范围由所附权利要求书及其等同方案的全部范围来决定。Furthermore, while the invention is described in the context of functional modules, it is to be understood that, unless stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or or software modules, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to understand the present invention. Rather, given the attributes, functions, and internal relationships of the various functional modules in the apparatus disclosed herein, the actual implementation of the modules will be within the routine skill of the engineer. Accordingly, those skilled in the art, using ordinary skill, can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are illustrative only and are not intended to limit the scope of the invention, which is to be determined by the appended claims along with their full scope of equivalents.

The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.

It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

In the above description of this specification, descriptions referring to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention, the scope of which is defined by the claims and their equivalents.

The above is a specific description of the preferred embodiments of the present invention, but the present invention is not limited to these embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the present application.

Claims

1. A zero-sample Chinese single character recognition method, characterized by comprising the following steps:

extracting visual features of a Chinese single character image;

performing learnable category coding on Chinese single character categories, wherein a depth-first search algorithm is used to decompose the component structure of each Chinese single character, and the learnable category code is calculated;

mapping the category code of the Chinese single character into the visual space by means of a mapping module based on a fully connected layer, so that the dimension of the category code equals the dimension of the visual space, and constraining the semantic consistency of the category code before and after the mapping by a reconstruction loss function;

matching the category code of the Chinese single character with the visual features of the image through a transformer-based decoder, obtaining the features related to the category code from the visual features of the image, and finally decoding and outputting the recognition result of the Chinese single character.

2. The zero-sample Chinese single character recognition method according to claim 1, characterized in that extracting the visual features of the Chinese single character image comprises:

extracting the visual features of the Chinese single character image with an image encoder based on a densely connected convolutional neural network.

3. The zero-sample Chinese single character recognition method according to claim 2, characterized in that the image encoder adopts the DenseNet121 model as its backbone network for extracting the visual features of the image;

the backbone network uses 8-fold downsampling, and, so that the output visual features better match the category codes, the final output activation layer and the global average pooling layer of the backbone network are removed.
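As a rough illustration of what 8-fold downsampling implies for the size of the feature map, the following minimal sketch computes the spatial grid that an input image would produce. The function name `feature_grid`, the 64×64 input size, and the requirement that the input size be divisible by 8 are assumptions made for illustration; the claim does not specify them. Because the global average pooling layer is removed, the grid is presumably kept as a sequence of spatial positions that the decoder can attend to.

```python
def feature_grid(height, width, factor=8):
    """Return (rows, cols, sequence_length) of the feature map after downsampling.

    factor=8 follows claim 3; requiring divisible input sizes is a
    simplification made for this sketch, not something the claim states.
    """
    if height % factor or width % factor:
        raise ValueError("input size must be divisible by the downsampling factor")
    rows, cols = height // factor, width // factor
    # Without global average pooling, every grid cell remains a separate feature.
    return rows, cols, rows * cols


if __name__ == "__main__":
    # Hypothetical 64x64 character image.
    print(feature_grid(64, 64))
```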

4. The zero-sample Chinese single character recognition method according to claim 1, characterized in that performing learnable category coding on Chinese single character categories, using the depth-first search algorithm to decompose the component structure of each Chinese single character, and calculating the learnable category code comprises:

obtaining, according to the Chinese ideographic sequence dictionary and through the depth-first search algorithm, the component sequence of the decomposed Chinese single character, representing the component sequence as a tree data structure, and obtaining the depth information and relative position information of each component, wherein the depth information represents the depth of the component in the tree and the relative position information represents the position of the component relative to its parent node;

calculating the learnable category code corresponding to each Chinese single character, the calculation being expressed as formula (1):

where the subscript $i$ identifies a component in the component sequence $R$, $l_i$ denotes the depth information of that component, $\gamma_i$ denotes its relative position information, $\alpha$ and $\beta$ are learnable parameters, and $y_i$ is the one-hot encoding of the component;

concatenating, along the feature dimension, the calculated learnable category code with the depth information and relative position information of each component to obtain the final learnable category code, the calculation being expressed as formula (2):

where

and

denote the normalized depth information and relative position information, and

denotes the concatenation operation.
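The decomposition step of claim 4 can be sketched as a recursive depth-first traversal. This is a minimal sketch under stated assumptions, not the patented implementation: the dictionary `IDS` is a hypothetical three-entry stand-in for the ideographic sequence dictionary, the function name `decompose` is invented for illustration, and the relative position is assumed to be the index of a node among its parent's children, which the claim does not specify.

```python
# Toy stand-in for the ideographic sequence dictionary (hypothetical entries).
# Each entry maps a character to (structure operator, list of child components).
IDS = {
    "好": ("⿰", ["女", "子"]),
    "想": ("⿱", ["相", "心"]),
    "相": ("⿰", ["木", "目"]),
}


def decompose(char, depth=0, position=0):
    """Depth-first traversal returning (component, depth, relative position) triples.

    depth:    depth of the node in the decomposition tree (root = 0).
    position: index of the node among its parent's children (assumed encoding).
    """
    if char not in IDS:
        # Leaf: a primitive component that is not decomposed further.
        return [(char, depth, position)]
    operator, children = IDS[char]
    # The structure operator takes the place of the node it decomposes.
    sequence = [(operator, depth, position)]
    for index, child in enumerate(children):
        sequence.extend(decompose(child, depth + 1, index))
    return sequence


if __name__ == "__main__":
    for component, depth, position in decompose("想"):
        print(component, depth, position)
```

Because the traversal only needs dictionary entries for components, a character never seen in training images can still be turned into a component sequence, which is what makes the category code usable in the zero-sample setting.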

5. The zero-sample Chinese single character recognition method according to claim 1, characterized in that the mapping module based on a fully connected layer consists of one fully connected layer, each output element of the fully connected layer being obtained from the input elements by a linear operation;

the fully connected layer maps the category codes of the Chinese single characters into the visual space, so that the dimension of the category codes equals the dimension of the visual space.

6. The zero-sample Chinese single character recognition method according to claim 5, characterized in that the reconstruction loss function computes the mean square error between the category codes before and after the mapping, the calculation being expressed as formula (3):

where $L_{re}$ is the reconstruction loss function,
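The mapping of claim 5 and the mean-square-error constraint of claim 6 can be sketched as follows. This is a minimal sketch, not the patented formula (3), which is not reproduced in this text. In particular, the second linear layer (`w_back`, `b_back`) that maps the code back to its original dimension before the comparison is an assumption, as are the dimensions, the random initialization, and all function names.

```python
import random


def linear(x, weight, bias):
    """Fully connected layer: each output element is a linear function of the inputs."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def init(rows, cols, rng):
    """Small random weight matrix with `rows` outputs and `cols` inputs."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
code_dim, visual_dim = 6, 4  # illustrative sizes, not taken from the patent

# Forward mapping of the category code into the visual space (claim 5).
w_map, b_map = init(visual_dim, code_dim, rng), [0.0] * visual_dim
# ASSUMPTION: a second linear layer maps back so both codes share a dimension.
w_back, b_back = init(code_dim, visual_dim, rng), [0.0] * code_dim

class_code = [rng.random() for _ in range(code_dim)]
mapped = linear(class_code, w_map, b_map)          # lives in the visual space
reconstructed = linear(mapped, w_back, b_back)     # back in the code space
loss_re = mse(class_code, reconstructed)           # reconstruction loss (assumed form)
```

Minimizing `loss_re` during training would discourage the mapping from discarding the component information carried by the category code.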

7. The zero-sample Chinese single character recognition method according to claim 1, characterized in that the specific operations of the transformer-based decoder comprise:

using a multi-head attention mechanism to match the category codes of the Chinese single characters with the visual features of the image and to obtain the features related to the category codes from the visual features of the image, the calculation being expressed as formula (4):

$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^O$

where multi-head attention is computed by $\mathrm{MultiHead}(Q,K,V)$ and attention is computed by $\mathrm{Attention}(Q,K,V)$; $Q$ denotes the category codes of the Chinese single characters; $K$ and $V$ denote the visual features of the image;

are all learnable projection matrices, $d_k$ denotes the dimension of $Q$, $K$, and $V$, and $W^O$ denotes the parameters of the multi-head attention;

after the features related to the category codes are obtained, decoding the features with a feedforward neural network consisting of three fully connected layers, which finally outputs the recognition result of the Chinese single character; in the training stage of the feedforward neural network, the cross-entropy loss function is adopted as the optimization objective of the network, and the cross-entropy loss function is expressed as:

where $p_i$ is the label probability of class $i$, $q_i$ is the predicted probability of class $i$, and $k$ is the total number of classes.
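The matching step of claim 7 can be illustrated with the standard Transformer formulation, which the symbol definitions above appear to follow: $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(QK^T/\sqrt{d_k})V$, with each head applying its own projection matrices before attention. The sketch below is written under that assumption and is not taken from the patent's formula images. The function names, the tiny matrices, and the small `eps` added inside the logarithm are illustrative choices. The cross-entropy is written in its usual form, the negative sum over the $k$ classes of $p_i \log q_i$.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(row) for row in scores], v)


def multi_head(q, k, v, heads, w_o):
    """Concatenate the per-head outputs and apply the output projection W^O.

    heads: list of (W_q, W_k, W_v) projection matrices, one triple per head.
    """
    outputs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
               for wq, wk, wv in heads]
    concat = [sum((head[i] for head in outputs), []) for i in range(len(q))]
    return matmul(concat, w_o)


def cross_entropy(p, q_pred, eps=1e-12):
    """Cross-entropy -sum_i p_i * log(q_i) over the k classes."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q_pred))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    class_codes = [[1.0, 0.0], [0.0, 1.0]]          # Q: two category codes
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # K = V: three feature positions
    w_o = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
    print(multi_head(class_codes, visual, visual, [(identity,) * 3] * 2, w_o))
    print(cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1]))
```

Using the category codes as queries and the image features as keys and values means each output row is a weighted mixture of image features selected by one category code, which matches the claim's description of obtaining the features related to the category code from the visual features.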

8. A zero-sample Chinese single character recognition system, characterized by comprising:

a feature extraction module, configured to extract visual features of a Chinese single character image;

a category coding module, configured to perform learnable category coding on Chinese single character categories, wherein a depth-first search algorithm is used to decompose the component structure of each Chinese single character, and the learnable category code is calculated;

an information mapping module, configured to map the category code of the Chinese single character into the visual space by means of a mapping module based on a fully connected layer, so that the dimension of the category code equals the dimension of the visual space, and to constrain the semantic consistency of the category code before and after the mapping by a reconstruction loss function;

an information matching module, configured to match the category code of the Chinese single character with the visual features of the image through a transformer-based decoder, obtain the features related to the category code from the visual features of the image, and finally decode and output the recognition result of the Chinese single character.

9. A zero-sample Chinese single character recognition device, characterized by comprising:

at least one processor;

at least one memory for storing at least one program;

wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the method of any one of claims 1-7.

10. A computer-readable storage medium storing a processor-executable program, characterized in that the processor-executable program, when executed by a processor, is used to perform the method of any one of claims 1-7.