CN112632911B - Chinese character coding method based on character embedding - Google Patents
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
Abstract
The invention relates to a Chinese character coding method based on character embedding, which comprises the following steps: step S1: constructing a Chinese character set, decomposing each character into several substructures, constructing a substructure set, defining the contribution degree of each substructure to each character, and building the contribution matrix of the substructures to each character from the substructure set; step S2: constructing substructure embedding matrices according to the obtained substructure set and contribution matrix, training them, and extracting the character embedding matrix; step S3: inputting a character and acquiring its character embedding through the character embedding matrix. The invention can effectively reduce the dimension of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.
Description
Technical Field
The invention relates to the field of pattern recognition and computer vision, in particular to a Chinese character coding method based on character embedding.
Background
Language is one of the main ways humans transmit information, and written words are among the most widespread ways humans transmit information visually.
With the rapid development of technologies such as artificial intelligence and the Internet, automatic recognition of text in images by computer is of great significance. For character recognition tasks, characters are usually encoded by one-hot coding. This coding ignores the correlation among similar characters and is sparse; for recognizing English letters and digits it still works well because the number of categories is small. For Chinese character recognition, however, there are thousands of common characters, so one-hot coding slows network convergence and completely ignores the structural similarity between Chinese characters, resulting in low accuracy and low efficiency of character recognition.
Disclosure of Invention
In view of the above, the present invention provides a Chinese character coding method based on character embedding, which can effectively reduce the dimensionality of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.
In order to achieve the purpose, the invention adopts the following technical scheme:
a Chinese character coding method based on character embedding comprises the following steps:
step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to the character, and constructing a substructure contribution degree matrix to each character according to the substructure set;
step S2: constructing a substructure embedding matrix and training according to the obtained substructure set and the contribution matrix of the substructure to each character, and extracting to obtain a character embedding matrix;
step S3: inputting characters, and acquiring character embedding through a character embedding matrix.
Further, the step S1 is specifically:
step S11: determining the character set to be coded; the ia-th Chinese character is char_ia, and n_chars Chinese characters need to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};
step S12: splitting all Chinese characters in chars to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;
step S13: calculating the substructure frequency table nfreqparts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes the number of characters of which part_ib is a substructure;
step S14: because the split with k = 1 yields the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);
step S15: calculating the contribution degree of each substructure in parts to each character in chars to obtain the contribution matrix charparts with n_parts rows and n_chars columns.
Further, the step S12 is specifically:
(1) presetting that each Chinese character can be split into k substructures;
(2) k is an integer not less than 1, and when k = 1 the split result is the character itself;
(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;
splitting all Chinese characters in chars according to (1)-(3) yields all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.
Further, the step S15 is specifically:
(1) when a Chinese character is split into k parts, the contribution degree of each split substructure to the character is 1/k;
(2) when one substructure appears in several split results of the same character, the contribution degree is calculated with the split of minimal k;
(3) if a substructure cannot be obtained from any split of a character, its contribution degree to that character is 0;
calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) yields the contribution matrix charparts with n_parts rows and n_chars columns.
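Rules (1)-(3) can be sketched in a few lines of Python. The toy characters and splits below are hypothetical placeholders (the patent's real character set is not given), and reading rule (1) as a contribution of 1/k is an assumption, since the formula itself is an image in the original:

```python
import numpy as np

# Hypothetical toy data: two "characters" and their splits, named with
# placeholder ASCII ids; each character lists its k=1 split and any deeper splits.
splits = {
    "hao": [["hao"], ["nv", "zi"]],   # the k=1 split and one k=2 split
    "ma":  [["ma"]],                  # only the k=1 split
}
chars = sorted(splits)                          # chars = {char_ia}
parts = sorted({p for ss in splits.values() for s in ss for p in s})
p_idx = {p: i for i, p in enumerate(parts)}     # substructure index ib

# charparts: n_parts rows, n_chars columns
charparts = np.zeros((len(parts), len(chars)))
for j, ch in enumerate(chars):
    best_k = {}                                 # substructure -> minimal k, rule (2)
    for split in splits[ch]:
        for p in split:
            best_k[p] = min(best_k.get(p, len(split)), len(split))
    for p, k in best_k.items():
        charparts[p_idx[p], j] = 1.0 / k        # rule (1), read as 1/k (assumed)
# rule (3): entries for substructures never split out of a character stay 0
```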
Further, the step S2 is specifically:
step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedded vectors;
step S22: encoding each substructure in parts as a one-hot code, the code of part_ib being ponehot_ib, so that the one-hot codes of all substructures are ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};
step S23: for the ib-th substructure, taking ponehot_ib as the central substructure with probability f(nfreq_ib), where f is computed from the frequency nfreq_ib using the minimum function min and a manually set parameter α; a window of size t is then set, where t is a manually set positive integer; the ib-th row of charparts is taken as the probability distribution over characters, t characters are drawn from it, and their character numbers are mapped to substructure numbers through the mapping g and placed in the window as the related substructures; finally, r substructures are drawn at random as the unrelated substructures, where r is a manually set positive integer;
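The sampling of step S23 can be sketched as follows. The concrete form f(nfreq) = min(alpha/nfreq, 1) is an assumption consistent with the min function and parameter alpha mentioned in the text (the exact formula is an image in the original), and all sizes, frequencies, and the mapping g are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_parts, n_chars = 6, 4
alpha, t, r = 100.0, 3, 5                  # manually set parameters (toy values)
nfreq = np.array([1, 2, 400, 5, 3, 7])     # hypothetical frequency table

def f(nf):
    # Assumed keep probability: subsample very frequent substructures.
    return min(alpha / nf, 1.0)

charparts = rng.random((n_parts, n_chars)) # toy contribution matrix
g = np.array([0, 1, 3, 4])                 # hypothetical mapping: char ia -> part g(ia)

ib = 2
keep_prob = f(nfreq[ib])                   # 0.25: this frequent part is kept 1 time in 4
p = charparts[ib] / charparts[ib].sum()    # row ib as a distribution over characters
drawn = rng.choice(n_chars, size=t, p=p)   # t characters for the window
related = g[drawn]                         # mapped to substructure numbers
unrelated = rng.choice(n_parts, size=r)    # r random negative substructures
```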
step S24: the computation that embeds a one-hot code into a vector through a substructure embedding matrix is:
emb = ponehot × embs_parts
where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the embedded vector; the one-hot code of the central substructure is embedded into the vector emb1 through embs1;
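Step S24 is an ordinary embedding lookup: multiplying a one-hot row vector by the matrix selects the corresponding row of it. A minimal sketch with toy sizes:

```python
import numpy as np

n_parts, m = 5, 4                          # toy sizes; m is the embedding dimension
rng = np.random.default_rng(1)
embs1 = rng.standard_normal((n_parts, m))  # substructure embedding matrix

ib = 3
ponehot = np.zeros(n_parts)
ponehot[ib] = 1.0                          # one-hot code of the ib-th substructure
emb1 = ponehot @ embs1                     # emb = ponehot x embs: selects row ib
```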
step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedded vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedded vectors;
step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedded vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedded vectors;
step S27: the loss is calculated and the network is optimized using the following formula:
loss = -Σ_ic logsigmoid(emb2p_ic^T · emb1) - Σ_id logsigmoid(-emb2n_id^T · emb1)
where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the logsigmoid function is:
logsigmoid(x) = log(1 / (1 + e^(-x)))
where x is the argument, e is the natural constant, and log is the logarithm with base e;
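The loss of step S27, read as the standard skip-gram negative-sampling objective (an assumption consistent with the transposes and logsigmoid terms described), can be sketched with random toy vectors:

```python
import numpy as np

def logsigmoid(x):
    # log(1 / (1 + e^{-x})), computed stably via log-add-exp
    return -np.logaddexp(0.0, -x)

def sgns_loss(emb1, emb2ps, emb2ns):
    # loss = -sum_ic logsigmoid(emb2p_ic^T emb1) - sum_id logsigmoid(-emb2n_id^T emb1)
    pos = logsigmoid(emb2ps @ emb1).sum()     # t related (positive) substructures
    neg = logsigmoid(-(emb2ns @ emb1)).sum()  # r unrelated (negative) substructures
    return -(pos + neg)

rng = np.random.default_rng(2)
m, t, r = 8, 3, 5                          # toy dimensions
emb1 = rng.standard_normal(m)
emb2ps = rng.standard_normal((t, m))
emb2ns = rng.standard_normal((r, m))
loss = sgns_loss(emb1, emb2ps, emb2ns)     # each term is positive, so loss > 0
```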
step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed repeatedly until the network converges, and embs1 is taken as the trained substructure embedding matrix;
step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, row ia of embschar corresponding to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).
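The extraction in step S29 reduces to gathering rows of embs1 through g; the mapping and sizes below are hypothetical:

```python
import numpy as np

n_parts, n_chars, m = 7, 3, 4              # toy sizes
rng = np.random.default_rng(3)
embs1 = rng.standard_normal((n_parts, m))  # stands in for the trained matrix

g = np.array([0, 2, 5])                    # hypothetical mapping g: char ia -> part g(ia)
embschar = embs1[g]                        # row ia of embschar = row g(ia) of embs1

ponehots = np.eye(n_parts)                 # substructure one-hot table
conehots = ponehots[g]                     # conehot_ia = ponehot_g(ia)
```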
Further, the step S3 is specifically:
step S31: selecting a Chinese character to be coded;
step S32: encoding the Chinese character to be coded into a one-hot code using conehots;
step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.
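Steps S31-S33 can be sketched end to end; the character list and the pretrained embschar below are placeholders:

```python
import numpy as np

chars = ["char_a", "char_b", "char_c"]        # placeholder character ids
n_chars, m = len(chars), 4
rng = np.random.default_rng(4)
embschar = rng.standard_normal((n_chars, m))  # stands in for the matrix from step S29

def encode(ch):
    ia = chars.index(ch)                      # steps S31/S32: build the one-hot code
    conehot = np.zeros(n_chars)
    conehot[ia] = 1.0
    return conehot @ embschar                 # step S33: embed into a low-dim vector

vec = encode("char_b")
```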
Compared with the prior art, the invention has the following beneficial effects:
the invention can effectively reduce the dimension of Chinese character coding, enables the Chinese character coding with similar structure to have positive correlation, and effectively improves the character recognition efficiency
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to FIG. 1, the present invention provides a Chinese character coding method based on character embedding, comprising the following steps:
step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution of each substructure to the character, and constructing a contribution matrix of the substructures to each character according to the substructure set;
step S2: constructing a substructure embedding matrix and training according to the obtained substructure set and the contribution matrix of the substructure to each character, and extracting to obtain a character embedding matrix;
step S3: inputting characters, and acquiring character embedding through a character embedding matrix.
In this embodiment, the step S1 specifically includes:
step S11: determining the character set to be coded; the ia-th Chinese character is char_ia, and n_chars Chinese characters need to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};
step S12: (1) presetting that each Chinese character can be split into k substructures;
(2) k is an integer not less than 1, and when k = 1 the split result is the character itself;
(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;
splitting all Chinese characters in chars according to (1)-(3) yields all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;
step S13: calculating the substructure frequency table nfreqparts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes the number of characters of which part_ib is a substructure;
step S14: because the split with k = 1 yields the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);
step S15: (1) when a Chinese character is split into k parts, the contribution degree of each split substructure to the character is 1/k;
(2) when one substructure appears in several split results of the same character, the contribution degree is calculated with the split of minimal k;
(3) if a substructure cannot be obtained from any split of a character, its contribution degree to that character is 0;
calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) yields the contribution matrix charparts with n_parts rows and n_chars columns.
In this embodiment, the step S2 specifically includes:
step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedded vectors;
step S22: encoding each substructure in parts as a one-hot code, the code of part_ib being ponehot_ib, so that the one-hot codes of all substructures are ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};
step S23: for the ib-th substructure, taking ponehot_ib as the central substructure with probability f(nfreq_ib), where f is computed from the frequency nfreq_ib using the minimum function min and a manually set parameter α; a window of size t is then set, where t is a manually set positive integer; the ib-th row of charparts is taken as the probability distribution over characters, t characters are drawn from it, and their character numbers are mapped to substructure numbers through the mapping g and placed in the window as the related substructures; finally, r substructures are drawn at random as the unrelated substructures, where r is a manually set positive integer;
step S24: the computation that embeds a one-hot code into a vector through a substructure embedding matrix is:
emb = ponehot × embs_parts
where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the embedded vector; the one-hot code of the central substructure is embedded into the vector emb1 through embs1;
step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedded vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedded vectors;
step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedded vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedded vectors;
step S27: the loss is calculated and the network is optimized using the following formula:
loss = -Σ_ic logsigmoid(emb2p_ic^T · emb1) - Σ_id logsigmoid(-emb2n_id^T · emb1)
where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the logsigmoid function is:
logsigmoid(x) = log(1 / (1 + e^(-x)))
where x is the argument, e is the natural constant, and log is the logarithm with base e;
step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed repeatedly until the network converges, and embs1 is taken as the trained substructure embedding matrix;
step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, row ia of embschar corresponding to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).
In this embodiment, the step S3 specifically includes:
step S31: selecting a Chinese character to be coded;
step S32: encoding the Chinese character to be coded into a one-hot code using conehots;
step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.
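The training of steps S23-S28 can be condensed into a minimal SGD loop. The toy sizes, fixed sample indices, learning rate, and the skip-gram negative-sampling reading of the loss are all assumptions for illustration; the patent does not specify an optimizer:

```python
import numpy as np

def logsigmoid(x):
    return -np.logaddexp(0.0, -x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
n_parts, m, lr = 10, 6, 0.1                # toy sizes and learning rate
embs1 = 0.1 * rng.standard_normal((n_parts, m))
embs2 = 0.1 * rng.standard_normal((n_parts, m))

def train_step(ib, related, unrelated):
    """One SGD step for centre ib with fixed positive/negative samples."""
    v = embs1[ib].copy()
    grad_v = np.zeros(m)
    loss = 0.0
    for ic in related:                     # pull related substructures closer
        u = embs2[ic].copy()
        s = sigmoid(u @ v)
        loss -= logsigmoid(u @ v)
        grad_v += (s - 1.0) * u            # d/dv of -logsigmoid(u.v)
        embs2[ic] -= lr * (s - 1.0) * v
    for idn in unrelated:                  # push unrelated substructures away
        u = embs2[idn].copy()
        s = sigmoid(u @ v)
        loss -= logsigmoid(-(u @ v))
        grad_v += s * u                    # d/dv of -logsigmoid(-u.v)
        embs2[idn] -= lr * s * v
    embs1[ib] -= lr * grad_v
    return loss

before = train_step(0, [1, 2], [3, 4, 5])
for _ in range(50):
    after = train_step(0, [1, 2], [3, 4, 5])   # loss shrinks as embs1/embs2 train
```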
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.
Claims (4)
1. A Chinese character coding method based on character embedding is characterized by comprising the following steps:
step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to the character, and constructing a substructure contribution degree matrix to each character according to the substructure set;
step S2: constructing a substructure embedding matrix and training according to the obtained substructure set and the contribution matrix of the substructure to each character, and extracting to obtain a character embedding matrix;
step S3: inputting characters, and acquiring character embedding through a character embedding matrix;
the step S1 specifically includes:
step S11: determining the character set to be coded; the ia-th Chinese character is char_ia, and n_chars Chinese characters need to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};
step S12: splitting all Chinese characters in chars to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;
step S13: calculating the substructure frequency table nfreqparts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes the number of characters of which part_ib is a substructure;
each Chinese character can be split into k substructures, and the contribution degree of each split substructure to the character is 1/k; if a substructure cannot be obtained from any split of a character, its contribution degree to that character is 0;
step S14: because the split with k = 1 yields the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);
step S15: calculating the contribution degree of each substructure in parts to each character in chars to obtain the contribution matrix charparts with n_parts rows and n_chars columns;
the step S2 specifically includes:
step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedded vectors;
step S22: encoding each substructure in parts as a one-hot code, the code of part_ib being ponehot_ib, so that the one-hot codes of all substructures are ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};
step S23: for the ib-th substructure, taking ponehot_ib as the central substructure with probability f(nfreq_ib), where f is computed from the frequency nfreq_ib using the minimum function min and a manually set parameter α; a window of size t is then set, where t is a manually set positive integer; the ib-th row of charparts is taken as the probability distribution over characters, t characters are drawn from it, and their character numbers are mapped to substructure numbers through the mapping g and placed in the window as the related substructures; finally, r substructures are drawn at random as the unrelated substructures, where r is a manually set positive integer;
step S24: the computation that embeds a one-hot code into a vector through a substructure embedding matrix is:
emb = ponehot × embs_parts
where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the embedded vector; the one-hot code of the central substructure is embedded into the vector emb1 through embs1;
step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedded vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedded vectors;
step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedded vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedded vectors;
step S27: the loss is calculated and the network is optimized using the following formula:
loss = -Σ_ic logsigmoid(emb2p_ic^T · emb1) - Σ_id logsigmoid(-emb2n_id^T · emb1)
where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the logsigmoid function is:
logsigmoid(x) = log(1 / (1 + e^(-x)))
where x is the argument, e is the natural constant, and log is the logarithm with base e;
step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed repeatedly until the network converges, and embs1 is taken as the trained substructure embedding matrix;
step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, row ia of embschar corresponding to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).
2. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S12 specifically comprises:
(1) presetting that each Chinese character can be split into k substructures;
(2) k is an integer not less than 1, and when k = 1 the split result is the character itself;
(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;
splitting all Chinese characters in chars according to (1)-(3) yields all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.
3. The method for encoding Chinese characters based on character embedding of claim 2, wherein said step S15 specifically comprises:
(1) when a Chinese character is split into k parts, the contribution degree of each split substructure to the character is 1/k;
(2) when one substructure appears in several split results of the same character, the contribution degree is calculated with the split of minimal k;
(3) if a substructure cannot be obtained from any split of a character, its contribution degree to that character is 0;
calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) yields the contribution matrix charparts with n_parts rows and n_chars columns.
4. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S3 specifically comprises:
step S31: selecting a Chinese character to be coded;
step S32: encoding the Chinese character to be coded into a one-hot code using conehots;
step S33: embedding the one-hot code as a low-dimensional vector using embschar.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110001263.4A CN112632911B (en) | 2021-01-04 | 2021-01-04 | Chinese character coding method based on character embedding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112632911A CN112632911A (en) | 2021-04-09 |
CN112632911B true CN112632911B (en) | 2022-05-13 |