CN112632911B - Chinese character coding method based on character embedding - Google Patents
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
Abstract
The invention relates to a Chinese character coding method based on character embedding, which comprises the following steps: step S1: constructing a Chinese character set, decomposing each character into several substructures, constructing a substructure set, defining the contribution degree of each substructure to each character, and building the contribution matrix of the substructures to each character from the substructure set; step S2: constructing substructure embedding matrices according to the obtained substructure set and contribution matrix, training them, and extracting the character embedding matrix; step S3: inputting a character and acquiring its character embedding through the character embedding matrix. The invention can effectively reduce the dimension of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.
Description
Technical Field
The invention relates to the field of pattern recognition and computer vision, in particular to a Chinese character coding method based on character embedding.
Background
Language is one of the main ways humans transmit information, and written words are among the most widespread ways humans transmit information visually.
With the rapid development of technologies such as artificial intelligence and the Internet, automatic recognition of text in images by computer is of great significance. For character recognition tasks, characters are usually encoded by one-hot coding. This coding ignores the correlation among similar characters and is sparse; for recognizing English letters and digits it still works well because the number of categories is small. For Chinese character recognition, however, there are thousands of common characters, so one-hot coding slows network convergence and completely ignores the structural similarity between Chinese characters, resulting in low accuracy and low efficiency of character recognition.
Disclosure of Invention
In view of the above, the present invention provides a Chinese character coding method based on character embedding, which can effectively reduce the dimensionality of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.
In order to achieve the purpose, the invention adopts the following technical scheme:
a Chinese character coding method based on character embedding comprises the following steps:
step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to the character, and constructing a substructure contribution degree matrix to each character according to the substructure set;
step S2: constructing a substructure embedding matrix and training according to the obtained substructure set and the contribution matrix of the substructure to each character, and extracting to obtain a character embedding matrix;
step S3: inputting characters, and acquiring character embedding through a character embedding matrix.
Further, the step S1 is specifically:
step S11: determining the character set to be coded; the ia-th Chinese character is char_ia, and n_chars Chinese characters need to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};
step S12: splitting all Chinese characters in chars to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;
step S13: calculating the substructure frequency table nfreqparts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes the number of characters of which part_ib is a substructure;
step S14: because the split with k = 1 yields the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);
step S15: calculating the contribution degree of each substructure in parts to each character in chars to obtain the contribution matrix charparts with n_parts rows and n_chars columns.
Further, the step S12 is specifically:
(1) presetting that each Chinese character can be split into k substructures;
(2) k is an integer not less than 1, and when k = 1 the split result is the character itself;
(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;
splitting all Chinese characters in chars according to (1)-(3) yields all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.
Further, the step S15 is specifically:
(1) when a Chinese character is split into k parts, the contribution degree of each split substructure to the character is 1/k;
(2) when one substructure appears in several split results of the same character, the contribution degree is calculated with the split of minimal k;
(3) if a substructure cannot be obtained from any split of a character, its contribution degree to that character is 0;
calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) yields the contribution matrix charparts with n_parts rows and n_chars columns.
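Rules (1)-(3) can be sketched in a few lines of Python. The toy characters and splits below are hypothetical placeholders (the patent's real character set is not given), and reading rule (1) as a contribution of 1/k is an assumption, since the formula itself is an image in the original:

```python
import numpy as np

# Hypothetical toy data: two "characters" and their splits, named with
# placeholder ASCII ids; each character lists its k=1 split and any deeper splits.
splits = {
    "hao": [["hao"], ["nv", "zi"]],   # the k=1 split and one k=2 split
    "ma":  [["ma"]],                  # only the k=1 split
}
chars = sorted(splits)                          # chars = {char_ia}
parts = sorted({p for ss in splits.values() for s in ss for p in s})
p_idx = {p: i for i, p in enumerate(parts)}     # substructure index ib

# charparts: n_parts rows, n_chars columns
charparts = np.zeros((len(parts), len(chars)))
for j, ch in enumerate(chars):
    best_k = {}                                 # substructure -> minimal k, rule (2)
    for split in splits[ch]:
        for p in split:
            best_k[p] = min(best_k.get(p, len(split)), len(split))
    for p, k in best_k.items():
        charparts[p_idx[p], j] = 1.0 / k        # rule (1), read as 1/k (assumed)
# rule (3): entries for substructures never split out of a character stay 0
```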
Further, the step S2 is specifically:
step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedded vectors;
step S22: encoding each substructure in parts as a one-hot code, the code of part_ib being ponehot_ib, so that the one-hot codes of all substructures are ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};
step S23: for the ib-th substructure, taking ponehot_ib as the central substructure with probability f(nfreq_ib), where f is computed from the frequency nfreq_ib using the minimum function min and a manually set parameter α; a window of size t is then set, where t is a manually set positive integer; the ib-th row of charparts is taken as the probability distribution over characters, t characters are drawn from it, and their character numbers are mapped to substructure numbers through the mapping g and placed in the window as the related substructures; finally, r substructures are drawn at random as the unrelated substructures, where r is a manually set positive integer;
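The sampling of step S23 can be sketched as follows. The concrete form f(nfreq) = min(alpha/nfreq, 1) is an assumption consistent with the min function and parameter alpha mentioned in the text (the exact formula is an image in the original), and all sizes, frequencies, and the mapping g are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_parts, n_chars = 6, 4
alpha, t, r = 100.0, 3, 5                  # manually set parameters (toy values)
nfreq = np.array([1, 2, 400, 5, 3, 7])     # hypothetical frequency table

def f(nf):
    # Assumed keep probability: subsample very frequent substructures.
    return min(alpha / nf, 1.0)

charparts = rng.random((n_parts, n_chars)) # toy contribution matrix
g = np.array([0, 1, 3, 4])                 # hypothetical mapping: char ia -> part g(ia)

ib = 2
keep_prob = f(nfreq[ib])                   # 0.25: this frequent part is kept 1 time in 4
p = charparts[ib] / charparts[ib].sum()    # row ib as a distribution over characters
drawn = rng.choice(n_chars, size=t, p=p)   # t characters for the window
related = g[drawn]                         # mapped to substructure numbers
unrelated = rng.choice(n_parts, size=r)    # r random negative substructures
```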
step S24: the computation that embeds a one-hot code into a vector through a substructure embedding matrix is:
emb = ponehot × embs_parts
where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the embedded vector; the one-hot code of the central substructure is embedded into the vector emb1 through embs1;
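Step S24 is an ordinary embedding lookup: multiplying a one-hot row vector by the matrix selects the corresponding row of it. A minimal sketch with toy sizes:

```python
import numpy as np

n_parts, m = 5, 4                          # toy sizes; m is the embedding dimension
rng = np.random.default_rng(1)
embs1 = rng.standard_normal((n_parts, m))  # substructure embedding matrix

ib = 3
ponehot = np.zeros(n_parts)
ponehot[ib] = 1.0                          # one-hot code of the ib-th substructure
emb1 = ponehot @ embs1                     # emb = ponehot x embs: selects row ib
```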
step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedded vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedded vectors;
step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedded vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedded vectors;
step S27: the loss is calculated and the network is optimized using the following formula:
loss = -Σ_ic logsigmoid(emb2p_ic^T · emb1) - Σ_id logsigmoid(-emb2n_id^T · emb1)
where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the logsigmoid function is:
logsigmoid(x) = log(1 / (1 + e^(-x)))
where x is the argument, e is the natural constant, and log is the logarithm with base e;
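The loss of step S27, read as the standard skip-gram negative-sampling objective (an assumption consistent with the transposes and logsigmoid terms described), can be sketched with random toy vectors:

```python
import numpy as np

def logsigmoid(x):
    # log(1 / (1 + e^{-x})), computed stably via log-add-exp
    return -np.logaddexp(0.0, -x)

def sgns_loss(emb1, emb2ps, emb2ns):
    # loss = -sum_ic logsigmoid(emb2p_ic^T emb1) - sum_id logsigmoid(-emb2n_id^T emb1)
    pos = logsigmoid(emb2ps @ emb1).sum()     # t related (positive) substructures
    neg = logsigmoid(-(emb2ns @ emb1)).sum()  # r unrelated (negative) substructures
    return -(pos + neg)

rng = np.random.default_rng(2)
m, t, r = 8, 3, 5                          # toy dimensions
emb1 = rng.standard_normal(m)
emb2ps = rng.standard_normal((t, m))
emb2ns = rng.standard_normal((r, m))
loss = sgns_loss(emb1, emb2ps, emb2ns)     # each term is positive, so loss > 0
```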
step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed repeatedly until the network converges, and embs1 is taken as the trained substructure embedding matrix;
step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, row ia of embschar corresponding to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).
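The extraction in step S29 reduces to gathering rows of embs1 through g; the mapping and sizes below are hypothetical:

```python
import numpy as np

n_parts, n_chars, m = 7, 3, 4              # toy sizes
rng = np.random.default_rng(3)
embs1 = rng.standard_normal((n_parts, m))  # stands in for the trained matrix

g = np.array([0, 2, 5])                    # hypothetical mapping g: char ia -> part g(ia)
embschar = embs1[g]                        # row ia of embschar = row g(ia) of embs1

ponehots = np.eye(n_parts)                 # substructure one-hot table
conehots = ponehots[g]                     # conehot_ia = ponehot_g(ia)
```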
Further, the step S3 is specifically:
step S31: selecting a Chinese character to be coded;
step S32: encoding the Chinese character to be coded into a one-hot code using conehots;
step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.
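Steps S31-S33 can be sketched end to end; the character list and the pretrained embschar below are placeholders:

```python
import numpy as np

chars = ["char_a", "char_b", "char_c"]        # placeholder character ids
n_chars, m = len(chars), 4
rng = np.random.default_rng(4)
embschar = rng.standard_normal((n_chars, m))  # stands in for the matrix from step S29

def encode(ch):
    ia = chars.index(ch)                      # steps S31/S32: build the one-hot code
    conehot = np.zeros(n_chars)
    conehot[ia] = 1.0
    return conehot @ embschar                 # step S33: embed into a low-dim vector

vec = encode("char_b")
```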
Compared with the prior art, the invention has the following beneficial effects:
the invention can effectively reduce the dimension of Chinese character coding, enables the Chinese character coding with similar structure to have positive correlation, and effectively improves the character recognition efficiency
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to FIG. 1, the present invention provides a Chinese character coding method based on character embedding, comprising the following steps:
step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution of each substructure to the character, and constructing a contribution matrix of the substructures to each character according to the substructure set;
step S2: constructing a substructure embedding matrix and training according to the obtained substructure set and the contribution matrix of the substructure to each character, and extracting to obtain a character embedding matrix;
step S3: inputting characters, and acquiring character embedding through a character embedding matrix.
In this embodiment, the step S1 specifically includes:
step S11: determining the character set to be coded; the ia-th Chinese character is char_ia, and n_chars Chinese characters need to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};
step S12: (1) presetting that each Chinese character can be split into k substructures;
(2) k is an integer not less than 1, and when k = 1 the split result is the character itself;
(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;
splitting all Chinese characters in chars according to (1)-(3) yields all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;
step S13: calculating the substructure frequency table nfreqparts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes the number of characters of which part_ib is a substructure;
step S14: because the split with k = 1 yields the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);
step S15: (1) when a Chinese character is split into k parts, the contribution degree of each split substructure to the character is 1/k;
(2) when one substructure appears in several split results of the same character, the contribution degree is calculated with the split of minimal k;
(3) if a substructure cannot be obtained from any split of a character, its contribution degree to that character is 0;
calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) yields the contribution matrix charparts with n_parts rows and n_chars columns.
In this embodiment, the step S2 specifically includes:
step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedded vectors;
step S22: encoding each substructure in parts as a one-hot code, the code of part_ib being ponehot_ib, so that the one-hot codes of all substructures are ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};
step S23: for the ib-th substructure, taking ponehot_ib as the central substructure with probability f(nfreq_ib), where f is computed from the frequency nfreq_ib using the minimum function min and a manually set parameter α; a window of size t is then set, where t is a manually set positive integer; the ib-th row of charparts is taken as the probability distribution over characters, t characters are drawn from it, and their character numbers are mapped to substructure numbers through the mapping g and placed in the window as the related substructures; finally, r substructures are drawn at random as the unrelated substructures, where r is a manually set positive integer;
step S24: the computation that embeds a one-hot code into a vector through a substructure embedding matrix is:
emb = ponehot × embs_parts
where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the embedded vector; the one-hot code of the central substructure is embedded into the vector emb1 through embs1;
step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedded vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedded vectors;
step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedded vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedded vectors;
step S27: the loss is calculated and the network is optimized using the following formula:
loss = -Σ_ic logsigmoid(emb2p_ic^T · emb1) - Σ_id logsigmoid(-emb2n_id^T · emb1)
where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the logsigmoid function is:
logsigmoid(x) = log(1 / (1 + e^(-x)))
where x is the argument, e is the natural constant, and log is the logarithm with base e;
step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed repeatedly until the network converges, and embs1 is taken as the trained substructure embedding matrix;
step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, row ia of embschar corresponding to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).
In this embodiment, the step S3 specifically includes:
step S31: selecting a Chinese character to be coded;
step S32: encoding the Chinese character to be coded into a one-hot code using conehots;
step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.
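The training of steps S23-S28 can be condensed into a minimal SGD loop. The toy sizes, fixed sample indices, learning rate, and the skip-gram negative-sampling reading of the loss are all assumptions for illustration; the patent does not specify an optimizer:

```python
import numpy as np

def logsigmoid(x):
    return -np.logaddexp(0.0, -x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
n_parts, m, lr = 10, 6, 0.1                # toy sizes and learning rate
embs1 = 0.1 * rng.standard_normal((n_parts, m))
embs2 = 0.1 * rng.standard_normal((n_parts, m))

def train_step(ib, related, unrelated):
    """One SGD step for centre ib with fixed positive/negative samples."""
    v = embs1[ib].copy()
    grad_v = np.zeros(m)
    loss = 0.0
    for ic in related:                     # pull related substructures closer
        u = embs2[ic].copy()
        s = sigmoid(u @ v)
        loss -= logsigmoid(u @ v)
        grad_v += (s - 1.0) * u            # d/dv of -logsigmoid(u.v)
        embs2[ic] -= lr * (s - 1.0) * v
    for idn in unrelated:                  # push unrelated substructures away
        u = embs2[idn].copy()
        s = sigmoid(u @ v)
        loss -= logsigmoid(-(u @ v))
        grad_v += s * u                    # d/dv of -logsigmoid(-u.v)
        embs2[idn] -= lr * s * v
    embs1[ib] -= lr * grad_v
    return loss

before = train_step(0, [1, 2], [3, 4, 5])
for _ in range(50):
    after = train_step(0, [1, 2], [3, 4, 5])   # loss shrinks as embs1/embs2 train
```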
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.
Claims (4)
1. A Chinese character coding method based on character embedding is characterized by comprising the following steps:
step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to the character, and constructing a substructure contribution degree matrix to each character according to the substructure set;
step S2: constructing a substructure embedding matrix and training according to the obtained substructure set and the contribution matrix of the substructure to each character, and extracting to obtain a character embedding matrix;
step S3: inputting characters, and acquiring character embedding through a character embedding matrix;
the step S1 specifically includes:
step S11: determining the character set to be coded; the ia-th Chinese character is char_ia, and n_chars Chinese characters need to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};
step S12: splitting all Chinese characters in chars to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;
step S13: calculating the substructure frequency table nfreqparts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes the number of characters of which part_ib is a substructure;
each Chinese character can be split into k substructures, and the contribution degree of each split substructure to the character is 1/k; if a substructure cannot be obtained from any split of a character, its contribution degree to that character is 0;
step S14: because the split with k = 1 yields the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);
step S15: calculating the contribution degree of each substructure in parts to each character in chars to obtain the contribution matrix charparts with n_parts rows and n_chars columns;
the step S2 specifically includes:
step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedded vectors;
step S22: encoding each substructure in parts as a one-hot code, the code of part_ib being ponehot_ib, so that the one-hot codes of all substructures are ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};
step S23: for the ib-th substructure, taking ponehot_ib as the central substructure with probability f(nfreq_ib), where f is computed from the frequency nfreq_ib using the minimum function min and a manually set parameter α; a window of size t is then set, where t is a manually set positive integer; the ib-th row of charparts is taken as the probability distribution over characters, t characters are drawn from it, and their character numbers are mapped to substructure numbers through the mapping g and placed in the window as the related substructures; finally, r substructures are drawn at random as the unrelated substructures, where r is a manually set positive integer;
step S24: the computation that embeds a one-hot code into a vector through a substructure embedding matrix is:
emb = ponehot × embs_parts
where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the embedded vector; the one-hot code of the central substructure is embedded into the vector emb1 through embs1;
step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedded vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedded vectors;
step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedded vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedded vectors;
step S27: the loss is calculated and the network is optimized using the following formula:
loss = -Σ_ic logsigmoid(emb2p_ic^T · emb1) - Σ_id logsigmoid(-emb2n_id^T · emb1)
where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the logsigmoid function is:
logsigmoid(x) = log(1 / (1 + e^(-x)))
where x is the argument, e is the natural constant, and log is the logarithm with base e;
step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed repeatedly until the network converges, and embs1 is taken as the trained substructure embedding matrix;
step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, row ia of embschar corresponding to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).
2. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S12 specifically comprises:
(1) presetting that each Chinese character can be split into k substructures;
(2) k is an integer not less than 1, and when k = 1 the split result is the character itself;
(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;
splitting all Chinese characters in chars according to (1)-(3) yields all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.
3. The method for encoding Chinese characters based on character embedding of claim 2, wherein said step S15 specifically comprises:
(1) when a Chinese character is split into k parts, the contribution degree of each split substructure to the character is 1/k;
(2) when one substructure appears in several split results of the same character, the contribution degree is calculated with the split of minimal k;
(3) if a substructure cannot be obtained from any split of a character, its contribution degree to that character is 0;
calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) yields the contribution matrix charparts with n_parts rows and n_chars columns.
4. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S3 specifically comprises:
step S31: selecting a Chinese character to be coded;
step S32: encoding the Chinese character to be coded into a one-hot code using conehots;
step S33: embedding the one-hot code as a low-dimensional vector using embschar.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110001263.4A CN112632911B (en) | 2021-01-04 | 2021-01-04 | Chinese character coding method based on character embedding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112632911A CN112632911A (en) | 2021-04-09 |
CN112632911B true CN112632911B (en) | 2022-05-13 |