CN110610006B - Morphological double-channel Chinese word embedding method based on strokes and fonts - Google Patents

Morphological double-channel Chinese word embedding method based on strokes and fonts

Info

Publication number
CN110610006B
CN110610006B (application CN201910881062.0A)
Authority
CN
China
Prior art keywords: word, Chinese, character, morphological, level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910881062.0A
Other languages
Chinese (zh)
Other versions
CN110610006A (en)
Inventor
陈恩红
刘淇
徐童
童世炜
陶汉卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201910881062.0A priority Critical patent/CN110610006B/en
Publication of CN110610006A publication Critical patent/CN110610006A/en
Application granted granted Critical
Publication of CN110610006B publication Critical patent/CN110610006B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a morphological double-channel Chinese word embedding method based on strokes and fonts, which comprises the following steps: obtaining a Chinese text and obtaining a corresponding word sequence through preprocessing; splitting each word in the word sequence into a plurality of Chinese characters, and modeling the extraction of character-level morphological features, character-level features, and word-level features according to the stroke-order information and glyph image information of the Chinese characters, so as to obtain a word embedding representation suited to the characteristics of Chinese. The method can enhance the word embedding effect and provide technical support for practice in fields such as Chinese natural language processing and text mining.

Description

Morphological double-channel Chinese word embedding method based on strokes and fonts
Technical Field
The invention relates to the field of natural language processing, in particular to a morphological double-channel Chinese word embedding method based on strokes and fonts.
Background
Natural language is a complex system used by humans to express and convey information, and within this system words are the basic units of meaning. Word vectors, as the name implies, are vectors used to represent words, and can also be regarded as feature vectors or representations of words. The technique of mapping words to vectors in the real-number domain is called word embedding. As a cornerstone of natural language tasks, word embedding has been a subject of extensive research.
In recent years, the globalization of information has caused text on the Internet to grow explosively, and the proportion and influence of Chinese text keep increasing; natural language processing methods for Chinese, especially the word embedding methods that underlie these tasks, are therefore receiving more and more attention. Chinese is a language derived from pictographs and has very rich morphological meaning, reflected not only in one-dimensional stroke-order features but also in the glyphs of two-dimensional space. Recent studies have demonstrated that characterizing morphological features aids feature capture in word embedding. Enhancing Chinese word embedding with morphological information has thus become an important issue for Chinese natural language processing tasks.
Currently, word embedding methods for Chinese either migrate methods designed for alphabetic languages (represented by English) or characterize the morphological features of Chinese, such as strokes and glyphs, independently of each other. The former ignores that Chinese is a morphemic language essentially different from alphabetic languages such as English, and therefore performs poorly when applied to Chinese text processing. The latter splits the morphological features apart and cannot effectively capture every dimension of the morphology, and is therefore very limited. How to fully utilize morphological features to enhance Chinese word embedding thus still presents many opportunities and challenges.
Disclosure of Invention
The invention aims to provide a morphological double-channel Chinese word embedding method based on strokes and fonts, which can enhance the word embedding effect and provide technical support for practice in fields such as Chinese natural language processing and text mining.
The aim of the invention is achieved through the following technical scheme:
a morphological double-channel Chinese word embedding method based on strokes and fonts comprises the following steps:
obtaining a Chinese text, and obtaining a corresponding word sequence through preprocessing;
each word in the word sequence is split into a plurality of Chinese characters, and the extraction of character-level morphological features, character-level features, and word-level features is modeled according to the stroke-order information and glyph image information of the Chinese characters, so that a word embedding representation suited to the characteristics of Chinese is obtained.
According to the technical scheme provided by the invention, word embedding modeling is performed on Chinese text with the morphological double-channel Chinese word embedding method based on strokes and fonts. Compared with traditional processing methods, the method can characterize Chinese text as vectors more effectively by means of stroke-order and glyph information, thereby providing richer morphological information, good interpretability, and better downstream feature data for Internet natural language processing tasks. The method has practical application value and can bring potential economic benefits to related text information platforms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for embedding morphological double-channel Chinese words based on strokes and fonts provided by an embodiment of the invention;
FIG. 2 is a formal descriptive diagram of morphological features of Chinese characters according to an embodiment of the present invention;
FIG. 3 is a diagram of a model framework of a method for morphological dual-channel Chinese word embedding based on strokes and fonts, which is provided by an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As shown in FIG. 1, an embodiment of the invention provides a morphological double-channel Chinese word embedding method based on strokes and fonts; FIG. 3 shows the corresponding model framework. The method mainly comprises the following steps:
step 1, acquiring a Chinese text, and obtaining a corresponding word sequence through preprocessing.
In the embodiment of the invention, a specified number of Chinese text corpus data sets (used for model training; the specific number can be chosen according to the actual situation) are crawled in advance from an open-source Chinese corpus; the stroke-order information of Chinese characters is crawled from open-source dictionary data, and comprises the Chinese characters and their stroke sequences (up to 32 strokes); and the glyph image information of the Chinese characters (28×28, 1-bit) is generated.
To ensure the model effect, preprocessing is required before modeling.
1) Performing word segmentation on the Chinese text.
In the embodiment of the invention, the Chinese text needs to be divided into word sequences.
2) Removing texts whose word count in the segmentation result is smaller than a set value.
In the present example, some lower-quality data needs to be removed. Texts containing fewer words than a set value are generally considered to carry only a small amount of information and to be of low quality. The set value here may be, for example, 5.
3) Removing stop words to obtain the corresponding word sequence.
In the present example, certain stop words, such as "you" and "this", also need to be removed. Stop words in text content are generally high-frequency function words whose semantics vary greatly with context; they are unsuited to word embedding tasks and need to be removed. The whole preprocessing pipeline is sketched below.
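By way of illustration only, the preprocessing above might be sketched as follows in Python (the jieba segmenter and the stop-word list are assumptions; the patent does not name a specific segmentation tool):

```python
import jieba  # assumed segmenter; the patent does not prescribe a specific tool

MIN_WORDS = 5                 # the "set value" for dropping short texts
STOPWORDS = {"you", "this"}   # illustrative stop-word list only

def preprocess(texts):
    """Segment each text, drop short texts, and remove stop words."""
    sequences = []
    for text in texts:
        words = [w for w in jieba.lcut(text) if w.strip()]  # 1) segmentation
        if len(words) < MIN_WORDS:                          # 2) drop low-quality texts
            continue
        sequences.append([w for w in words if w not in STOPWORDS])  # 3) stop words
    return sequences
```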
Step 2: splitting each word in the word sequence into a plurality of Chinese characters, and modeling the extraction of character-level morphological features, character-level features, and word-level features according to the stroke-order information and glyph information of the Chinese characters, so as to obtain a word embedding representation suited to the characteristics of Chinese.
A word embedding representation expresses the abstract linguistic and semantic units of human language (words) as real-valued mathematical vectors, such that these vectors reflect the semantic relations among different words as accurately as possible. High-quality word embeddings are the cornerstone of downstream tasks in natural language processing: any natural language task (e.g., text classification) must first convert text from an abstract concept into real-valued vectors a computer can process, and once the real-valued vector representation of the text is obtained, the subsequent deep-learning modeling (each downstream task having its own processing pipeline) can proceed. A text is composed of words, so the representation of the text ultimately depends on superposing the word embedding representations (how they are superposed depends on the word embedding model and algorithm).
It should be noted that the embodiment of the present invention only protects the way the word embedding representation is extracted; the specific natural language task that uses the extracted word embeddings can be determined by those skilled in the art according to their own needs. For example, it may be the text classification task mentioned above, and the subsequent procedures involved may be implemented with reference to conventional techniques.
In fact, Chinese and English differ greatly in both linguistic nature and writing system, especially in the characterization of morphological features, and conventional word embedding methods cannot capture the morphological characteristics of Chinese well. To model the complex morphological structure and characteristics of Chinese words, the morphological double-channel word embedding model (DWE) based on strokes and glyphs is designed. As shown in FIG. 3, the DWE model characterizes features at three granularities: character-level morphological features (stroke order, glyph), character-level features (fusion of stroke-order features and glyph features), and word-level features (fusion of character-level features). Each input word is first split into its Chinese characters; the stroke sequence and glyph image of each character are obtained from the collected stroke-order and glyph information; and the character-level morphological features, character-level features, and word-level features are then extracted step by step.
1. Character-level morphological features.
The character-level morphological features mainly include: the stroke-order features of the one-dimensional sequence channel and the glyph features of the two-dimensional spatial channel.
1) Stroke-order features of the one-dimensional sequence channel
Each Chinese character can be decomposed into a definite, invariant stroke sequence, i.e., its stroke order, and the strokes in this sequence can be combined into morphological components such as components and radicals, analogous to prefixes, suffixes, and roots in English. These particular stroke subsequences, i.e., morphological components, can reflect inherent, shared semantics. As shown in the upper half of FIG. 2, the Chinese character 驾 ("drive") can be decomposed into a sequence of eight strokes, where the last three strokes together correspond to its radical 马 ("horse"), while the first two strokes (力, "force") and the middle three (口, "mouth") depict the original real-world scene of the action "drive" (pulling the reins with force while the mouth issues commands).
In the embodiment of the invention, the stroke-order features of the one-dimensional sequence channel among the character-level morphological features are extracted as follows:
for the Chinese character c, determining its stroke order according to the stroke-order information of Chinese characters to obtain the corresponding stroke sequence;
setting a sliding window of size n to extract the subword combinations of the stroke order;
adding the boundary symbols "<" and ">" to the head and tail of the stroke sequence of the Chinese character c to obtain a new stroke sequence;
sequentially disassembling the sequence from front to back into stroke combinations of n strokes each, and taking the boundary-marked stroke order as a special subword;
the set of subword combinations finally contained in the Chinese character c is denoted G(c).
For example, for the Chinese character 打 ("beat"): if n is 1, the sequence is disassembled stroke by stroke into single-stroke units (horizontal, vertical, left-falling, and so on) together with the boundary symbols "<" and ">"; if n is 2, it is disassembled two strokes at a time; if n is 3, it is disassembled three strokes at a time, and the decomposition contains a radical that expresses a specific meaning: the "hand" radical 扌.
The extraction of the stroke-order features by subword embedding is illustrated with the character 驾 ("drive"), whose stroke order is written here as the eight-stroke sequence $s_1 s_2 \cdots s_8$.

Set a sliding window with n = 3, and add the two special boundary symbols "<" and ">" to the head and tail of the stroke order to obtain the new stroke sequence:

$$\langle\, s_1\ s_2\ s_3\ s_4\ s_5\ s_6\ s_7\ s_8\, \rangle$$

Sequentially disassemble the sequence from front to back into groups of three strokes:

$$\langle s_1 s_2,\quad s_1 s_2 s_3,\quad s_2 s_3 s_4,\quad s_3 s_4 s_5,\quad s_4 s_5 s_6,\quad s_5 s_6 s_7,\quad s_6 s_7 s_8,\quad s_7 s_8 \rangle$$
Meanwhile, the whole boundary-marked stroke order is taken as a special subword, namely:
$$\langle\, s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8\, \rangle$$
Denote the set of subword combinations over all Chinese characters as G. For any Chinese character c, the subword combinations it contains form G(c), and each subword $g \in G(c)$ is assigned a subword feature vector $\vec{z}_g$, whose values are parameters to be optimized; training and optimization take place in the subsequent stage.
It can be understood by those skilled in the art that the boundary symbols, as the name implies, mark the boundaries distinguishing different Chinese characters: "<" and ">" are additionally added. For example, the Chinese character 驾 ("drive") has 8 strokes; "<" is added before the disassembled 8-stroke sequence and ">" at its end, giving 10 symbols in total. When extracting stroke n-grams, a window of size n is scanned from left to right, and letting n take each value in an interval yields a set of stroke n-grams; this set is exactly the set of subword combinations G(c) corresponding to the Chinese character c. "n-gram" is the technical term for such a sub-phrase, a special phrase obtained from the stroke order and used as a feature in the word embedding method; the preceding paragraphs explain in detail how the 3-gram operates and which 3-gram sub-phrases are obtained. A minimal code sketch of this extraction follows.
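A minimal sketch of this subword extraction, assuming each character's strokes are given as a list of symbols and that n ranges over an assumed interval of window sizes:

```python
def stroke_ngrams(strokes, n_values=(3, 4, 5)):
    """Return the subword combination set G(c) for one character.

    strokes:  list of stroke symbols of character c, in stroke order.
    n_values: assumed interval of sliding-window sizes n.
    """
    seq = ["<"] + list(strokes) + [">"]      # add boundary symbols
    grams = set()
    for n in n_values:
        for i in range(len(seq) - n + 1):    # scan from front to back
            grams.add("".join(seq[i:i + n]))
    grams.add("".join(seq))                  # whole bounded order as a special subword
    return grams

# e.g., an 8-stroke character yields the bounded 10-symbol sequence and its
# 3-grams, matching the worked example above:
# stroke_ngrams(["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"], n_values=(3,))
```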
2) Glyph features of the two-dimensional spatial channel
Since Chinese is a morphemic language derived from the oracle bone script (a pictographic writing system), its spatial structure, i.e., the glyph, can also convey rich semantic information. The key reason Chinese characters carry such rich morphological information is that the same strokes can convey different semantics when combined in different ways in two-dimensional space. As shown in the lower half of FIG. 2, the three Chinese characters 人 ("person"), 入 ("enter"), and 八 ("eight") consist of identical stroke sequences, yet have completely different semantics because the spatial combinations of their strokes differ. The glyph features are extracted with a convolutional neural network as follows:
For the Chinese character c, the corresponding glyph image $I_c$ is obtained from the glyph image information, and the glyph features are extracted with a LeNet-style convolutional neural network, yielding the glyph feature vector $\mathrm{CNN}(I_c)$.
the structure of the CNN network and related parameters are given below by way of example.
The CNN network includes: input layer, C1 layer (first convolution layer), S2 layer (first pooling layer), C3 layer (second convolution layer), S4 layer (second pooling layer), F5 layer (fully connected layer), output layer.
The parameters of each layer are as follows:
layer C1: 20 convolution kernels, each convolution kernel size 5x5;
s2 layer: maximum pooling core (MaxPooling), pool core size 2x2;
layer C3: 50 convolution kernels, each convolution kernel 5x5 in size;
s4 layer: maximum pooling core (MaxPooling), pool core size 2x2;
f5 layer: the dimension 500 is output.
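Under the layer parameters listed above (28×28 single-channel input, 5×5 kernels, 2×2 max pooling, 500-dimensional output), the glyph channel could be sketched in PyTorch as follows; the activation functions are an assumption, since the patent lists only the layer shapes:

```python
import torch
import torch.nn as nn

class GlyphCNN(nn.Module):
    """LeNet-style feature extractor for 28x28 single-channel glyph images."""

    def __init__(self, out_dim=500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5),   # C1: 20 kernels 5x5 -> 20x24x24
            nn.ReLU(),                         # activation assumed, not specified
            nn.MaxPool2d(2),                   # S2: 2x2 max pooling -> 20x12x12
            nn.Conv2d(20, 50, kernel_size=5),  # C3: 50 kernels 5x5 -> 50x8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                   # S4: 2x2 max pooling -> 50x4x4
        )
        self.fc = nn.Linear(50 * 4 * 4, out_dim)  # F5: output dimension 500

    def forward(self, glyph):                  # glyph: (batch, 1, 28, 28)
        return self.fc(self.features(glyph).flatten(1))  # CNN(I_c)
```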
2. Character-level features
Morphological information in Chinese consists of two parts: the one-dimensional sequence information represented by the stroke order and the two-dimensional spatial information represented by the glyph; the two need to be combined when representing character-level features.
In the embodiment of the invention, the two are combined by a component-combination operation, as follows:
for Chinese character c, the stroke order of the one-dimensional sequence channel is characterized by a sub-word combination G (c), and each element contains a sub-word feature vector
Figure BDA0002205923780000061
The two-dimensional spatial channel is characterized by a character pattern CNN (I c ) The method comprises the steps of carrying out a first treatment on the surface of the Obtaining character level characteristic representation of Chinese character c by using component combination operation>
Figure BDA0002205923780000062
Figure BDA0002205923780000063
Where x is a component combination operator, there are various choices, such as addition and dot product.
3. Word-level features
The word-level features are obtained by fusing the character-level features: the character-level vectors of the Chinese characters in each word w are accumulated and summed ($N_c$ being the number of Chinese characters contained in the word), giving the character-composed representation

$$\vec{m}_w = \frac{1}{N_c} \sum_{c \in w} \vec{v}_c$$

which is then component-combined with the word-level representation $\vec{q}_w$ to obtain the word-level feature $\vec{h}_w$, namely:

$$\vec{h}_w = \vec{q}_w \oplus \vec{m}_w$$

where $\oplus$ represents vector addition. Through the feature characterization at these three granularities, the representation vector $\vec{h}_w$ of each word w is obtained.
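By way of illustration, a minimal sketch of this bottom-up composition, assuming addition as the component-combination operator ⊗ and equal dimensions for all vectors (the function names are illustrative, not part of the claimed method):

```python
import torch

def character_vector(subword_vecs, glyph_feat):
    """v_c: combine the summed subword vectors z_g with CNN(I_c).

    Addition is one assumed choice of the component-combination operator.
    """
    return torch.stack(subword_vecs).sum(dim=0) + glyph_feat

def word_vector(char_vecs, word_level_vec):
    """h_w: word-level vector q_w (+) average of the N_c character vectors v_c."""
    morph = torch.stack(char_vecs).mean(dim=0)  # accumulate over characters
    return word_level_vec + morph               # (+) is vector addition
```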
In the training stage, the model is optimized and trained with the pre-crawled Chinese text corpus dataset D of the specified size. For a word w, a set of negative samples of size λ (typically 5) is drawn from the distribution P (typically the unigram distribution), and the final optimization objective is optimized by maximum likelihood estimation:

$$\mathcal{L} = \sum_{w \in D} \sum_{e \in T(w)} \Big[ \log \sigma\big(s(w, e)\big) + \lambda\, \mathbb{E}_{e' \sim P}\big[ \log \sigma\big(-s(w, e')\big) \big] \Big]$$

where s(w, e) represents the similarity function in the skip-word (skip-gram) model, w is the center word, e is a window background word of the center word w, T(w) is the set of context-window words of the center word w, λ is the number of negative samples per center word w, e' is a negative-sample noise word obtained by negative sampling, $\mathbb{E}$ is the expectation function term, and σ is the sigmoid function.
In the embodiment of the invention, the similarity function in the skip-word model is expressed as:

$$s(w, e) = \vec{h}_w \cdot \vec{h}_e$$

where $\vec{h}_w$ and $\vec{h}_e$ are the vector representations of word w and word e, respectively.
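A minimal sketch of the corresponding negative-sampling loss for one (center word, background word) pair, approximating the expectation over P by λ sampled noise words as is standard; the names are illustrative:

```python
import torch
import torch.nn.functional as F

def pair_loss(h_center, h_context, h_negatives):
    """Negative of log sigma(s(w,e)) + sum_k log sigma(-s(w,e'_k)).

    h_center:    h_w, shape (dim,).
    h_context:   h_e of one window background word, shape (dim,).
    h_negatives: vectors of lambda noise words drawn from P, shape (lambda, dim).
    """
    pos = F.logsigmoid(torch.dot(h_center, h_context))   # s(w, e) = h_w . h_e
    neg = F.logsigmoid(-(h_negatives @ h_center)).sum()  # -s(w, e') terms
    return -(pos + neg)                                  # minimizing this maximizes the objective
```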
After training, the performance of the model is evaluated with the test task dataset.
It will be appreciated by those skilled in the art that both the training phase and the testing phase proceed in the manner described in steps 1-2 above.
According to the scheme provided by the embodiment of the invention, a Chinese text can be segmented, using a Chinese word segmentation tool together with the stroke-order and glyph features of Chinese characters, and mapped into features of three granularities, namely: the Chinese characters contained in each word, the stroke sequence corresponding to each character, and the glyph image of each character. Word sequences and character sequences are the two features most commonly used in word embedding tasks, while the stroke order and glyph of characters are two very important features of Chinese morphology: they describe the one-dimensional sequential and two-dimensional spatial morphological features of Chinese respectively, and carry more implicit, lower-level Chinese semantic information. The low-level stroke-order and glyph features are fused and then composed upward step by step, so that Chinese morphological features are integrated into the modeling of word embedding, providing richer linguistic features for the word embedding model.
The above description mainly covers the scheme of the invention; the following introduces the technology related to word embedding tasks, to facilitate understanding of the invention.
For word embedding tasks, the goal is to represent each word in the text as a vector of fixed dimension and to make these vectors express the similarity and analogy relations between different words well. For two words x and y, the similarity is defined as the cosine of the angle between their vectorized representations $\vec{x}$ and $\vec{y}$, i.e., the cosine similarity:

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\, \|\vec{y}\|}$$

The numerator of this formula is denoted as the similarity function $s(x, y) = \vec{x} \cdot \vec{y}$. More formally, the task is: given a set of Chinese text data, learn and iteratively update the embedded representation of words under the assumption that words co-occurring in the text data have greater similarity, so that the resulting embeddings achieve better accuracy when applied to similarity and analogy tasks. For example, in a similarity or analogy task, the word vectors of the words to be compared are looked up as in a dictionary; to compare the two words "king" and "queen", their word vectors are retrieved and their cosine similarity is computed.
The skip-word (skip-gram, word2vec) model is used; take as an example the sentence "A lark flies over from the blue sky", denoted S. First, a word segmentation tool splits it into the word sequence T. Take "blue sky" as the center word c and set the background-word window size to 2. The problem is then embodied as: given the center word c, generate the conditional probability of the background words no more than two words away from it. More specifically, each word is represented by two d-dimensional vectors that are used to compute this conditional probability: for a word with index i in the dictionary, its vector is $\vec{v}_i$ when it acts as the center word and $\vec{u}_i$ when it acts as a background word. Let the center word be $w_c$ with vector $\vec{v}_c$ and the background word be $w_e$ with vector $\vec{u}_e$. The conditional probability of generating the background word given the center word is obtained by a softmax operation on the vector inner product:

$$P(w_e \mid w_c) = \frac{\exp(\vec{u}_e^{\top} \vec{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\vec{u}_i^{\top} \vec{v}_c)}$$

where $\mathcal{V}$ is the dictionary index set.
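A small illustrative sketch of this softmax computation, assuming the background-word vectors u_i are stored as the rows of a matrix:

```python
import numpy as np

def skipgram_conditional(v_center, U):
    """P(w_e | w_c) for every dictionary word: softmax of u_i . v_c.

    v_center: center-word vector v_c, shape (d,).
    U:        background-word vectors u_i as rows, shape (|V|, d).
    """
    scores = U @ v_center
    scores -= scores.max()        # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```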
Given a text sequence of length N, let the word at time step t be $w^{(t)}$. Assuming that background words are generated independently given the center word, with background window size m the probability of generating all background words given each center word, i.e., the likelihood function, is:

$$\prod_{t=1}^{N} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P\big(w^{(t+j)} \mid w^{(t)}\big)$$
all time steps less than 1 and greater than N are ignored. The goal is to maximize the likelihood function described above.
However, the events in the above model only consider positive samples: the joint probability above is maximized to 1 when all word vectors are equal and take infinite values, and such word vectors are clearly meaningless. Thus, for each word w, a set of λ negative samples (typically λ = 5) is drawn from the distribution P (typically a unigram distribution); with events involving positive and negative samples assumed mutually independent, let:

$$P\big(w^{(t+j)} \mid w^{(t)}\big) = P\big(D=1 \mid w^{(t)}, w^{(t+j)}\big) \prod_{k=1,\; e_k \sim P}^{\lambda} P\big(D=0 \mid w^{(t)}, e_k\big)$$

where $P(D=1 \mid w^{(t)}, w^{(t+j)}) = \sigma\big(s(w^{(t)}, w^{(t+j)})\big)$ is the probability that the background word $w^{(t+j)}$ appears in the window of the center word $w^{(t)}$, with the sigmoid function

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

Rewriting the joint probability, which considers the positive samples to the maximum degree, as a log-likelihood function yields the objective to be optimized:

$$\sum_{t=1}^{N} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \Big[ \log \sigma\big(s(w^{(t)}, w^{(t+j)})\big) + \sum_{k=1,\; e_k \sim P}^{\lambda} \log \sigma\big(-s(w^{(t)}, e_k)\big) \Big]$$
where the second term $\sum_{k=1,\; e_k \sim P}^{\lambda} \log \sigma\big(-s(w^{(t)}, e_k)\big)$ is equivalent to the expectation term $\lambda\, \mathbb{E}_{e' \sim P}\big[\log \sigma\big(-s(w^{(t)}, e')\big)\big]$.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. A morphological double-channel Chinese word embedding method based on strokes and fonts, characterized by comprising the following steps:
obtaining a Chinese text, and obtaining a corresponding word sequence through preprocessing;
splitting each word in the word sequence into a plurality of Chinese characters, and modeling the extraction of character-level morphological features, character-level features, and word-level features according to the stroke-order information and glyph image information of the Chinese characters, so as to obtain a word embedding representation suited to the characteristics of Chinese;
the character-level morphological features include: the stroke-order features of the one-dimensional sequence channel and the glyph features of the two-dimensional spatial channel;
the stroke-order features of the one-dimensional sequence channel among the character-level morphological features are extracted as follows: for the Chinese character c, determining its stroke order according to the stroke-order information of the Chinese character to obtain a corresponding stroke sequence; setting a sliding window of size n to extract the subword combinations of the stroke order; adding the boundary symbols "<" and ">" to the head and tail of the stroke sequence of the Chinese character c to obtain a new stroke sequence; sequentially disassembling the sequence from front to back into stroke combinations of n strokes each, and taking the boundary-marked stroke order as a special subword; the subword combination finally contained in the Chinese character c is denoted G(c);
the glyph features of the two-dimensional spatial channel among the character-level morphological features are extracted as follows: for the Chinese character c, obtaining a corresponding glyph image $I_c$ according to the glyph image information, and extracting the glyph features with a CNN network: $\mathrm{CNN}(I_c)$;
the character-level features are obtained by fusing the stroke-order features of the one-dimensional sequence channel and the glyph features of the two-dimensional spatial channel among the character-level morphological features; for the Chinese character c, the stroke order of the one-dimensional sequence channel is characterized by the subword combination G(c), each element g of which has a subword feature vector $\vec{z}_g$, and the two-dimensional spatial channel is characterized by the glyph feature $\mathrm{CNN}(I_c)$; the character-level feature representation $\vec{v}_c$ of the Chinese character c is obtained with the component-combination operation:

$$\vec{v}_c = \Big( \sum_{g \in G(c)} \vec{z}_g \Big) \otimes \mathrm{CNN}(I_c)$$

wherein $\otimes$ is the component-combination operator;
the word-level features are obtained by fusing the character-level features: the character-level vectors of the Chinese characters in each word w are accumulated and summed to obtain the character-composed representation $\vec{m}_w = \frac{1}{N_c} \sum_{c \in w} \vec{v}_c$, which is then component-combined with the word-level representation $\vec{q}_w$ to obtain the word-level feature $\vec{h}_w$, namely:

$$\vec{h}_w = \vec{q}_w \oplus \vec{m}_w$$

wherein $N_c$ is the number of Chinese characters contained in each word and $\oplus$ represents vector addition;
the model is optimized and trained with a pre-crawled Chinese text corpus dataset D of a specified size;
for a word w, a set of λ negative samples is extracted from the distribution P, and the final optimization objective is optimized by maximum likelihood estimation:

$$\mathcal{L} = \sum_{w \in D} \sum_{e \in T(w)} \Big[ \log \sigma\big(s(w, e)\big) + \lambda\, \mathbb{E}_{e' \sim P}\big[ \log \sigma\big(-s(w, e')\big) \big] \Big]$$

wherein s(w, e) represents the similarity function in the skip-word model, w is the center word, e is a window background word of the center word w, T(w) is the set of context-window words of the center word w, λ is the number of negative samples per center word w, e' is a negative-sample noise word obtained by negative sampling, $\mathbb{E}$ is the expectation function term, and σ is the sigmoid function.
2. The morphological double-channel Chinese word embedding method based on strokes and fonts according to claim 1, wherein the preprocessing comprises:
performing word segmentation on the Chinese text;
removing texts whose word count in the segmentation result is smaller than a set value;
and removing stop words to obtain the corresponding word sequence.
3. The morphological double-channel Chinese word embedding method based on strokes and fonts according to claim 1, wherein the stroke-order information and glyph information of the Chinese characters are crawled in advance from open-source dictionary data.
4. The morphological double-channel Chinese word embedding method based on strokes and fonts according to claim 1, wherein the similarity function in the skip-word model is expressed as:

$$s(w, e) = \vec{h}_w \cdot \vec{h}_e$$

wherein $\vec{h}_w$ and $\vec{h}_e$ are the word-level features of word w and word e, respectively.
CN201910881062.0A 2019-09-18 2019-09-18 Morphological double-channel Chinese word embedding method based on strokes and fonts Active CN110610006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910881062.0A CN110610006B (en) 2019-09-18 2019-09-18 Morphological double-channel Chinese word embedding method based on strokes and fonts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910881062.0A CN110610006B (en) 2019-09-18 2019-09-18 Morphological double-channel Chinese word embedding method based on strokes and fonts

Publications (2)

Publication Number Publication Date
CN110610006A CN110610006A (en) 2019-12-24
CN110610006B true CN110610006B (en) 2023-06-20

Family

ID=68892871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910881062.0A Active CN110610006B (en) 2019-09-18 2019-09-18 Morphological double-channel Chinese word embedding method based on strokes and fonts

Country Status (1)

Country Link
CN (1) CN110610006B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476036A (en) * 2020-04-10 2020-07-31 电子科技大学 Word embedding learning method based on Chinese word feature substrings
CN111539437B (en) * 2020-04-27 2022-06-28 西南大学 Detection and identification method of oracle-bone inscription components based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN109408814A (en) * 2018-09-30 2019-03-01 中国地质大学(武汉) Across the language vocabulary representative learning method and system of China and Britain based on paraphrase primitive word
CN109858039A (en) * 2019-03-01 2019-06-07 北京奇艺世纪科技有限公司 A kind of text information identification method and identification device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101408873A (en) * 2007-10-09 2009-04-15 劳英杰 Full scope semantic information integrative cognition system and application thereof
CN109992783B (en) * 2019-04-03 2020-10-30 同济大学 Chinese word vector modeling method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN109408814A (en) * 2018-09-30 2019-03-01 中国地质大学(武汉) Across the language vocabulary representative learning method and system of China and Britain based on paraphrase primitive word
CN109858039A (en) * 2019-03-01 2019-06-07 北京奇艺世纪科技有限公司 A kind of text information identification method and identification device

Also Published As

Publication number Publication date
CN110610006A (en) 2019-12-24

Similar Documents

Publication Publication Date Title
CN108614875B (en) Chinese emotion tendency classification method based on global average pooling convolutional neural network
Cao et al. A joint model for word embedding and word morphology
CN110096698B (en) Topic-considered machine reading understanding model generation method and system
CN109960804B (en) Method and device for generating topic text sentence vector
Fonseca et al. Mac-morpho revisited: Towards robust part-of-speech tagging
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN105068997B (en) The construction method and device of parallel corpora
US10810467B2 (en) Flexible integrating recognition and semantic processing
CN111475622A (en) Text classification method, device, terminal and storage medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
Theeramunkong et al. Non-dictionary-based Thai word segmentation using decision trees
CN110610006B (en) Morphological double-channel Chinese word embedding method based on strokes and fonts
Sifa et al. Towards contradiction detection in german: a translation-driven approach
KR101988165B1 (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN112287240A (en) Case microblog evaluation object extraction method and device based on double-embedded multilayer convolutional neural network
CN113255331B (en) Text error correction method, device and storage medium
CN111159405B (en) Irony detection method based on background knowledge
CN110929022A (en) Text abstract generation method and system
Jindal A deep learning approach for arabic caption generation using roots-words
CN111191413B (en) Method, device and system for automatically marking event core content based on graph sequencing model
CN115878847B (en) Video guiding method, system, equipment and storage medium based on natural language
KR102569381B1 (en) System and Method for Machine Reading Comprehension to Table-centered Web Documents
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN113987120A (en) Public sentiment emotion classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant