CN110610006B - Morphological double-channel Chinese word embedding method based on strokes and fonts - Google Patents

Morphological double-channel Chinese word embedding method based on strokes and fonts

Info

Publication number
CN110610006B
CN110610006B (application CN201910881062.0A)
Authority
CN
China
Prior art keywords: word, Chinese, character, morphological, level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910881062.0A
Other languages
Chinese (zh)
Other versions
CN110610006A (en)
Inventor
陈恩红
刘淇
徐童
童世炜
陶汉卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201910881062.0A priority Critical patent/CN110610006B/en
Publication of CN110610006A publication Critical patent/CN110610006A/en
Application granted granted Critical
Publication of CN110610006B publication Critical patent/CN110610006B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a morphological double-channel Chinese word embedding method based on strokes and fonts, which comprises the following steps: obtaining a Chinese text and obtaining a corresponding word sequence through preprocessing; splitting each word in the word sequence into a plurality of Chinese characters, and modeling the extraction of character-level morphological features, character-level features, and word-level features according to the stroke-order information and glyph image information of the Chinese characters, so as to obtain a word embedding representation suited to the characteristics of Chinese. The method can enhance the word embedding effect and provide technical support for practice in fields such as Chinese natural language processing and text mining.

Description

Morphological double-channel Chinese word embedding method based on strokes and fonts
Technical Field
The invention relates to the field of natural language processing, in particular to a morphological double-channel Chinese word embedding method based on strokes and fonts.
Background
Natural language is a complex system used by humans to express and convey information, and within this system words are the basic units of meaning. Word vectors, as the name implies, are vectors used to represent words, and can also be regarded as feature vectors or representations of words. The technique of mapping words to vectors in the real-number domain is called word embedding. As a cornerstone of natural language tasks, word embedding has been a subject of extensive research.
In recent years, the globalization of information has caused text on the Internet to grow explosively, and the proportion and influence of Chinese text keep increasing; natural language processing methods for Chinese, especially the word embedding methods that underlie these tasks, are therefore receiving more and more attention. Chinese is a language derived from pictographs and has very rich morphological meaning, reflected not only in one-dimensional stroke-order features but also in the glyphs of two-dimensional space. Recent studies have demonstrated that characterizing morphological features aids feature capture in word embedding. Enhancing Chinese word embedding with morphological information has thus become an important issue for Chinese natural language processing tasks.
Currently, word embedding methods for Chinese either migrate methods designed for alphabetic languages (represented by English) or characterize the morphological features of Chinese, such as strokes and glyphs, independently of each other. The former ignores that Chinese is a morphemic language essentially different from alphabetic languages such as English, and therefore performs poorly when applied to Chinese text processing. The latter splits the morphological features apart and cannot effectively capture every dimension of the morphology, and is therefore very limited. How to fully utilize morphological features to enhance Chinese word embedding thus still presents many opportunities and challenges.
Disclosure of Invention
The invention aims to provide a morphological double-channel Chinese word embedding method based on strokes and fonts, which can enhance the word embedding effect and provide technical support for practice in fields such as Chinese natural language processing and text mining.
The aim of the invention is achieved through the following technical scheme:
a morphological double-channel Chinese word embedding method based on strokes and fonts comprises the following steps:
obtaining a Chinese text, and obtaining a corresponding word sequence through preprocessing;
each word in the word sequence is split into a plurality of Chinese characters, and the extraction of character-level morphological features, character-level features, and word-level features is modeled according to the stroke-order information and glyph image information of the Chinese characters, so that a word embedding representation suited to the characteristics of Chinese is obtained.
According to the technical scheme provided by the invention, word embedding modeling is performed on Chinese text with the morphological double-channel Chinese word embedding method based on strokes and fonts. Compared with traditional processing methods, the method can characterize Chinese text as vectors more effectively by means of stroke-order and glyph information, thereby providing richer morphological information, good interpretability, and better downstream feature data for Internet natural language processing tasks. The method has practical application value and can bring potential economic benefits to related text information platforms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for embedding morphological double-channel Chinese words based on strokes and fonts provided by an embodiment of the invention;
FIG. 2 is a formal descriptive diagram of morphological features of Chinese characters according to an embodiment of the present invention;
FIG. 3 is a diagram of a model framework of a method for morphological dual-channel Chinese word embedding based on strokes and fonts, which is provided by an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As shown in FIG. 1, an embodiment of the invention provides a morphological double-channel Chinese word embedding method based on strokes and fonts; FIG. 3 shows the corresponding model framework. The method mainly comprises the following steps:
step 1, acquiring a Chinese text, and obtaining a corresponding word sequence through preprocessing.
In the embodiment of the invention, a specified number of Chinese text corpus data sets (used for model training; the specific number can be chosen according to the actual situation) are crawled in advance from an open-source Chinese corpus; the stroke-order information of Chinese characters is crawled from open-source dictionary data, and comprises the Chinese characters and their stroke sequences (up to 32 strokes); and the glyph image information of the Chinese characters (28×28, 1-bit) is generated.
To ensure the model effect, preprocessing is required before modeling.
1) Performing word segmentation on the Chinese text.
In the embodiment of the invention, the Chinese text needs to be divided into word sequences.
2) Removing texts whose word count in the segmentation result is smaller than a set value.
In the present example, some lower-quality data needs to be removed. Texts containing fewer words than a set value are generally considered to carry only a small amount of information and to be of low quality. The set value here may be, for example, 5.
3) Removing stop words to obtain the corresponding word sequence.
In the present example, certain stop words, such as "you" and "this", also need to be removed. Stop words in text content are generally high-frequency function words whose semantics vary greatly with context; they are unsuited to word embedding tasks and need to be removed. The whole preprocessing pipeline is sketched below.
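By way of illustration only, the preprocessing above might be sketched as follows in Python (the jieba segmenter and the stop-word list are assumptions; the patent does not name a specific segmentation tool):

```python
import jieba  # assumed segmenter; the patent does not prescribe a specific tool

MIN_WORDS = 5                 # the "set value" for dropping short texts
STOPWORDS = {"you", "this"}   # illustrative stop-word list only

def preprocess(texts):
    """Segment each text, drop short texts, and remove stop words."""
    sequences = []
    for text in texts:
        words = [w for w in jieba.lcut(text) if w.strip()]  # 1) segmentation
        if len(words) < MIN_WORDS:                          # 2) drop low-quality texts
            continue
        sequences.append([w for w in words if w not in STOPWORDS])  # 3) stop words
    return sequences
```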
Step 2: splitting each word in the word sequence into a plurality of Chinese characters, and modeling the extraction of character-level morphological features, character-level features, and word-level features according to the stroke-order information and glyph information of the Chinese characters, so as to obtain a word embedding representation suited to the characteristics of Chinese.
A word embedding representation expresses the abstract linguistic and semantic units of human language (words) as real-valued mathematical vectors, such that these vectors reflect the semantic relations among different words as accurately as possible. High-quality word embeddings are the cornerstone of downstream tasks in natural language processing: any natural language task (e.g., text classification) must first convert text from an abstract concept into real-valued vectors a computer can process, and once the real-valued vector representation of the text is obtained, the subsequent deep-learning modeling (each downstream task having its own processing pipeline) can proceed. A text is composed of words, so the representation of the text ultimately depends on superposing the word embedding representations (how they are superposed depends on the word embedding model and algorithm).
It should be noted that the embodiment of the present invention only protects the way the word embedding representation is extracted; the specific natural language task that uses the extracted word embeddings can be determined by those skilled in the art according to their own needs. For example, it may be the text classification task mentioned above, and the subsequent procedures involved may be implemented with reference to conventional techniques.
In fact, Chinese and English differ greatly in both linguistic nature and writing system, especially in the characterization of morphological features, and conventional word embedding methods cannot capture the morphological characteristics of Chinese well. To model the complex morphological structure and characteristics of Chinese words, the morphological double-channel word embedding model (DWE) based on strokes and glyphs is designed. As shown in FIG. 3, the DWE model characterizes features at three granularities: character-level morphological features (stroke order, glyph), character-level features (fusion of stroke-order features and glyph features), and word-level features (fusion of character-level features). Each input word is first split into its Chinese characters; the stroke sequence and glyph image of each character are obtained from the collected stroke-order and glyph information; and the character-level morphological features, character-level features, and word-level features are then extracted step by step.
1. Character-level morphological features.
The character-level morphological features mainly include: the stroke-order features of the one-dimensional sequence channel and the glyph features of the two-dimensional spatial channel.
1) Stroke-order features of the one-dimensional sequence channel
Each Chinese character can be decomposed into a definite, invariant stroke sequence, i.e., its stroke order, and the strokes in this sequence can be combined into morphological components such as components and radicals, analogous to prefixes, suffixes, and roots in English. These particular stroke subsequences, i.e., morphological components, can reflect inherent, shared semantics. As shown in the upper half of FIG. 2, the Chinese character 驾 ("drive") can be decomposed into a sequence of eight strokes, where the last three strokes together correspond to its radical 马 ("horse"), while the first two strokes (力, "force") and the middle three (口, "mouth") depict the original real-world scene of the action "drive" (pulling the reins with force while the mouth issues commands).
In the embodiment of the invention, the stroke-order features of the one-dimensional sequence channel among the character-level morphological features are extracted as follows:
for the Chinese character c, determining its stroke order according to the stroke-order information of Chinese characters to obtain the corresponding stroke sequence;
setting a sliding window of size n to extract the subword combinations of the stroke order;
adding the boundary symbols "<" and ">" to the head and tail of the stroke sequence of the Chinese character c to obtain a new stroke sequence;
sequentially disassembling the sequence from front to back into stroke combinations of n strokes each, and taking the boundary-marked stroke order as a special subword;
the set of subword combinations finally contained in the Chinese character c is denoted G(c).
For example, for the Chinese character 打 ("beat"): if n is 1, the sequence is disassembled stroke by stroke into single-stroke units (horizontal, vertical, left-falling, and so on) together with the boundary symbols "<" and ">"; if n is 2, it is disassembled two strokes at a time; if n is 3, it is disassembled three strokes at a time, and the decomposition contains a radical that expresses a specific meaning: the "hand" radical 扌.
The extraction of the stroke-order features by subword embedding is illustrated with the character 驾 ("drive"), whose stroke order is written here as the eight-stroke sequence $s_1 s_2 \cdots s_8$.

Set a sliding window with n = 3, and add the two special boundary symbols "<" and ">" to the head and tail of the stroke order to obtain the new stroke sequence:

$$\langle\, s_1\ s_2\ s_3\ s_4\ s_5\ s_6\ s_7\ s_8\, \rangle$$

Sequentially disassemble the sequence from front to back into groups of three strokes:

$$\langle s_1 s_2,\quad s_1 s_2 s_3,\quad s_2 s_3 s_4,\quad s_3 s_4 s_5,\quad s_4 s_5 s_6,\quad s_5 s_6 s_7,\quad s_6 s_7 s_8,\quad s_7 s_8 \rangle$$
Meanwhile, the whole boundary-marked stroke order is taken as a special subword, namely:
$$\langle\, s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8\, \rangle$$
Denote the set of subword combinations over all Chinese characters as G. For any Chinese character c, the subword combinations it contains form G(c), and each subword $g \in G(c)$ is assigned a subword feature vector $\vec{z}_g$, whose values are parameters to be optimized; training and optimization take place in the subsequent stage.
It can be understood by those skilled in the art that the boundary symbols, as the name implies, mark the boundaries distinguishing different Chinese characters: "<" and ">" are additionally added. For example, the Chinese character 驾 ("drive") has 8 strokes; "<" is added before the disassembled 8-stroke sequence and ">" at its end, giving 10 symbols in total. When extracting stroke n-grams, a window of size n is scanned from left to right, and letting n take each value in an interval yields a set of stroke n-grams; this set is exactly the set of subword combinations G(c) corresponding to the Chinese character c. "n-gram" is the technical term for such a sub-phrase, a special phrase obtained from the stroke order and used as a feature in the word embedding method; the preceding paragraphs explain in detail how the 3-gram operates and which 3-gram sub-phrases are obtained. A minimal code sketch of this extraction follows.
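A minimal sketch of this subword extraction, assuming each character's strokes are given as a list of symbols and that n ranges over an assumed interval of window sizes:

```python
def stroke_ngrams(strokes, n_values=(3, 4, 5)):
    """Return the subword combination set G(c) for one character.

    strokes:  list of stroke symbols of character c, in stroke order.
    n_values: assumed interval of sliding-window sizes n.
    """
    seq = ["<"] + list(strokes) + [">"]      # add boundary symbols
    grams = set()
    for n in n_values:
        for i in range(len(seq) - n + 1):    # scan from front to back
            grams.add("".join(seq[i:i + n]))
    grams.add("".join(seq))                  # whole bounded order as a special subword
    return grams

# e.g., an 8-stroke character yields the bounded 10-symbol sequence and its
# 3-grams, matching the worked example above:
# stroke_ngrams(["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"], n_values=(3,))
```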
2) Glyph features of the two-dimensional spatial channel
Since Chinese is a morphemic language derived from the oracle bone script (a pictographic writing system), its spatial structure, i.e., the glyph, can also convey rich semantic information. The key reason Chinese characters carry such rich morphological information is that the same strokes can convey different semantics when combined in different ways in two-dimensional space. As shown in the lower half of FIG. 2, the three Chinese characters 人 ("person"), 入 ("enter"), and 八 ("eight") consist of identical stroke sequences, yet have completely different semantics because the spatial combinations of their strokes differ. The glyph features are extracted with a convolutional neural network as follows:
For the Chinese character c, the corresponding glyph image $I_c$ is obtained from the glyph image information, and the glyph features are extracted with a LeNet-style convolutional neural network, yielding the glyph feature vector $\mathrm{CNN}(I_c)$.
the structure of the CNN network and related parameters are given below by way of example.
The CNN network includes: input layer, C1 layer (first convolution layer), S2 layer (first pooling layer), C3 layer (second convolution layer), S4 layer (second pooling layer), F5 layer (fully connected layer), output layer.
The parameters of each layer are as follows:
layer C1: 20 convolution kernels, each convolution kernel size 5x5;
s2 layer: maximum pooling core (MaxPooling), pool core size 2x2;
layer C3: 50 convolution kernels, each convolution kernel 5x5 in size;
s4 layer: maximum pooling core (MaxPooling), pool core size 2x2;
f5 layer: the dimension 500 is output.
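Under the layer parameters listed above (28×28 single-channel input, 5×5 kernels, 2×2 max pooling, 500-dimensional output), the glyph channel could be sketched in PyTorch as follows; the activation functions are an assumption, since the patent lists only the layer shapes:

```python
import torch
import torch.nn as nn

class GlyphCNN(nn.Module):
    """LeNet-style feature extractor for 28x28 single-channel glyph images."""

    def __init__(self, out_dim=500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5),   # C1: 20 kernels 5x5 -> 20x24x24
            nn.ReLU(),                         # activation assumed, not specified
            nn.MaxPool2d(2),                   # S2: 2x2 max pooling -> 20x12x12
            nn.Conv2d(20, 50, kernel_size=5),  # C3: 50 kernels 5x5 -> 50x8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                   # S4: 2x2 max pooling -> 50x4x4
        )
        self.fc = nn.Linear(50 * 4 * 4, out_dim)  # F5: output dimension 500

    def forward(self, glyph):                  # glyph: (batch, 1, 28, 28)
        return self.fc(self.features(glyph).flatten(1))  # CNN(I_c)
```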
2. Character-level features
Morphological information in Chinese consists of two parts: the one-dimensional sequence information represented by the stroke order and the two-dimensional spatial information represented by the glyph; the two need to be combined when representing character-level features.
In the embodiment of the invention, the two are combined by a component-combination operation, as follows:
for Chinese character c, the stroke order of the one-dimensional sequence channel is characterized by a sub-word combination G (c), and each element contains a sub-word feature vector
Figure BDA0002205923780000061
The two-dimensional spatial channel is characterized by a character pattern CNN (I c ) The method comprises the steps of carrying out a first treatment on the surface of the Obtaining character level characteristic representation of Chinese character c by using component combination operation>
Figure BDA0002205923780000062
Figure BDA0002205923780000063
Where x is a component combination operator, there are various choices, such as addition and dot product.
3. Word-level features
The word-level features are obtained by fusing the character-level features: the character-level vectors of the Chinese characters in each word w are accumulated and summed ($N_c$ being the number of Chinese characters contained in the word), giving the character-composed representation

$$\vec{m}_w = \frac{1}{N_c} \sum_{c \in w} \vec{v}_c$$

which is then component-combined with the word-level representation $\vec{q}_w$ to obtain the word-level feature $\vec{h}_w$, namely:

$$\vec{h}_w = \vec{q}_w \oplus \vec{m}_w$$

where $\oplus$ represents vector addition. Through the feature characterization at these three granularities, the representation vector $\vec{h}_w$ of each word w is obtained.
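By way of illustration, a minimal sketch of this bottom-up composition, assuming addition as the component-combination operator ⊗ and equal dimensions for all vectors (the function names are illustrative, not part of the claimed method):

```python
import torch

def character_vector(subword_vecs, glyph_feat):
    """v_c: combine the summed subword vectors z_g with CNN(I_c).

    Addition is one assumed choice of the component-combination operator.
    """
    return torch.stack(subword_vecs).sum(dim=0) + glyph_feat

def word_vector(char_vecs, word_level_vec):
    """h_w: word-level vector q_w (+) average of the N_c character vectors v_c."""
    morph = torch.stack(char_vecs).mean(dim=0)  # accumulate over characters
    return word_level_vec + morph               # (+) is vector addition
```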
In the training stage, the model is optimized and trained with the pre-crawled Chinese text corpus dataset D of the specified size. For a word w, a set of negative samples of size λ (typically 5) is drawn from the distribution P (typically the unigram distribution), and the final optimization objective is optimized by maximum likelihood estimation:

$$\mathcal{L} = \sum_{w \in D} \sum_{e \in T(w)} \Big[ \log \sigma\big(s(w, e)\big) + \lambda\, \mathbb{E}_{e' \sim P}\big[ \log \sigma\big(-s(w, e')\big) \big] \Big]$$

where s(w, e) represents the similarity function in the skip-word (skip-gram) model, w is the center word, e is a window background word of the center word w, T(w) is the set of context-window words of the center word w, λ is the number of negative samples per center word w, e' is a negative-sample noise word obtained by negative sampling, $\mathbb{E}$ is the expectation function term, and σ is the sigmoid function.
In the embodiment of the invention, the similarity function in the skip-word model is expressed as:

$$s(w, e) = \vec{h}_w \cdot \vec{h}_e$$

where $\vec{h}_w$ and $\vec{h}_e$ are the vector representations of word w and word e, respectively.
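A minimal sketch of the corresponding negative-sampling loss for one (center word, background word) pair, approximating the expectation over P by λ sampled noise words as is standard; the names are illustrative:

```python
import torch
import torch.nn.functional as F

def pair_loss(h_center, h_context, h_negatives):
    """Negative of log sigma(s(w,e)) + sum_k log sigma(-s(w,e'_k)).

    h_center:    h_w, shape (dim,).
    h_context:   h_e of one window background word, shape (dim,).
    h_negatives: vectors of lambda noise words drawn from P, shape (lambda, dim).
    """
    pos = F.logsigmoid(torch.dot(h_center, h_context))   # s(w, e) = h_w . h_e
    neg = F.logsigmoid(-(h_negatives @ h_center)).sum()  # -s(w, e') terms
    return -(pos + neg)                                  # minimizing this maximizes the objective
```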
After training, the performance of the model is evaluated with the test task dataset.
It will be appreciated by those skilled in the art that both the training phase and the testing phase proceed in the manner described in steps 1-2 above.
According to the scheme provided by the embodiment of the invention, a Chinese text can be segmented, using a Chinese word segmentation tool together with the stroke-order and glyph features of Chinese characters, and mapped into features of three granularities, namely: the Chinese characters contained in each word, the stroke sequence corresponding to each character, and the glyph image of each character. Word sequences and character sequences are the two features most commonly used in word embedding tasks, while the stroke order and glyph of characters are two very important features of Chinese morphology: they describe the one-dimensional sequential and two-dimensional spatial morphological features of Chinese respectively, and carry more implicit, lower-level Chinese semantic information. The low-level stroke-order and glyph features are fused and then composed upward step by step, so that Chinese morphological features are integrated into the modeling of word embedding, providing richer linguistic features for the word embedding model.
The above description mainly covers the scheme of the invention; the following introduces the technology related to word embedding tasks, to facilitate understanding of the invention.
For word embedding tasks, the goal is to represent each word in the text as a vector of fixed dimension and to make these vectors express the similarity and analogy relations between different words well. For two words x and y, the similarity is defined as the cosine of the angle between their vectorized representations $\vec{x}$ and $\vec{y}$, i.e., the cosine similarity:

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\, \|\vec{y}\|}$$

The numerator of this formula is denoted as the similarity function $s(x, y) = \vec{x} \cdot \vec{y}$. More formally, the task is: given a set of Chinese text data, learn and iteratively update the embedded representation of words under the assumption that words co-occurring in the text data have greater similarity, so that the resulting embeddings achieve better accuracy when applied to similarity and analogy tasks. For example, in a similarity or analogy task, the word vectors of the words to be compared are looked up as in a dictionary; to compare the two words "king" and "queen", their word vectors are retrieved and their cosine similarity is computed.
The skip-word (skip-gram, word2vec) model is used; take as an example the sentence "A lark flies over from the blue sky", denoted S. First, a word segmentation tool splits it into the word sequence T. Take "blue sky" as the center word c and set the background-word window size to 2. The problem is then embodied as: given the center word c, generate the conditional probability of the background words no more than two words away from it. More specifically, each word is represented by two d-dimensional vectors that are used to compute this conditional probability: for a word with index i in the dictionary, its vector is $\vec{v}_i$ when it acts as the center word and $\vec{u}_i$ when it acts as a background word. Let the center word be $w_c$ with vector $\vec{v}_c$ and the background word be $w_e$ with vector $\vec{u}_e$. The conditional probability of generating the background word given the center word is obtained by a softmax operation on the vector inner product:

$$P(w_e \mid w_c) = \frac{\exp(\vec{u}_e^{\top} \vec{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\vec{u}_i^{\top} \vec{v}_c)}$$

where $\mathcal{V}$ is the dictionary index set.
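A small illustrative sketch of this softmax computation, assuming the background-word vectors u_i are stored as the rows of a matrix:

```python
import numpy as np

def skipgram_conditional(v_center, U):
    """P(w_e | w_c) for every dictionary word: softmax of u_i . v_c.

    v_center: center-word vector v_c, shape (d,).
    U:        background-word vectors u_i as rows, shape (|V|, d).
    """
    scores = U @ v_center
    scores -= scores.max()        # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```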
Given a text sequence of length N, let the word at time step t be $w^{(t)}$. Assuming that background words are generated independently given the center word, with background window size m the probability of generating all background words given each center word, i.e., the likelihood function, is:

$$\prod_{t=1}^{N} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P\big(w^{(t+j)} \mid w^{(t)}\big)$$
all time steps less than 1 and greater than N are ignored. The goal is to maximize the likelihood function described above.
However, the events in the above model only consider positive samples: the joint probability above is maximized to 1 when all word vectors are equal and take infinite values, and such word vectors are clearly meaningless. Thus, for each word w, a set of λ negative samples (typically λ = 5) is drawn from the distribution P (typically a unigram distribution); with events involving positive and negative samples assumed mutually independent, let:

$$P\big(w^{(t+j)} \mid w^{(t)}\big) = P\big(D=1 \mid w^{(t)}, w^{(t+j)}\big) \prod_{k=1,\; e_k \sim P}^{\lambda} P\big(D=0 \mid w^{(t)}, e_k\big)$$

where $P(D=1 \mid w^{(t)}, w^{(t+j)}) = \sigma\big(s(w^{(t)}, w^{(t+j)})\big)$ is the probability that the background word $w^{(t+j)}$ appears in the window of the center word $w^{(t)}$, with the sigmoid function

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

Rewriting the joint probability, which considers the positive samples to the maximum degree, as a log-likelihood function yields the objective to be optimized:

$$\sum_{t=1}^{N} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \Big[ \log \sigma\big(s(w^{(t)}, w^{(t+j)})\big) + \sum_{k=1,\; e_k \sim P}^{\lambda} \log \sigma\big(-s(w^{(t)}, e_k)\big) \Big]$$
where the second term $\sum_{k=1,\; e_k \sim P}^{\lambda} \log \sigma\big(-s(w^{(t)}, e_k)\big)$ is equivalent to the expectation term $\lambda\, \mathbb{E}_{e' \sim P}\big[\log \sigma\big(-s(w^{(t)}, e')\big)\big]$.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. A morphological double-channel Chinese word embedding method based on strokes and fonts, characterized by comprising the following steps:
obtaining a Chinese text, and obtaining a corresponding word sequence through preprocessing;
splitting each word in the word sequence into a plurality of Chinese characters, and modeling the extraction of character-level morphological features, character-level features, and word-level features according to the stroke-order information and glyph image information of the Chinese characters, so as to obtain a word embedding representation suited to the characteristics of Chinese;
the character-level morphological features include: the stroke-order features of the one-dimensional sequence channel and the glyph features of the two-dimensional spatial channel;
the stroke-order features of the one-dimensional sequence channel among the character-level morphological features are extracted as follows: for the Chinese character c, determining its stroke order according to the stroke-order information of the Chinese character to obtain a corresponding stroke sequence; setting a sliding window of size n to extract the subword combinations of the stroke order; adding the boundary symbols "<" and ">" to the head and tail of the stroke sequence of the Chinese character c to obtain a new stroke sequence; sequentially disassembling the sequence from front to back into stroke combinations of n strokes each, and taking the boundary-marked stroke order as a special subword; the subword combination finally contained in the Chinese character c is denoted G(c);
the glyph features of the two-dimensional spatial channel among the character-level morphological features are extracted as follows: for the Chinese character c, obtaining a corresponding glyph image $I_c$ according to the glyph image information, and extracting the glyph features with a CNN network: $\mathrm{CNN}(I_c)$;
the character-level features are obtained by fusing the stroke-order features of the one-dimensional sequence channel and the glyph features of the two-dimensional spatial channel among the character-level morphological features; for the Chinese character c, the stroke order of the one-dimensional sequence channel is characterized by the subword combination G(c), each element g of which has a subword feature vector $\vec{z}_g$, and the two-dimensional spatial channel is characterized by the glyph feature $\mathrm{CNN}(I_c)$; the character-level feature representation $\vec{v}_c$ of the Chinese character c is obtained with the component-combination operation:

$$\vec{v}_c = \Big( \sum_{g \in G(c)} \vec{z}_g \Big) \otimes \mathrm{CNN}(I_c)$$

wherein $\otimes$ is the component-combination operator;
the word-level features are obtained by fusing the character-level features: the character-level vectors of the Chinese characters in each word w are accumulated and summed to obtain the character-composed representation $\vec{m}_w = \frac{1}{N_c} \sum_{c \in w} \vec{v}_c$, which is then component-combined with the word-level representation $\vec{q}_w$ to obtain the word-level feature $\vec{h}_w$, namely:

$$\vec{h}_w = \vec{q}_w \oplus \vec{m}_w$$

wherein $N_c$ is the number of Chinese characters contained in each word and $\oplus$ represents vector addition;
the model is optimized and trained with a pre-crawled Chinese text corpus dataset D of a specified size;
for a word w, a set of λ negative samples is extracted from the distribution P, and the final optimization objective is optimized by maximum likelihood estimation:

$$\mathcal{L} = \sum_{w \in D} \sum_{e \in T(w)} \Big[ \log \sigma\big(s(w, e)\big) + \lambda\, \mathbb{E}_{e' \sim P}\big[ \log \sigma\big(-s(w, e')\big) \big] \Big]$$

wherein s(w, e) represents the similarity function in the skip-word model, w is the center word, e is a window background word of the center word w, T(w) is the set of context-window words of the center word w, λ is the number of negative samples per center word w, e' is a negative-sample noise word obtained by negative sampling, $\mathbb{E}$ is the expectation function term, and σ is the sigmoid function.
2. The morphological double-channel Chinese word embedding method based on strokes and fonts according to claim 1, wherein the preprocessing comprises:
performing word segmentation on the Chinese text;
removing texts whose word count in the segmentation result is smaller than a set value;
and removing stop words to obtain the corresponding word sequence.
3. The morphological double-channel Chinese word embedding method based on strokes and fonts according to claim 1, wherein the stroke-order information and glyph information of the Chinese characters are crawled in advance from open-source dictionary data.
4. The morphological double-channel Chinese word embedding method based on strokes and fonts according to claim 1, wherein the similarity function in the skip-word model is expressed as:

$$s(w, e) = \vec{h}_w \cdot \vec{h}_e$$

wherein $\vec{h}_w$ and $\vec{h}_e$ are the word-level features of word w and word e, respectively.
CN201910881062.0A 2019-09-18 2019-09-18 Morphological double-channel Chinese word embedding method based on strokes and fonts Active CN110610006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910881062.0A CN110610006B (en) 2019-09-18 2019-09-18 Morphological double-channel Chinese word embedding method based on strokes and fonts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910881062.0A CN110610006B (en) 2019-09-18 2019-09-18 Morphological double-channel Chinese word embedding method based on strokes and fonts

Publications (2)

Publication Number Publication Date
CN110610006A CN110610006A (en) 2019-12-24
CN110610006B true CN110610006B (en) 2023-06-20

Family

ID=68892871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910881062.0A Active CN110610006B (en) 2019-09-18 2019-09-18 Morphological double-channel Chinese word embedding method based on strokes and fonts

Country Status (1)

Country Link
CN (1) CN110610006B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476036A (en) * 2020-04-10 2020-07-31 电子科技大学 Word embedding learning method based on Chinese word feature substrings
CN111539437B (en) * 2020-04-27 2022-06-28 西南大学 Detection and identification method of oracle-bone inscription components based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN109408814A (en) * 2018-09-30 2019-03-01 中国地质大学(武汉) Across the language vocabulary representative learning method and system of China and Britain based on paraphrase primitive word
CN109858039A (en) * 2019-03-01 2019-06-07 北京奇艺世纪科技有限公司 A kind of text information identification method and identification device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101408873A (en) * 2007-10-09 2009-04-15 劳英杰 Full scope semantic information integrative cognition system and application thereof
CN109992783B (en) * 2019-04-03 2020-10-30 同济大学 Chinese word vector modeling method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN109408814A (en) * 2018-09-30 2019-03-01 中国地质大学(武汉) Across the language vocabulary representative learning method and system of China and Britain based on paraphrase primitive word
CN109858039A (en) * 2019-03-01 2019-06-07 北京奇艺世纪科技有限公司 A kind of text information identification method and identification device

Also Published As

Publication number Publication date
CN110610006A (en) 2019-12-24

Similar Documents

Publication Publication Date Title
CN108614875B (en) Chinese emotion tendency classification method based on global average pooling convolutional neural network
Cao et al. A joint model for word embedding and word morphology
CN110096698B (en) Topic-considered machine reading understanding model generation method and system
CN109960804B (en) Method and device for generating topic text sentence vector
Fonseca et al. Mac-morpho revisited: Towards robust part-of-speech tagging
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN105068997B (en) The construction method and device of parallel corpora
US10810467B2 (en) Flexible integrating recognition and semantic processing
CN111475622A (en) Text classification method, device, terminal and storage medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
Theeramunkong et al. Non-dictionary-based Thai word segmentation using decision trees
CN110610006B (en) Morphological double-channel Chinese word embedding method based on strokes and fonts
Sifa et al. Towards contradiction detection in german: a translation-driven approach
KR101988165B1 (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN112287240A (en) Case microblog evaluation object extraction method and device based on double-embedded multilayer convolutional neural network
CN113255331B (en) Text error correction method, device and storage medium
CN111159405B (en) Irony detection method based on background knowledge
CN110929022A (en) Text abstract generation method and system
Jindal A deep learning approach for arabic caption generation using roots-words
CN111191413B (en) Method, device and system for automatically marking event core content based on graph sequencing model
CN115878847B (en) Video guiding method, system, equipment and storage medium based on natural language
KR102569381B1 (en) System and Method for Machine Reading Comprehension to Table-centered Web Documents
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN113987120A (en) Public sentiment emotion classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant