CN110610006B - Morphological double-channel Chinese word embedding method based on strokes and fonts - Google Patents
- Publication number
- CN110610006B CN110610006B CN201910881062.0A CN201910881062A CN110610006B CN 110610006 B CN110610006 B CN 110610006B CN 201910881062 A CN201910881062 A CN 201910881062A CN 110610006 B CN110610006 B CN 110610006B
- Authority
- CN
- China
- Prior art keywords
- word
- chinese
- character
- morphological
- level
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a morphological dual-channel Chinese word embedding method based on strokes and fonts, comprising the following steps: obtaining a Chinese text and, through preprocessing, obtaining a corresponding word sequence; splitting each word in the word sequence into its constituent Chinese characters and, from the stroke-order information and font picture information of those characters, modeling the extraction of character-level morphological features, character-level features, and word-level features, so as to obtain a word embedding representation suited to the characteristics of Chinese. The method enhances the word embedding effect and provides technical support for practice in fields such as Chinese natural language processing and text mining.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a morphological double-channel Chinese word embedding method based on strokes and fonts.
Background
Natural language is a complex system used by humans to express and convey information, and within this system, words are the basic units of meaning. Word vectors, as the name implies, are vectors used to represent words, and can also be regarded as feature vectors or representations of words. The technique of mapping words to vectors of real numbers is also called word embedding. As a cornerstone of natural language tasks, word embedding has been a subject of extensive research.
In recent years, globalization of information causes text information on the internet to show explosive growth, wherein the proportion and influence of Chinese text are increasing, and natural language processing methods for Chinese, especially word embedding methods as task bases are receiving more and more attention. Chinese is a language derived from pictographs, has very rich morphological meanings, and is reflected not only in one-dimensional stroke order characteristics, but also in fonts in two-dimensional space. Recent studies have demonstrated that characterizing morphological features aids in feature capture for word embedding. Therefore, enhancing the effect of Chinese word embedding by using morphological information becomes an important issue for Chinese natural language processing tasks.
Currently, word embedding methods for Chinese either transplant methods designed for alphabetic languages represented by English, or characterize individual morphological features of Chinese, such as strokes and fonts, in isolation. The former ignores that Chinese is a morpheme-based language essentially different from alphabetic languages such as English, and therefore performs poorly when applied to Chinese text processing. The latter splits the morphological features apart and cannot effectively capture every dimension of the morphology, and is therefore very limited. How to make full use of morphological features to enhance Chinese word embedding thus still presents many opportunities and challenges.
Disclosure of Invention
The invention aims to provide a morphological double-channel Chinese word embedding method based on strokes and fonts, which can enhance the word embedding effect and provides a certain technical support for the practice in the fields of Chinese natural language processing, text mining and the like.
The invention aims at realizing the following technical scheme:
a morphological double-channel Chinese word embedding method based on strokes and fonts comprises the following steps:
obtaining a Chinese text, and obtaining a corresponding word sequence through preprocessing;
each word in the word sequence is split into a plurality of Chinese characters, and, according to the stroke-order information and the font picture information of the Chinese characters, the extraction of character-level morphological features, character-level features, and word-level features is modeled, so as to obtain a word embedding representation suited to the characteristics of Chinese.
According to the technical scheme provided by the invention, word embedding modeling is performed on Chinese text by the morphological dual-channel Chinese word embedding method based on strokes and fonts. Compared with traditional processing methods, the method can characterize Chinese text as vectors more effectively by means of stroke-order and font information, thereby providing richer morphological information, good interpretability, and better downstream feature data for Internet natural language processing tasks. The method has practical application value and can bring potential economic benefits to related text information platforms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for embedding morphological double-channel Chinese words based on strokes and fonts provided by an embodiment of the invention;
FIG. 2 is a formal descriptive diagram of morphological features of Chinese characters according to an embodiment of the present invention;
FIG. 3 is a diagram of a model framework of a method for morphological dual-channel Chinese word embedding based on strokes and fonts, which is provided by an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As shown in FIG. 1, a morphological dual-channel Chinese word embedding method based on strokes and fonts is provided by the embodiment of the invention; FIG. 3 shows the corresponding model framework. The method mainly comprises the following steps:
Step 1: obtaining a Chinese text and, through preprocessing, obtaining a corresponding word sequence. In the embodiment of the invention, a specified number of Chinese text corpus data sets (used for model training; the specific amount can be chosen according to actual conditions) are crawled in advance from open-source Chinese corpora; the stroke-order information of Chinese characters is crawled from open-source dictionary data, comprising each Chinese character together with its stroke order (up to 32 strokes); and font image information (28×28, 1-bit) of the Chinese characters is generated.
To ensure the model effect, preprocessing is required before modeling.
1) And performing word segmentation processing on the Chinese text.
In the embodiment of the invention, the Chinese text needs to be divided into word sequences.
2) And removing the text with the word number smaller than the set value in the word segmentation result.
In the embodiment of the invention, some lower-quality data need to be removed. Texts containing fewer words than a set threshold are generally considered to carry little information and to be of low quality. The threshold may be, for example, 5.
3) And removing the stop word to obtain a corresponding word sequence.
In the embodiment of the invention, certain stop words, such as "you" and "this", need to be removed. Stop words in text are generally high-frequency function words whose semantics vary greatly with context; they are unsuitable for the word embedding task and need to be removed.
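The three preprocessing steps above can be sketched as follows. The `segment` stub, the `MIN_WORDS` threshold, and the stop-word list are illustrative placeholders rather than part of the invention; a real implementation would use a Chinese word segmentation tool in place of the whitespace splitter.

```python
MIN_WORDS = 5  # texts with fewer words than this are treated as low quality
STOP_WORDS = {"you", "this"}  # illustrative stop-word list

def segment(text):
    # Placeholder for a real Chinese word segmenter; splits on whitespace here.
    return text.split()

def preprocess(texts):
    sequences = []
    for text in texts:
        words = segment(text)                 # step 1: word segmentation
        if len(words) < MIN_WORDS:            # step 2: drop short texts
            continue
        words = [w for w in words if w not in STOP_WORDS]  # step 3: stop words
        sequences.append(words)
    return sequences
```

Note that the short-text filter is applied before stop-word removal, so the length check reflects the raw segmentation result.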
Step 2: splitting each word in the word sequence into a plurality of Chinese characters and, according to the stroke-order information and the font information of the Chinese characters, modeling the extraction of character-level morphological features, character-level features, and word-level features, so as to obtain a word embedding representation suited to the characteristics of Chinese.
A word embedding representation expresses the abstract language concepts and semantic units (words) of human society as mathematical real-valued vectors, such that these vectors reflect the semantic links among different words as accurately as possible. High-quality word embedding representations are a cornerstone of downstream tasks in natural language processing: any natural language task (such as text classification) must first convert the text from an abstract concept into mathematical real-valued vectors a computer can process, and only after the real-valued vector representation of the text is obtained can the subsequent deep-learning modeling proceed (each downstream natural language task corresponding to its own processing procedure). Since a text is composed of words, its representation ultimately depends on superposing the word embedding representations (how they are superposed, and in what way, corresponds to the word embedding model and algorithm).
It should be noted that, the embodiment of the present invention only protects the word embedding expression extraction mode, and the specific natural language task using the extracted word embedding expression can be determined by those skilled in the art according to their own needs. For example, it may be the text classification task mentioned above, and the subsequent procedures involved may also be implemented with reference to conventional techniques.
In fact, Chinese and English differ greatly both in language nature and in writing system, especially in the characterization of morphological features, and conventional word embedding methods cannot capture the morphological characteristics of Chinese well. To model the complex morphological structure and characteristics of Chinese words, the morphological dual-channel Chinese word embedding model (DWE) based on strokes and fonts is designed. As shown in FIG. 3, the DWE model characterizes three granularities: character-level morphological features (stroke order, font), character-level features (fusion of stroke-order features and font features), and word-level features (fusion of character-level features). For each input word, the word is first split into its Chinese characters, the stroke sequence and font picture of each character are obtained from the collected stroke-order and font information, and then the character-level morphological features, character-level features, and word-level features are extracted step by step.
1. Character-level morphological features.
The character-level morphological features mainly include: the stroke-order feature of the one-dimensional sequence channel and the font feature of the two-dimensional space channel.
1) Stroke order feature of one-dimensional sequence channel
Each Chinese character can be decomposed into a definite, fixed sequence of strokes — its stroke order — and the strokes in this sequence can be combined into morphological components such as components and radicals, which are analogous to English prefixes, suffixes, and roots. These particular stroke sequences, or morphological components, can reflect inherent, shared semantics. As shown in the upper half of FIG. 3, the Chinese character 驾 ("drive") can be decomposed into a sequence of eight strokes, where the last three strokes together correspond to its radical 马 ("horse"), while the first two (力, "force") and the middle three (口, "mouth") depict the original real-world scene of the action "drive" (pulling the reins with force while the mouth issues commands).
In the embodiment of the invention, the stroke-order feature of the one-dimensional sequence channel in the character-level morphological features is extracted mainly as follows:
for a Chinese character c, determine its stroke order from the collected stroke-order information, obtaining the corresponding stroke sequence;
set a sliding window of size n to extract the subword combinations of the stroke order;
add the boundary symbols "<" and ">" to the head and tail of the stroke sequence of c, obtaining a new stroke sequence;
decompose it sequentially from front to back into combinations of n strokes each, and additionally take the boundary-marked stroke order as a special subword;
the subword combinations contained in character c are finally recorded as G(c).
For example, for the Chinese character 打 ("beat"): if n = 1, the order is decomposed one stroke at a time, yielding single-stroke units such as the horizontal and the vertical, together with the boundary symbols "<" and ">"; if n = 2, it is decomposed two strokes at a time into two-stroke combinations, including the boundary-marked head and tail combinations; if n = 3, it is decomposed three strokes at a time, yielding combinations that include radicals able to express specific semantics, such as 扌 (the hand radical).
The stroke-order feature is extracted in this subword-embedding manner. Taking the character 驾 ("drive"), whose stroke order is the eight-stroke sequence s1 s2 … s8, as an example:
Set a sliding window with n = 3, and add the two special boundary symbols "<" and ">" to the head and tail of the stroke order, obtaining the new stroke sequence:
< s1 s2 s3 s4 s5 s6 s7 s8 >
Decompose it sequentially from front to back into combinations of three strokes each:
(< s1 s2), (s1 s2 s3), (s2 s3 s4), …, (s6 s7 s8), (s7 s8 >)
Meanwhile, take the boundary-marked stroke order itself as a special subword, namely:
< s1 s2 s3 s4 s5 s6 s7 s8 >
Record the set of subword combinations over all Chinese characters as G. For any Chinese character c, the subword combinations it contains are G(c), and each subword g ∈ G(c) is assigned a subword feature vector z_g; its values are parameters to be optimized, with training and optimization performed in the subsequent stage.
As those skilled in the art will understand, the boundary symbols — the additionally added "<" and ">" — mark, as the name implies, the boundaries that distinguish different Chinese characters. For example, the character 驾 ("drive") has 8 strokes; prepending "<" to its decomposed 8-stroke sequence and appending ">" at the end gives 10 symbols in total. When extracting stroke n-grams, a window of size n scans from left to right; letting n take different values over an interval yields a set of stroke n-grams, and this set is exactly the set of subword combinations G(c) corresponding to character c. "n-gram" here is the technical term for such a subword: a special stroke subsequence obtained from the stroke order and used as a feature in the word embedding method. The preceding paragraph details how the 3-gram operates and how the 3-gram subwords are obtained.
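The subword extraction just described can be sketched as follows. The function name is illustrative, and one symbol per stroke is an assumed encoding of the stroke sequence; the bounded full sequence is kept as the special subword.

```python
def stroke_ngrams(strokes, n_values=(3,)):
    """Extract the stroke-order subword set G(c) for one character.

    `strokes` is the character's stroke sequence, one symbol per stroke.
    Boundary symbols "<" and ">" are added to the head and tail before
    scanning, and the whole bounded sequence is kept as a special subword.
    """
    seq = ["<"] + list(strokes) + [">"]
    grams = set()
    for n in n_values:
        for i in range(len(seq) - n + 1):   # slide the window of size n
            grams.add("".join(seq[i:i + n]))
    grams.add("".join(seq))  # the bounded stroke order as a special subword
    return grams
```

For an 8-stroke character with n = 3, the bounded 10-symbol sequence yields eight 3-grams plus the special subword, i.e. nine subwords in G(c).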
2) Font feature of the two-dimensional space channel
Since Chinese is a morpheme language derived from the oracle-bone script (a pictographic script), its spatial structure — the font, or glyph — also conveys rich semantic information. The key reason Chinese characters are so rich in form information is that the same strokes, combined in different ways in two-dimensional space, can convey different semantic information. As shown in the lower half of FIG. 3, the three Chinese characters 人 ("human"), 入 ("enter"), and 八 ("eight") have identical stroke sequences, yet completely different semantics, because the spatial combinations of their strokes differ. The font features are extracted with a convolutional neural network as follows:
for a Chinese character c, the corresponding font image I_c is obtained according to the font image information, and the font feature CNN(I_c) is extracted with a LeNet convolutional neural network.
the structure of the CNN network and related parameters are given below by way of example.
The CNN network comprises: an input layer, a C1 layer (first convolutional layer), an S2 layer (first pooling layer), a C3 layer (second convolutional layer), an S4 layer (second pooling layer), an F5 layer (fully connected layer), and an output layer.
The parameters of each layer are as follows:
C1 layer: 20 convolution kernels, each of size 5×5;
S2 layer: max pooling (MaxPooling), pooling kernel size 2×2;
C3 layer: 50 convolution kernels, each of size 5×5;
S4 layer: max pooling (MaxPooling), pooling kernel size 2×2;
F5 layer: output dimension 500.
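Under the classic LeNet conventions of valid convolutions with stride 1 and non-overlapping 2×2 pooling (an assumption, since the patent does not state padding or strides), the layer parameters above fully determine the feature-map sizes for a 28×28 input. A small sketch verifying them:

```python
def conv_out(size, kernel):
    # valid convolution, stride 1
    return size - kernel + 1

def pool_out(size, pool):
    # non-overlapping max pooling
    return size // pool

def lenet_shapes(h=28, w=28):
    h, w = conv_out(h, 5), conv_out(w, 5)   # C1: 28x28 -> 24x24 (x20 maps)
    h, w = pool_out(h, 2), pool_out(w, 2)   # S2: 24x24 -> 12x12
    h, w = conv_out(h, 5), conv_out(w, 5)   # C3: 12x12 -> 8x8 (x50 maps)
    h, w = pool_out(h, 2), pool_out(w, 2)   # S4: 8x8 -> 4x4
    flat = h * w * 50                        # 800 features feed F5
    return flat, 500                         # F5 maps 800 -> 500
```

So the font feature vector CNN(I_c) produced by F5 has dimension 500 under these assumptions.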
2. Character-level features
Morphological information in Chinese consists of two parts: the one-dimensional sequence information represented by the stroke order and the two-dimensional spatial information represented by the font. The two must be combined when representing character-level features.
In the embodiment of the invention, the two are combined with a component-composition operation, as follows:
for a Chinese character c, the stroke order of the one-dimensional sequence channel is characterized by the subword combination G(c), each element g of which has a subword feature vector z_g, and the two-dimensional space channel is characterized by the font feature CNN(I_c). The character-level feature representation v_c of c is obtained with the component-composition operation (aggregating the stroke-order channel as the sum of its subword feature vectors):
v_c = (Σ_{g∈G(c)} z_g) ⊗ CNN(I_c)
where ⊗ is the component-composition operator, for which there are various choices, such as addition and element-wise product.
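A minimal sketch of this component combination, with plain Python lists standing in for the subword vectors z_g and the font feature CNN(I_c); summing the subword vectors before composing, and the function names, are illustrative choices:

```python
def compose(stroke_vec, glyph_vec, op="add"):
    """Component-composition operator: fuse the aggregated stroke-order
    feature with the font (CNN) feature of one character."""
    if op == "add":
        return [a + b for a, b in zip(stroke_vec, glyph_vec)]
    if op == "mul":  # element-wise (Hadamard) product
        return [a * b for a, b in zip(stroke_vec, glyph_vec)]
    raise ValueError(op)

def char_feature(subword_vecs, glyph_vec, op="add"):
    # sum the subword feature vectors z_g over g in G(c), then compose
    stroke_vec = [sum(dims) for dims in zip(*subword_vecs)]
    return compose(stroke_vec, glyph_vec, op)
```

Both operator choices keep the character-level feature in the same dimension as its two inputs, which is what allows the upward fusion into word-level features.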
3. Word-level features
The word-level features are obtained by fusing the character-level features: the character-level feature representations v_c of the Chinese characters in each word are accumulated and summed (N_c being the number of Chinese characters contained in the word), giving the character-fusion representation of the word; this is then composed with the word's own vector representation v_w to obtain the word-level feature, namely:
w̃ = ((1/N_c) Σ_{c∈w} v_c) ⊕ v_w
where ⊕ represents vector addition. Through the feature characterization at these three granularities, the characterization vector w̃ of each word is obtained.
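The word-level fusion can be sketched in the same spirit; averaging over the characters and composing by vector addition follow the description above, while the function name and list-based vectors are illustrative:

```python
def word_feature(char_vecs, word_vec):
    """Average the character-level features over the N_c characters of a
    word, then compose with the word's own vector by addition."""
    n = len(char_vecs)
    mean = [sum(dims) / n for dims in zip(*char_vecs)]
    return [m + w for m, w in zip(mean, word_vec)]
```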
In the training stage, the model is optimized and trained on the pre-crawled Chinese text corpus data set D of the specified size;
for each center word w, a negative sample set of size λ (typically 5) is drawn from the noise distribution P (typically the unigram distribution), and the final objective is optimized by maximum likelihood estimation:
L = Σ_{w∈D} Σ_{e∈T(w)} [ log σ(s(w, e)) + λ · E_{e′∼P} log σ(−s(w, e′)) ]
where s(w, e) represents the similarity function in the skip-gram model, w is the center word, e is a window background word of w, T(w) is the set of context-window words of w, λ is the number of negative samples per center word, e′ is a negative-sample noise word obtained by negative sampling, E_{e′∼P}[·] is the expectation term, and σ is the sigmoid function.
In the embodiment of the invention, the similarity function in the skip-gram model takes the inner-product form s(w, e) = w̃ · v_e, where w̃ is the characterization vector of the center word w obtained from the three granularities of features above and v_e is the vector of the background word e.
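The per-pair negative-sampling term log σ(s(w, e)) + Σ log σ(−s(w, e′)) can be sketched with plain Python lists standing in for vectors (the function names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    # inner-product similarity s(w, e)
    return sum(x * y for x, y in zip(a, b))

def pair_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling log-likelihood for one (center, context) pair:
    log sigma(s(w, e)) + sum over negatives of log sigma(-s(w, e'))."""
    loss = math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss += math.log(sigmoid(-dot(center_vec, neg)))
    return loss
```

Summing `pair_loss` over all (center, background) pairs in the corpus and maximizing it (e.g. by stochastic gradient ascent) yields the training objective above.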
And after training, evaluating the performance of the model by using the test task data set.
It will be appreciated by those skilled in the art that both the training phase and the testing phase are performed in the manner described above with reference to steps 1-2.
According to the scheme provided by the embodiment of the invention, a Chinese text can be segmented and mapped into features at three granularities, using a Chinese word segmentation tool together with the stroke-order and font characteristics of Chinese characters: the character sequence contained in each word, the stroke-order sequence corresponding to each character, and the font picture of each character. The word sequence and the character sequence are the two features most commonly used in word embedding tasks, while the stroke order and the font of characters are two very important features of Chinese morphology: they describe, respectively, the one-dimensional sequential and two-dimensional spatial morphological features of Chinese, and carry more implicit, lower-level Chinese semantic information. The low-level stroke-order and font features are fused and then combined upward step by step, so that Chinese morphological features are incorporated into the modeling process of word embedding, providing richer linguistic features for the word embedding model.
According to the technical scheme provided by the invention, word embedding modeling is performed on Chinese text by the morphological dual-channel Chinese word embedding method based on strokes and fonts. Compared with traditional processing methods, the method can characterize Chinese text as vectors more effectively by means of stroke-order and font information, thereby providing richer morphological information, good interpretability, and better downstream feature data for Internet natural language processing tasks. The method has practical application value and can bring potential economic benefits to related text information platforms.
The above description is mainly directed to related schemes of the invention, and the following description is directed to related technologies of word embedding tasks so as to facilitate understanding of the invention.
For the word embedding task, the goal is to represent each word in a text as a vector of fixed dimension and to make these vectors express the similarity and analogy relationships between different words as well as possible. For two words x and y, similarity is defined as the cosine of the angle between their vectorized representations x and y, i.e., the cosine similarity:
cos(x, y) = (x · y) / (‖x‖ ‖y‖)
whose numerator is denoted the similarity function s(x, y) = x · y. More formally, the task is: given a set of Chinese text data, learn and iteratively update the embedded representations of words under the assumption that words co-occurring in the text have greater similarity, so that the resulting embedded representations achieve good accuracy when applied to similarity and analogy tasks. For example, to compare the two words "king" and "queen" in a similarity or analogy task, look up the word vectors of the two words in the dictionary and compute their cosine similarity.
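The cosine similarity just defined, as a small self-contained sketch with list-based vectors:

```python
import math

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x|| ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```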
The formulation of the skip-gram model is used here. Take a text as an example, denoted S: "a lark flies over from the blue sky". First, a word segmentation tool segments it into the word sequence T. With "blue sky" as the center word c, set the background-word window size to 2. The problem is then: given the center word c, generate the conditional probability of each background word at most two words away from it. More specifically, each word is represented as two d-dimensional vectors, which are used to compute the conditional probability. Suppose a word has index i in the dictionary; its vector when acting as a center word is v_i, and its vector when acting as a background word is u_i. Let the center word be w_c with vector v_c, and the background word be w_e with vector u_e. The conditional probability of generating a background word given the center word is obtained by a softmax operation over the vector inner products:
P(w_e | w_c) = exp(u_e · v_c) / Σ_{i∈V} exp(u_i · v_c)
where V is the dictionary index set. Given a text sequence of length N, let the word at time step t be w^(t). Assuming the background words are generated independently given the center word, with window size m the probability of generating all background words given each center word — that is, the likelihood function — is:
Π_{t=1}^{N} Π_{−m≤j≤m, j≠0} P(w^(t+j) | w^(t))
all time steps less than 1 and greater than N are ignored. The goal is to maximize the likelihood function described above.
However, the events contained in the above model consider only positive samples. As a result, the joint probability above is maximized — driven to 1 — only when all word vectors are equal and of infinite norm; such word vectors are clearly meaningless. Therefore, for each word w, a negative sample set T(w) of size λ (typically 5) is drawn from the distribution P (typically the unigram distribution); the events containing the positive and negative samples are assumed mutually independent, and we let:
P(w^(t+j) | w^(t)) = P(D=1 | w^(t), w^(t+j)) · Π_{k=1}^{λ} P(D=0 | w^(t), e′_k), with e′_k ∼ P(e′)
where P(D=1 | w^(t), w^(t+j)) = σ(s(w^(t), w^(t+j))) represents the probability that the background word w^(t+j) appears in the context of the center word w^(t), and σ is the sigmoid function:
σ(x) = 1 / (1 + e^(−x))
Rewriting the joint probability that maximally accounts for the positive samples into a log-likelihood function gives the objective function to be optimized:
Σ_{t=1}^{N} Σ_{−m≤j≤m, j≠0} [ log σ(s(w^(t), w^(t+j))) + Σ_{k=1}^{λ} log σ(−s(w^(t), e′_k)) ]
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (4)
1. A morphological double-channel Chinese word embedding method based on strokes and fonts is characterized by comprising the following steps:
obtaining a Chinese text, and obtaining a corresponding word sequence through preprocessing;
splitting each word in the word sequence into a plurality of Chinese characters, and, according to the stroke-order information and the font picture information of the Chinese characters, modeling the extraction of character-level morphological features, character-level features, and word-level features, so as to obtain a word embedding representation suited to the characteristics of Chinese;
the character-level morphological features include: a stroke-order feature from a one-dimensional sequence channel and a glyph feature from a two-dimensional spatial channel;
the stroke-order feature of the one-dimensional sequence channel in the character-level morphological features is extracted as follows: for a Chinese character c, determining its stroke order from the stroke-order information of Chinese characters to obtain the corresponding stroke sequence; setting a sliding window of size n to extract subword combinations of the stroke order; adding boundary symbols < and > to the head and tail of the stroke sequence of the Chinese character c to obtain a new stroke sequence; splitting out a plurality of stroke combinations from front to back, n strokes at a time, and taking the boundary-marked stroke sequence itself as a special subword; the set of subword combinations contained in the Chinese character c is finally denoted G(c);
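For illustration only (the stroke codes and helper name are assumptions, not part of the claim), the sliding-window subword extraction G(c) can be sketched as:

```python
def stroke_ngrams(strokes, n=3):
    """Extract the size-n stroke subword combinations G(c) of a character.

    `strokes` is the character's stroke sequence (the numeric stroke codes
    are illustrative, e.g. 1 = horizontal, 2 = vertical, ...). Boundary
    symbols '<' and '>' are added at head and tail, windows of n strokes
    are split out from front to back, and the full bounded sequence is
    kept as a special subword.
    """
    seq = ['<'] + [str(s) for s in strokes] + ['>']
    grams = [''.join(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    grams.append(''.join(seq))  # bounded stroke order as a special subword
    return grams
```

For a four-stroke character with window n = 3, this yields `<12`, `123`, `234`, `34>`, plus the special subword `<1234>`.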
the glyph feature of the two-dimensional spatial channel in the character-level morphological features is extracted as follows: for a Chinese character c, obtaining the corresponding glyph image I_c from the glyph-image information, and extracting the glyph feature CNN(I_c) with a CNN network;
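The claim specifies a CNN over the glyph image but not its architecture; a toy single-layer sketch (pure Python, one convolution plus global max pooling, all shapes and names illustrative) conveys the idea:

```python
def conv2d_valid(img, kernel):
    """Valid-mode 2-D cross-correlation of a single-channel image."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [[sum(img[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

def global_max_pool(fmap):
    """Reduce a feature map to its single largest activation."""
    return max(max(row) for row in fmap)

def glyph_features(img, kernels):
    """A stand-in for CNN(I_c): one conv layer, one pooled value per kernel.
    A real glyph encoder stacks several learned layers."""
    return [global_max_pool(conv2d_valid(img, k)) for k in kernels]
```

Each kernel acts as a detector for one local stroke pattern in the glyph image; the pooled values form the two-dimensional-channel feature vector.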
the character-level features are obtained by fusing the stroke-order feature of the one-dimensional sequence channel and the glyph feature of the two-dimensional spatial channel in the character-level morphological features; for a Chinese character c, the stroke-order feature of the one-dimensional sequence channel is the subword combination G(c), each element g of which has a subword feature vector z_g; the glyph feature of the two-dimensional spatial channel is CNN(I_c); a composition-combination operation is applied to obtain the character-level feature representation h_c of the Chinese character c:

h_c = (∘_{g∈G(c)} z_g) ∘ CNN(I_c)

wherein ∘ is the composition-combination operator;
the word-level features are obtained by fusing the character-level features: for each word, the character-level features of its Chinese characters are accumulated and summed, and the result is then composed with the word-level representation to obtain the word-level feature v_w, namely:

v_w = w ∘ (h_{c_1} ⊕ h_{c_2} ⊕ … ⊕ h_{c_{N_c}})

wherein N_c is the number of Chinese characters contained in each word and ⊕ denotes vector addition;
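The claim leaves the composition-combination operator abstract; as an illustrative choice (an assumption, not the claim's definition), element-wise addition can stand in for it, with the word-level step using vector addition as stated:

```python
def compose(vectors):
    """Element-wise vector addition, standing in for the abstract
    composition-combination operator (an illustrative choice)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) for i in range(dim)]

def char_feature(subword_vecs, glyph_vec):
    # character-level feature: subword vectors of G(c) combined with CNN(I_c)
    return compose(subword_vecs + [glyph_vec])

def word_feature(word_vec, char_feats):
    # word-level feature: the word's own representation composed with the
    # accumulated sum of its N_c character-level features
    return compose([word_vec, compose(char_feats)])
```

With addition as the operator, the word-level feature is simply the sum of the word vector, every subword vector, and every glyph vector of its characters.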
optimizing and training the model with a Chinese text corpus data set D of a specified size, crawled in advance;
for each center word w, λ negative-sample noise words are drawn from the noise distribution P, and the final optimization objective is trained by maximum likelihood estimation:

L = Σ_{w∈D} Σ_{e∈T(w)} [ log σ(s(w, e)) + λ · E_{e'∼P} log σ(−s(w, e')) ]

wherein s(w, e) denotes the similarity function in the skip-gram model, w is the center word, e is a background word in the window of the center word w, T(w) is the set of context-window words of the center word w, λ is the number of negative samples per center word w, e' is a noise word obtained by negative sampling, E_{e'∼P} is the expectation term, and σ is the sigmoid function.
2. The morphological double-channel Chinese word embedding method based on strokes and fonts according to claim 1, wherein the preprocessing comprises:
performing word-segmentation processing on the Chinese text;
removing texts whose word count in the segmentation result is smaller than a set value;
and removing stop words to obtain the corresponding word sequence.
3. The morphological double-channel Chinese word embedding method based on strokes and fonts according to claim 1, wherein the stroke-order information and glyph information of the Chinese characters are crawled in advance from open-source dictionary data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910881062.0A CN110610006B (en) | 2019-09-18 | 2019-09-18 | Morphological double-channel Chinese word embedding method based on strokes and fonts |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110610006A CN110610006A (en) | 2019-12-24 |
CN110610006B true CN110610006B (en) | 2023-06-20 |
Family
ID=68892871
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910881062.0A Active CN110610006B (en) | 2019-09-18 | 2019-09-18 | Morphological double-channel Chinese word embedding method based on strokes and fonts |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110610006B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111476036A (en) * | 2020-04-10 | 2020-07-31 | 电子科技大学 | Word embedding learning method based on Chinese word feature substrings |
CN111539437B (en) * | 2020-04-27 | 2022-06-28 | 西南大学 | Detection and identification method of oracle-bone inscription components based on deep learning |
CN113505784B (en) * | 2021-06-11 | 2024-07-12 | 清华大学 | Automatic nail labeling analysis method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273355A (en) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | A kind of Chinese word vector generation method based on words joint training |
CN109408814A (en) * | 2018-09-30 | 2019-03-01 | 中国地质大学(武汉) | Across the language vocabulary representative learning method and system of China and Britain based on paraphrase primitive word |
CN109858039A (en) * | 2019-03-01 | 2019-06-07 | 北京奇艺世纪科技有限公司 | A kind of text information identification method and identification device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101408873A (en) * | 2007-10-09 | 2009-04-15 | 劳英杰 | Full scope semantic information integrative cognition system and application thereof |
CN109992783B (en) * | 2019-04-03 | 2020-10-30 | 同济大学 | Chinese word vector modeling method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108614875B (en) | Chinese emotion tendency classification method based on global average pooling convolutional neural network | |
Cao et al. | A joint model for word embedding and word morphology | |
CN110096698B (en) | Topic-considered machine reading understanding model generation method and system | |
CN109960804B (en) | Method and device for generating topic text sentence vector | |
CN110610006B (en) | Morphological double-channel Chinese word embedding method based on strokes and fonts | |
Fonseca et al. | Mac-morpho revisited: Towards robust part-of-speech tagging | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
US10810467B2 (en) | Flexible integrating recognition and semantic processing | |
KR101988165B1 (en) | Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students | |
CN113255331B (en) | Text error correction method, device and storage medium | |
CN112861524A (en) | Deep learning-based multilevel Chinese fine-grained emotion analysis method | |
CN112966525B (en) | Law field event extraction method based on pre-training model and convolutional neural network algorithm | |
Sifa et al. | Towards contradiction detection in german: a translation-driven approach | |
Theeramunkong et al. | Non-dictionary-based Thai word segmentation using decision trees | |
CN110929022A (en) | Text abstract generation method and system | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN111191413B (en) | Method, device and system for automatically marking event core content based on graph sequencing model | |
CN112287240A (en) | Case microblog evaluation object extraction method and device based on double-embedded multilayer convolutional neural network | |
Wang et al. | Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition | |
Jindal | A deep learning approach for arabic caption generation using roots-words | |
CN113987120A (en) | Public sentiment emotion classification method based on deep learning | |
CN109446334A (en) | A kind of method that realizing English Text Classification and relevant device | |
CN115878847B (en) | Video guiding method, system, equipment and storage medium based on natural language | |
Zhao et al. | Commented content classification with deep neural network based on attention mechanism | |
CN115130475A (en) | Extensible universal end-to-end named entity identification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||