CN110287483B - Unregistered word recognition method and system utilizing five-stroke character root deep learning - Google Patents

Info

Publication number
CN110287483B
CN110287483B (application CN201910492347.5A; also published as CN110287483A)
Authority
CN
China
Prior art keywords
character
deep learning
neural network
model
wubi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910492347.5A
Other languages
Chinese (zh)
Other versions
CN110287483A (en)
Inventor
肖政宏
闫艺婷
王华嘉
周健烨
李旺
梁志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN201910492347.5A priority Critical patent/CN110287483B/en
Publication of CN110287483A publication Critical patent/CN110287483A/en
Application granted granted Critical
Publication of CN110287483B publication Critical patent/CN110287483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention belongs to the technical field of natural language data processing and discloses an unregistered word recognition method and system that applies deep learning to Wubi (five-stroke) radicals. Each Chinese character is converted into 4 English letters according to the Wubi radical table; the letter embeddings corresponding to the words of the corpus are then used as the model's input embeddings to train a neural network; finally, the model outputs the closest word vector in the corpus, which serves as an important basis for identifying unregistered words, so that unregistered words can be recognized better. The invention provides a neural network entity recognition method that exploits Wubi radicals: Chinese words with similar radicals mostly share the same part of speech and have similar Wubi codes, and exploiting this improves the performance of a neural network model in recognizing unregistered words. Based on deep learning, the invention represents words with word vectors, alleviating the sparsity problem of high-dimensional vector spaces, and is simpler and more effective.

Description

Unregistered word recognition method and system utilizing five-stroke character root deep learning
Technical Field
The invention belongs to the technical field of natural language data processing, and particularly relates to an unregistered word recognition method and system utilizing five-stroke character root (Wubi radical) deep learning.
Background
Currently, the state of the art commonly used in the industry is as follows: the term "named entity", now widely used in the field of natural language processing, was first proposed in 1996 at the Sixth Message Understanding Conference (MUC-6). Most MUC-6 research was based on rule methods, such as word-shape or part-of-speech vocabulary rules: character-matching rules were formulated from cue words and from the context surrounding the named entity, and the work focused mainly on the information extraction task. The seven general subclasses of named entities proposed by Sekine target objects of interest for solving particular problems and do not meet the application requirements of automatic question answering and information retrieval.
In Chinese word segmentation, out-of-vocabulary (OOV) words, also called unregistered words, are a very important factor affecting segmentation quality, and named entities are the most prominent kind of unregistered word, so they are a problem that Chinese automatic word segmentation cannot avoid. The rule-based method requires many manually formulated rules, has low feasibility, ports poorly when application domains differ greatly, and then requires the rules to be re-formulated. Machine-learning-based methods follow two ideas: one first recognizes all named-entity boundaries in the text and then classifies the entities with a model; the other is sequence labeling, in which each word in the corpus may carry several candidate class labels corresponding to positions inside various named entities, and it cannot identify unregistered words.
Among existing recognition models, neural network models (such as LSTM and RNN) show strong competitiveness in entity recognition. Because a neural network model uses the characters of the training set as its basic input units, registered (in-vocabulary) words are easy to identify, and test results on experimental data sets also verify that such models can identify registered words; however, they cannot identify unregistered words well.
In summary, the problems of the prior art are:
(1) The rule-based method requires manual formulation of many rules, has low feasibility, and ports poorly when the application fields differ greatly, in which case the rules need to be re-formulated.
(2) Recognition methods based on machine learning and on neural network models cannot recognize unregistered words.
The difficulty of solving the technical problems is as follows:
as the academic community continues to study named entity recognition, the recognition can be performed with different models and algorithms.
Meaning of solving the technical problems:
currently, the terminology of each field is huge in category, broad in content, large in information quantity and complex in composition. People therefore often fail to describe or express a term accurately and completely, resorting to aliases, abbreviations, partial words and other variants, and misuses such as wrongly written characters, ambiguous words and near-synonyms are frequently mixed in. This seriously affects name recognition in these fields. In conclusion, using Wubi (five-stroke) radicals to identify unregistered words has important significance and practical application value. The model provided by the invention exploits the characteristics of Wubi radicals; compared with a traditional model that uses word vectors, it can well avoid the influence of word segmentation errors.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a method and a system for identifying unregistered words by utilizing five-stroke character root deep learning.
The invention is realized as follows: the method for identifying unregistered words utilizing five-stroke character root deep learning specifically comprises the following steps:
Step 1: embed the Wubi radicals and merge them into the original characters, constructing a comprehensive character representation for each character in the input sentence;
Step 2: look up the letter embedding of each English letter corresponding to the character;
Step 3: automatically extract n-gram features of the character information with a CNN, simulating different n-gram features by generating different sets of feature maps; divide each character into strokes to generate an n-gram model containing the character representations;
Step 4: adopt convolutional neural networks with filters of different sizes to simulate a traditional n-gram model;
Step 5: input the character vectors into an LSTM neural network model for training, modeling the context information of each English letter in the character;
Step 6: merge the character vectors and feed the radical-integrated character embedding to the output side of the LSTM neural network, so as to decode and predict the final tag sequence of the input sentence.
Further, in Step 1, constructing a comprehensive character representation for each character specifically includes:
converting each Chinese character into 4 English letters according to the Wubi radical table, and appending "·" as padding for Chinese characters whose Wubi code has fewer than 4 English letters.
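A minimal sketch of this conversion is given below, assuming the radical table is available as a Python dictionary; the entries for 玲 and 铃 follow Table 1 of the description, while the third entry and its code are hypothetical and serve only to illustrate padding:

WUBI_TABLE = {
    "玲": "gwyc",   # code taken from Table 1 of the description
    "铃": "qwyc",   # code taken from Table 1 of the description
    "某": "af",     # hypothetical short code, used only to illustrate padding
}

def to_wubi4(char: str, pad: str = "·") -> str:
    """Return the character's Wubi code padded (or cut) to exactly 4 symbols."""
    code = WUBI_TABLE.get(char, "")
    return (code + pad * 4)[:4]

print(to_wubi4("玲"))   # gwyc
print(to_wubi4("某"))   # af··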
Another object of the present invention is to provide an unregistered word recognition system using Wubi radical deep learning, based on the above unregistered word recognition method; the system includes:
a character construction module, used for embedding the Wubi radicals, merging them into the original characters, and constructing a comprehensive character representation for each character in the input sentence;
a character lookup module, used for looking up the letter embedding corresponding to each English letter;
a model construction module, used for automatically extracting n-gram features of the character information with a CNN and simulating different n-gram features by generating different sets of feature maps, and for dividing each character into strokes to generate an n-gram model containing the character representations;
a model simulation module, used for simulating a traditional n-gram model with convolutional neural networks having filters of different sizes;
a training module, used for inputting the character vectors into the LSTM neural network model for training and modeling the context information of each English letter in the character;
and a character embedding module, used for merging the character vectors, feeding the radical-integrated character embedding to the output side of the LSTM neural network, and decoding and predicting the final tag sequence of the input sentence.
It is another object of the present invention to provide a computer program applying the unregistered word recognition method using Wubi radical deep learning.
Another object of the present invention is to provide an information data processing terminal implementing the unregistered word recognition method using Wubi radical deep learning.
It is another object of the present invention to provide a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the unregistered word recognition method using Wubi radical deep learning.
In summary, the invention has the advantages and positive effects that:
the invention provides a neural network entity recognition method by utilizing the radicals of five strokes, which can improve the performance of recognizing unregistered words by a neural network model by utilizing the Chinese character words with similar radicals, most of which have the same part of speech and similar five strokes codes.
The invention uses word vectors to represent words based on deep learning, solves the sparse problem of high latitude vector space, and the word vectors per se contain more semantic information than manually selected features, and can acquire the feature representation of unified vector space from the text fused by multi-source heterogeneous data, thereby being simpler and more effective.
The invention converts word embedding into letter embedding, and converts each Chinese character into 4 English letters by utilizing the principle that five strokes of Chinese characters with the same meaning are similar in coding, thereby improving the performance of the neural network model in identifying the unregistered words.
The invention can replace the strokes, and the strokes of each Chinese character are used as words to be embedded, so that the accuracy of identifying the unregistered words by the model can be improved; meanwhile, the main stream level can be achieved only by the word vector and the character vector, and the effect can be further improved by adding high-quality dictionary features.
The invention combines LSTM and five-stroke character root model for identifying Chinese naming entity. The model encodes the input character sequence and all potential vocabularies matching the wubi root dictionary. In contrast to character-based methods, the present invention explicitly utilizes word and word order information. The gating loop unit enables the model to select the most relevant characters and words from sentences to generate better named entity recognition results.
The invention uses five-stroke character roots to represent Chinese characters, and the representations are combined as character embedding, so that the form and semantic information of exploring characters can be enhanced; according to the invention, the n-gram characteristics are automatically extracted by using the neural network, each character is divided into strokes to provide an n-gram model, each character is represented by 4 English letters, and fuzzy information can be brought to different characters with the same type, so that the performance of identifying the unregistered words by an algorithm is improved.
The invention adopts the five-stroke representation method and embeds the integrated character of the character root to form the final input, and then adopts the convolution neural network of the filters with different sizes to simulate the traditional n-gram model, thereby being beneficial to identifying the unregistered words.
The five-stroke method provided by the invention can distinguish words with similar structures. If the characters are less than four English letters, the blank letters can be used for filling initialization embedding to ensure that each character has four stroke level representations, and stroke input vector values are continuously updated during training of the model, so that the performance of the model can be enhanced.
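As an illustration of this trainable stroke-level embedding, the sketch below is an assumption-laden example rather than the patent's implementation (assumptions: PyTorch, a 27-symbol alphabet of 26 letters plus the "·" pad, embedding size 64); every padded 4-letter code maps to four stroke-level vectors whose values, including the pad vector's, are updated during training:

import torch
import torch.nn as nn

ALPHABET = "·abcdefghijklmnopqrstuvwxyz"           # index 0 is the pad symbol "·"
char_to_idx = {c: i for i, c in enumerate(ALPHABET)}

# One trainable vector per letter; all vectors, including the pad's, are updated.
letter_embed = nn.Embedding(len(ALPHABET), 64)

def embed_wubi_code(code: str) -> torch.Tensor:
    """Map a padded 4-letter Wubi code to a (4, 64) tensor of stroke-level vectors."""
    idx = torch.tensor([char_to_idx[c] for c in code])
    return letter_embed(idx)

print(embed_wubi_code("gwyc").shape)   # torch.Size([4, 64])
print(embed_wubi_code("w···").shape)   # torch.Size([4, 64])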
Drawings
FIG. 1 is a flowchart of an unregistered word recognition method using five-stroke radical deep learning according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an unregistered word recognition method using five-stroke radical deep learning according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The technical scheme of the invention is described in detail below with reference to the accompanying drawings.
As shown in FIGS. 1-2, the method for identifying unregistered words utilizing five-stroke radical deep learning provided by the embodiment of the present invention specifically includes:
S101: embed the Wubi radicals and merge them into the original characters, constructing a comprehensive character representation for each character in the input sentence; each Chinese character is converted into 4 English letters according to the Wubi radical table, and "·" is appended, before or after, as padding for Chinese characters whose Wubi code has fewer than 4 English letters;
S102: look up the letter embedding of each English letter corresponding to the character;
S103: automatically extract n-gram features of the character information with a CNN, simulating different n-gram features by generating different sets of feature maps; divide each character into strokes to generate an n-gram model containing the character representations;
S104: adopt convolutional neural networks with filters of different sizes to simulate a traditional n-gram model;
S105: input the character vectors into an LSTM neural network model for training, modeling the context information of each English letter in the character;
S106: merge the character vectors and feed the radical-integrated character embedding to the output side of the LSTM neural network to decode and predict the final tag sequence of the input sentence. A sketch of this S101-S106 pipeline is given after the list.
The technical scheme of the invention is further described below with reference to specific embodiments.
Example 1:
the invention combines LSTM and five-stroke character root model for identifying Chinese naming entity. The present invention encodes the input character sequence and all potential vocabularies that match the wubi root dictionary. In contrast to character-based methods, the present invention explicitly utilizes word and word order information. The gating loop unit enables the model to select the most relevant characters and words from sentences to generate better named entity recognition results.
In the aspect of word input embedding, the embodiment of the invention utilizes five-stroke character roots to represent Chinese characters, and the representations are combined as character embedding, so that the form and semantic information of explored characters can be enhanced, and n-gram characteristics can be automatically extracted by using a neural network. Each character is divided into strokes to propose an n-gram model, each character being represented by 4 english letters. For different characters with the same type, the implementation of the method can bring fuzzy information, so that the performance of the algorithm for identifying the unregistered words is improved.
Table 1 Comparison of the two character encoding methods
Word    Five-stroke (Wubi) code
玲 ("exquisite")    gwyc
铃 ("bell")    qwyc
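The two codes differ only in the first letter (the radical key), so the letter-level inputs seen by the model for the two characters overlap heavily. A tiny illustrative check of that overlap follows; the characters and codes are those of Table 1, while the helper function is a hypothetical example:

def letter_ngrams(code: str, n: int = 2) -> set:
    """Return the set of letter n-grams of a Wubi code string."""
    return {code[i:i + n] for i in range(len(code) - n + 1)}

a, b = "gwyc", "qwyc"                       # 玲 and 铃 from Table 1
print(sum(x == y for x, y in zip(a, b)))    # 3 of the 4 letters coincide
print(letter_ngrams(a) & letter_ngrams(b))  # shared bigrams: 'wy' and 'yc'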
The embodiment of the invention adopts the Wubi representation and embeds the radical-integrated characters to form the final input, then adopts convolutional neural networks with filters of different sizes to simulate a traditional n-gram model, which helps identify unregistered words.
Named entity recognition is widely applied in many fields, for example recognizing person names and place names in a sentence, recognizing drug names in the medical domain, or recognizing product names in e-commerce search. The invention combines a long short-term memory recurrent network with the Wubi radical model, performs well for named entity recognition in the financial and insurance field, and improves the accuracy of recognizing insurance product names.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (4)

1. A method for identifying unregistered words utilizing five-stroke character root deep learning, characterized by comprising the following steps:
Step 1: embedding the Wubi radicals and merging them into the original characters, constructing a comprehensive character representation for each character in the input sentence;
Step 2: looking up the letter embedding of each English letter corresponding to the character;
Step 3: automatically extracting n-gram features of the character information with a CNN, simulating different n-gram features by generating different sets of feature maps; dividing each character into strokes to generate an n-gram model containing the character representations;
Step 4: adopting convolutional neural networks with filters of different sizes to simulate a traditional n-gram model;
Step 5: inputting the character vectors into an LSTM neural network model for training, modeling the context information of each English letter in the character;
Step 6: merging the character vectors and feeding the radical-integrated character embedding to the output side of the LSTM neural network, so as to decode and predict the final tag sequence of the input sentence;
wherein, in Step 1, constructing a comprehensive character representation for each character specifically includes: converting each Chinese character into 4 English letters according to the Wubi radical table, and appending "·" as padding for Chinese characters whose Wubi code has fewer than 4 English letters.
2. An unregistered word recognition system using five-stroke character root deep learning, based on the unregistered word recognition method using five-stroke character root deep learning of claim 1, characterized in that the system includes:
a character construction module, used for embedding the Wubi radicals, merging them into the original characters, and constructing a comprehensive character representation for each character in the input sentence;
a character lookup module, used for looking up the letter embedding corresponding to each English letter;
a model construction module, used for automatically extracting n-gram features of the character information with a CNN and simulating different n-gram features by generating different sets of feature maps, and for dividing each character into strokes to generate an n-gram model containing the character representations;
a model simulation module, used for simulating a traditional n-gram model with convolutional neural networks having filters of different sizes;
a training module, used for inputting the character vectors into the LSTM neural network model for training and modeling the context information of each English letter in the character;
and a character embedding module, used for merging the character vectors, feeding the radical-integrated character embedding to the output side of the LSTM neural network, and decoding and predicting the final tag sequence of the input sentence.
3. An information data processing terminal implementing the unregistered word recognition method using five-stroke character root deep learning according to claim 1.
4. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the unregistered word recognition method using five-stroke character root deep learning of claim 1.
CN201910492347.5A 2019-06-06 2019-06-06 Unregistered word recognition method and system utilizing five-stroke character root deep learning Active CN110287483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910492347.5A CN110287483B (en) 2019-06-06 2019-06-06 Unregistered word recognition method and system utilizing five-stroke character root deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910492347.5A CN110287483B (en) 2019-06-06 2019-06-06 Unregistered word recognition method and system utilizing five-stroke character root deep learning

Publications (2)

Publication Number Publication Date
CN110287483A CN110287483A (en) 2019-09-27
CN110287483B (en) 2023-12-05

Family

ID=68003508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910492347.5A Active CN110287483B (en) 2019-06-06 2019-06-06 Unregistered word recognition method and system utilizing five-stroke character root deep learning

Country Status (1)

Country Link
CN (1) CN110287483B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126160B (en) * 2019-11-28 2023-04-07 天津瑟威兰斯科技有限公司 Intelligent Chinese character structure evaluation method and system constructed based on five-stroke input method
CN111523325A (en) * 2020-04-20 2020-08-11 电子科技大学 Chinese named entity recognition method based on strokes
CN112507190B (en) * 2020-12-17 2023-04-07 新华智云科技有限公司 Method and system for extracting keywords of financial and economic news

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3433795A4 (en) * 2016-03-24 2019-11-13 Ramot at Tel-Aviv University Ltd. Method and system for converting an image to text

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354701A (en) * 2016-08-30 2017-01-25 腾讯科技(深圳)有限公司 Chinese character processing method and device
CN107832289A (en) * 2017-10-12 2018-03-23 北京知道未来信息技术有限公司 A kind of name entity recognition method based on LSTM CNN
CN108875021A (en) * 2017-11-10 2018-11-23 云南大学 A kind of sentiment analysis method based on region CNN-LSTM
CN107977361A (en) * 2017-12-06 2018-05-01 哈尔滨工业大学深圳研究生院 The Chinese clinical treatment entity recognition method represented based on deep semantic information
CN108595592A (en) * 2018-04-19 2018-09-28 成都睿码科技有限责任公司 A kind of text emotion analysis method based on five-stroke form code character level language model
CN108829823A (en) * 2018-06-13 2018-11-16 北京信息科技大学 A kind of file classification method
CN109033042A (en) * 2018-06-28 2018-12-18 中译语通科技股份有限公司 BPE coding method and system, machine translation system based on the sub- word cell of Chinese
CN109388807A (en) * 2018-10-30 2019-02-26 中山大学 The method, apparatus and storage medium of electronic health record name Entity recognition
CN109597891A (en) * 2018-11-26 2019-04-09 重庆邮电大学 Text emotion analysis method based on two-way length Memory Neural Networks in short-term

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pan Meisen (潘梅森), "Degraded image restoration with a neural network based on improved simulated annealing" (基于改进模拟退火的神经网络降质图像恢复); Computer Engineering and Design (《计算机工程与设计》); 2006-12-31; Vol. 27, No. 24; pp. 1-4 *

Also Published As

Publication number Publication date
CN110287483A (en) 2019-09-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant