CN110287483B - Unregistered word recognition method and system utilizing five-stroke character root deep learning - Google Patents

Info

Publication number
CN110287483B
CN110287483B (application CN201910492347.5A; also published as CN110287483A)
Authority
CN
China
Prior art keywords
character
deep learning
neural network
model
wubi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910492347.5A
Other languages
Chinese (zh)
Other versions
CN110287483A (en)
Inventor
肖政宏
闫艺婷
王华嘉
周健烨
李旺
梁志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN201910492347.5A priority Critical patent/CN110287483B/en
Publication of CN110287483A publication Critical patent/CN110287483A/en
Application granted granted Critical
Publication of CN110287483B publication Critical patent/CN110287483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention belongs to the technical field of natural language data processing and discloses an unregistered word recognition method and system that applies deep learning to Wubi (five-stroke) radicals. Each Chinese character is converted into 4 English letters according to the Wubi radical table; the letter embeddings corresponding to the words of the corpus are then used as the model's input embeddings to train a neural network; finally, the model outputs the closest word vector in the corpus, which serves as an important basis for identifying unregistered words, so that unregistered words can be recognized better. The invention provides a neural network entity recognition method that exploits Wubi radicals: Chinese words with similar radicals mostly share the same part of speech and have similar Wubi codes, and exploiting this improves the performance of a neural network model in recognizing unregistered words. Based on deep learning, the invention represents words with word vectors, alleviating the sparsity problem of high-dimensional vector spaces, and is simpler and more effective.

Description

Unregistered word recognition method and system utilizing five-stroke character root deep learning
Technical Field
The invention belongs to the technical field of natural language data processing, and particularly relates to an unregistered word recognition method and system utilizing five-stroke character root (Wubi radical) deep learning.
Background
Currently, the state of the art commonly used in the industry is as follows: the term "named entity", now widely used in the field of natural language processing, was first proposed in 1996 at the Sixth Message Understanding Conference (MUC-6). Most MUC-6 research was based on rule methods, such as word-shape or part-of-speech vocabulary rules: character-matching rules were formulated from cue words and from the context surrounding the named entity, and the work focused mainly on the information extraction task. The seven general subclasses of named entities proposed by Sekine target objects of interest for solving particular problems and do not meet the application requirements of automatic question answering and information retrieval.
In Chinese word segmentation, out-of-vocabulary (OOV) words, also called unregistered words, are a very important factor affecting segmentation quality, and named entities are the most prominent kind of unregistered word, so they are a problem that Chinese automatic word segmentation cannot avoid. The rule-based method requires many manually formulated rules, has low feasibility, ports poorly when application domains differ greatly, and then requires the rules to be re-formulated. Machine-learning-based methods follow two ideas: one first recognizes all named-entity boundaries in the text and then classifies the entities with a model; the other is sequence labeling, in which each word in the corpus may carry several candidate class labels corresponding to positions inside various named entities, and it cannot identify unregistered words.
Among existing recognition models, neural network models (such as LSTM and RNN) show strong competitiveness in entity recognition. Because a neural network model uses the characters of the training set as its basic input units, registered (in-vocabulary) words are easy to identify, and test results on experimental data sets also verify that such models can identify registered words; however, they cannot identify unregistered words well.
In summary, the problems of the prior art are:
(1) The rule-based method requires manual formulation of many rules, has low feasibility, and ports poorly when the application fields differ greatly, in which case the rules need to be re-formulated.
(2) Recognition methods based on machine learning and on neural network models cannot recognize unregistered words.
The difficulty of solving the technical problems is as follows:
as the academic community continues to study named entity recognition, the recognition can be performed with different models and algorithms.
Meaning of solving the technical problems:
currently, the terminology of each field is huge in category, broad in content, large in information quantity and complex in composition. People therefore often fail to describe or express a term accurately and completely, resorting to aliases, abbreviations, partial words and other variants, and misuses such as wrongly written characters, ambiguous words and near-synonyms are frequently mixed in. This seriously affects name recognition in these fields. In conclusion, using Wubi (five-stroke) radicals to identify unregistered words has important significance and practical application value. The model provided by the invention exploits the characteristics of Wubi radicals; compared with a traditional model that uses word vectors, it can well avoid the influence of word segmentation errors.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a method and a system for identifying unregistered words by utilizing five-stroke character root deep learning.
The invention is realized as follows: the method for identifying unregistered words utilizing five-stroke character root deep learning specifically comprises the following steps:
Step 1: embed the Wubi radicals and merge them into the original characters, constructing a comprehensive character representation for each character in the input sentence;
Step 2: look up the letter embedding of each English letter corresponding to the character;
Step 3: automatically extract n-gram features of the character information with a CNN, simulating different n-gram features by generating different sets of feature maps; divide each character into strokes to generate an n-gram model containing the character representations;
Step 4: adopt convolutional neural networks with filters of different sizes to simulate a traditional n-gram model;
Step 5: input the character vectors into an LSTM neural network model for training, modeling the context information of each English letter in the character;
Step 6: merge the character vectors and feed the radical-integrated character embedding to the output side of the LSTM neural network, so as to decode and predict the final tag sequence of the input sentence.
Further, in Step 1, constructing a comprehensive character representation for each character specifically includes:
converting each Chinese character into 4 English letters according to the Wubi radical table, and appending "·" as padding for Chinese characters whose Wubi code has fewer than 4 English letters.
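A minimal sketch of this conversion is given below, assuming the radical table is available as a Python dictionary; the entries for 玲 and 铃 follow Table 1 of the description, while the third entry and its code are hypothetical and serve only to illustrate padding:

WUBI_TABLE = {
    "玲": "gwyc",   # code taken from Table 1 of the description
    "铃": "qwyc",   # code taken from Table 1 of the description
    "某": "af",     # hypothetical short code, used only to illustrate padding
}

def to_wubi4(char: str, pad: str = "·") -> str:
    """Return the character's Wubi code padded (or cut) to exactly 4 symbols."""
    code = WUBI_TABLE.get(char, "")
    return (code + pad * 4)[:4]

print(to_wubi4("玲"))   # gwyc
print(to_wubi4("某"))   # af··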
Another object of the present invention is to provide an unregistered word recognition system using Wubi radical deep learning, based on the above unregistered word recognition method; the system includes:
a character construction module, used for embedding the Wubi radicals, merging them into the original characters, and constructing a comprehensive character representation for each character in the input sentence;
a character lookup module, used for looking up the letter embedding corresponding to each English letter;
a model construction module, used for automatically extracting n-gram features of the character information with a CNN and simulating different n-gram features by generating different sets of feature maps, and for dividing each character into strokes to generate an n-gram model containing the character representations;
a model simulation module, used for simulating a traditional n-gram model with convolutional neural networks having filters of different sizes;
a training module, used for inputting the character vectors into the LSTM neural network model for training and modeling the context information of each English letter in the character;
and a character embedding module, used for merging the character vectors, feeding the radical-integrated character embedding to the output side of the LSTM neural network, and decoding and predicting the final tag sequence of the input sentence.
It is another object of the present invention to provide a computer program applying the unregistered word recognition method using Wubi radical deep learning.
Another object of the present invention is to provide an information data processing terminal implementing the unregistered word recognition method using Wubi radical deep learning.
It is another object of the present invention to provide a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the unregistered word recognition method using Wubi radical deep learning.
In summary, the invention has the advantages and positive effects that:
the invention provides a neural network entity recognition method by utilizing the radicals of five strokes, which can improve the performance of recognizing unregistered words by a neural network model by utilizing the Chinese character words with similar radicals, most of which have the same part of speech and similar five strokes codes.
The invention uses word vectors to represent words based on deep learning, solves the sparse problem of high latitude vector space, and the word vectors per se contain more semantic information than manually selected features, and can acquire the feature representation of unified vector space from the text fused by multi-source heterogeneous data, thereby being simpler and more effective.
The invention converts word embedding into letter embedding, and converts each Chinese character into 4 English letters by utilizing the principle that five strokes of Chinese characters with the same meaning are similar in coding, thereby improving the performance of the neural network model in identifying the unregistered words.
The invention can replace the strokes, and the strokes of each Chinese character are used as words to be embedded, so that the accuracy of identifying the unregistered words by the model can be improved; meanwhile, the main stream level can be achieved only by the word vector and the character vector, and the effect can be further improved by adding high-quality dictionary features.
The invention combines LSTM and five-stroke character root model for identifying Chinese naming entity. The model encodes the input character sequence and all potential vocabularies matching the wubi root dictionary. In contrast to character-based methods, the present invention explicitly utilizes word and word order information. The gating loop unit enables the model to select the most relevant characters and words from sentences to generate better named entity recognition results.
The invention uses five-stroke character roots to represent Chinese characters, and the representations are combined as character embedding, so that the form and semantic information of exploring characters can be enhanced; according to the invention, the n-gram characteristics are automatically extracted by using the neural network, each character is divided into strokes to provide an n-gram model, each character is represented by 4 English letters, and fuzzy information can be brought to different characters with the same type, so that the performance of identifying the unregistered words by an algorithm is improved.
The invention adopts the five-stroke representation method and embeds the integrated character of the character root to form the final input, and then adopts the convolution neural network of the filters with different sizes to simulate the traditional n-gram model, thereby being beneficial to identifying the unregistered words.
The five-stroke method provided by the invention can distinguish words with similar structures. If the characters are less than four English letters, the blank letters can be used for filling initialization embedding to ensure that each character has four stroke level representations, and stroke input vector values are continuously updated during training of the model, so that the performance of the model can be enhanced.
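As an illustration of this trainable stroke-level embedding, the sketch below is an assumption-laden example rather than the patent's implementation (assumptions: PyTorch, a 27-symbol alphabet of 26 letters plus the "·" pad, embedding size 64); every padded 4-letter code maps to four stroke-level vectors whose values, including the pad vector's, are updated during training:

import torch
import torch.nn as nn

ALPHABET = "·abcdefghijklmnopqrstuvwxyz"           # index 0 is the pad symbol "·"
char_to_idx = {c: i for i, c in enumerate(ALPHABET)}

# One trainable vector per letter; all vectors, including the pad's, are updated.
letter_embed = nn.Embedding(len(ALPHABET), 64)

def embed_wubi_code(code: str) -> torch.Tensor:
    """Map a padded 4-letter Wubi code to a (4, 64) tensor of stroke-level vectors."""
    idx = torch.tensor([char_to_idx[c] for c in code])
    return letter_embed(idx)

print(embed_wubi_code("gwyc").shape)   # torch.Size([4, 64])
print(embed_wubi_code("w···").shape)   # torch.Size([4, 64])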
Drawings
FIG. 1 is a flowchart of an unregistered word recognition method using five-stroke radical deep learning according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an unregistered word recognition method using five-stroke radical deep learning according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The technical scheme of the invention is described in detail below with reference to the accompanying drawings.
As shown in FIGS. 1-2, the method for identifying unregistered words utilizing five-stroke radical deep learning provided by the embodiment of the present invention specifically includes:
S101: embed the Wubi radicals and merge them into the original characters, constructing a comprehensive character representation for each character in the input sentence; each Chinese character is converted into 4 English letters according to the Wubi radical table, and "·" is appended, before or after, as padding for Chinese characters whose Wubi code has fewer than 4 English letters;
S102: look up the letter embedding of each English letter corresponding to the character;
S103: automatically extract n-gram features of the character information with a CNN, simulating different n-gram features by generating different sets of feature maps; divide each character into strokes to generate an n-gram model containing the character representations;
S104: adopt convolutional neural networks with filters of different sizes to simulate a traditional n-gram model;
S105: input the character vectors into an LSTM neural network model for training, modeling the context information of each English letter in the character;
S106: merge the character vectors and feed the radical-integrated character embedding to the output side of the LSTM neural network to decode and predict the final tag sequence of the input sentence. A sketch of this S101-S106 pipeline is given after the list.
The technical scheme of the invention is further described below with reference to specific embodiments.
Example 1:
the invention combines LSTM and five-stroke character root model for identifying Chinese naming entity. The present invention encodes the input character sequence and all potential vocabularies that match the wubi root dictionary. In contrast to character-based methods, the present invention explicitly utilizes word and word order information. The gating loop unit enables the model to select the most relevant characters and words from sentences to generate better named entity recognition results.
In the aspect of word input embedding, the embodiment of the invention utilizes five-stroke character roots to represent Chinese characters, and the representations are combined as character embedding, so that the form and semantic information of explored characters can be enhanced, and n-gram characteristics can be automatically extracted by using a neural network. Each character is divided into strokes to propose an n-gram model, each character being represented by 4 english letters. For different characters with the same type, the implementation of the method can bring fuzzy information, so that the performance of the algorithm for identifying the unregistered words is improved.
Table 1 Comparison of the two character encoding methods
Word    Five-stroke (Wubi) code
玲 ("exquisite")    gwyc
铃 ("bell")    qwyc
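The two codes differ only in the first letter (the radical key), so the letter-level inputs seen by the model for the two characters overlap heavily. A tiny illustrative check of that overlap follows; the characters and codes are those of Table 1, while the helper function is a hypothetical example:

def letter_ngrams(code: str, n: int = 2) -> set:
    """Return the set of letter n-grams of a Wubi code string."""
    return {code[i:i + n] for i in range(len(code) - n + 1)}

a, b = "gwyc", "qwyc"                       # 玲 and 铃 from Table 1
print(sum(x == y for x, y in zip(a, b)))    # 3 of the 4 letters coincide
print(letter_ngrams(a) & letter_ngrams(b))  # shared bigrams: 'wy' and 'yc'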
The embodiment of the invention adopts the Wubi representation and embeds the radical-integrated characters to form the final input, then adopts convolutional neural networks with filters of different sizes to simulate a traditional n-gram model, which helps identify unregistered words.
Named entity recognition is widely applied in many fields, for example recognizing person names and place names in a sentence, recognizing drug names in the medical domain, or recognizing product names in e-commerce search. The invention combines a long short-term memory recurrent network with the Wubi radical model, performs well for named entity recognition in the financial and insurance field, and improves the accuracy of recognizing insurance product names.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (4)

1. A method for identifying unregistered words utilizing five-stroke character root deep learning, characterized by comprising the following steps:
Step 1: embedding the Wubi radicals and merging them into the original characters, constructing a comprehensive character representation for each character in the input sentence;
Step 2: looking up the letter embedding of each English letter corresponding to the character;
Step 3: automatically extracting n-gram features of the character information with a CNN, simulating different n-gram features by generating different sets of feature maps; dividing each character into strokes to generate an n-gram model containing the character representations;
Step 4: adopting convolutional neural networks with filters of different sizes to simulate a traditional n-gram model;
Step 5: inputting the character vectors into an LSTM neural network model for training, modeling the context information of each English letter in the character;
Step 6: merging the character vectors and feeding the radical-integrated character embedding to the output side of the LSTM neural network, so as to decode and predict the final tag sequence of the input sentence;
wherein, in Step 1, constructing a comprehensive character representation for each character specifically includes: converting each Chinese character into 4 English letters according to the Wubi radical table, and appending "·" as padding for Chinese characters whose Wubi code has fewer than 4 English letters.
2. An unregistered word recognition system using five-stroke character root deep learning, based on the unregistered word recognition method using five-stroke character root deep learning of claim 1, characterized in that the system includes:
a character construction module, used for embedding the Wubi radicals, merging them into the original characters, and constructing a comprehensive character representation for each character in the input sentence;
a character lookup module, used for looking up the letter embedding corresponding to each English letter;
a model construction module, used for automatically extracting n-gram features of the character information with a CNN and simulating different n-gram features by generating different sets of feature maps, and for dividing each character into strokes to generate an n-gram model containing the character representations;
a model simulation module, used for simulating a traditional n-gram model with convolutional neural networks having filters of different sizes;
a training module, used for inputting the character vectors into the LSTM neural network model for training and modeling the context information of each English letter in the character;
and a character embedding module, used for merging the character vectors, feeding the radical-integrated character embedding to the output side of the LSTM neural network, and decoding and predicting the final tag sequence of the input sentence.
3. An information data processing terminal implementing the unregistered word recognition method using five-stroke character root deep learning according to claim 1.
4. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the unregistered word recognition method using five-stroke character root deep learning of claim 1.
CN201910492347.5A 2019-06-06 2019-06-06 Unregistered word recognition method and system utilizing five-stroke character root deep learning Active CN110287483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910492347.5A CN110287483B (en) 2019-06-06 2019-06-06 Unregistered word recognition method and system utilizing five-stroke character root deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910492347.5A CN110287483B (en) 2019-06-06 2019-06-06 Unregistered word recognition method and system utilizing five-stroke character root deep learning

Publications (2)

Publication Number Publication Date
CN110287483A CN110287483A (en) 2019-09-27
CN110287483B (en) 2023-12-05

Family

ID=68003508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910492347.5A Active CN110287483B (en) 2019-06-06 2019-06-06 Unregistered word recognition method and system utilizing five-stroke character root deep learning

Country Status (1)

Country Link
CN (1) CN110287483B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126160B (en) * 2019-11-28 2023-04-07 天津瑟威兰斯科技有限公司 Intelligent Chinese character structure evaluation method and system constructed based on five-stroke input method
CN111523325A (en) * 2020-04-20 2020-08-11 电子科技大学 Chinese named entity recognition method based on strokes
CN112507190B (en) * 2020-12-17 2023-04-07 新华智云科技有限公司 Method and system for extracting keywords of financial and economic news

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3433795A4 (en) * 2016-03-24 2019-11-13 Ramot at Tel-Aviv University Ltd. Method and system for converting an image to text

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354701A (en) * 2016-08-30 2017-01-25 腾讯科技(深圳)有限公司 Chinese character processing method and device
CN107832289A (en) * 2017-10-12 2018-03-23 北京知道未来信息技术有限公司 A kind of name entity recognition method based on LSTM CNN
CN108875021A (en) * 2017-11-10 2018-11-23 云南大学 A kind of sentiment analysis method based on region CNN-LSTM
CN107977361A (en) * 2017-12-06 2018-05-01 哈尔滨工业大学深圳研究生院 The Chinese clinical treatment entity recognition method represented based on deep semantic information
CN108595592A (en) * 2018-04-19 2018-09-28 成都睿码科技有限责任公司 A kind of text emotion analysis method based on five-stroke form code character level language model
CN108829823A (en) * 2018-06-13 2018-11-16 北京信息科技大学 A kind of file classification method
CN109033042A (en) * 2018-06-28 2018-12-18 中译语通科技股份有限公司 BPE coding method and system, machine translation system based on the sub- word cell of Chinese
CN109388807A (en) * 2018-10-30 2019-02-26 中山大学 The method, apparatus and storage medium of electronic health record name Entity recognition
CN109597891A (en) * 2018-11-26 2019-04-09 重庆邮电大学 Text emotion analysis method based on two-way length Memory Neural Networks in short-term

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pan Meisen (潘梅森), "Degraded image restoration with a neural network based on improved simulated annealing" (基于改进模拟退火的神经网络降质图像恢复); Computer Engineering and Design (《计算机工程与设计》); 2006-12-31; Vol. 27, No. 24; pp. 1-4 *

Also Published As

Publication number Publication date
CN110287483A (en) 2019-09-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant