CN116579343A - A Named Entity Recognition Method for Chinese Culture and Tourism - Google Patents
- Publication number
- CN116579343A (application CN202310560194.XA)
- Authority
- CN
- China
- Prior art keywords
- representation
- chinese
- tourism
- input
- named entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a named entity recognition method for Chinese culture and tourism, comprising the following steps: S1, acquiring Chinese culture and tourism text data and inputting it into a character embedding layer to obtain a character vector representation; S2, inputting the character vector representation into a bidirectional long short-term memory network layer to obtain a context representation; S3, inputting the context representation into a CNN layer to obtain a multi-scale local context feature fusion representation; S4, inputting the multi-scale local context feature fusion representation into a CRF layer, which performs sequence labeling to complete named entity recognition for Chinese culture and tourism. Considering that named entity recognition for Chinese tourism has received little research attention, the invention builds a network specifically for Chinese culture and tourism text data: the second CNN module in the CNN layer learns a multi-scale local context feature fusion representation, strengthening semantic correlations and improving the feature representation beneficial to Chinese recognition.
Description
Technical Field
The invention belongs to the technical field of information extraction, and in particular relates to a named entity recognition method for Chinese culture and tourism.
Background Art
Named entity recognition (NER) is a fundamental information extraction task in natural language processing (NLP) that supports many downstream tasks, such as information extraction, social media analysis, search engines, machine translation, and knowledge graphs. The goal of NER is to extract predefined entities from a sentence and identify their correct types, such as person, place, and organization. Early approaches fell into two categories: rule-based methods and statistics-based methods. With the rise of deep learning, NER research has made great progress across diverse domains, such as medicine, finance, and news. However, research on named entity recognition for culture and tourism remains very scarce and has received little attention.
Owing to the differences between languages, there is also considerable research on NER for specific languages, such as English, Arabic, and Hindi, with most researchers focusing on English NER. Although Chinese is an important international language with characteristics of its own distinct from English, research on Chinese NER is far less extensive than on English NER, and much of it is not tailored to the characteristics of the Chinese language.
Contents of the Invention
In view of the above deficiencies in the prior art, the invention provides a named entity recognition method for Chinese culture and tourism, which addresses the fact that current named entity recognition research pays little attention to Chinese culture and tourism.
To achieve the above purpose, the invention adopts the following technical solution: a named entity recognition method for Chinese culture and tourism, comprising the following steps:
S1. Obtain Chinese culture and tourism text data and input it into a character embedding layer to obtain a character vector representation.
S2. Input the character vector representation into a bidirectional long short-term memory network layer to obtain a context representation.
S3. Input the context representation into a CNN layer to obtain a multi-scale local context feature fusion representation.
S4. Input the multi-scale local context feature fusion representation into a CRF layer, which performs sequence labeling to complete named entity recognition for Chinese culture and tourism.
Further, in S1, the character embedding layer comprises a ChineseBert module and a first CNN module in parallel;
S1 comprises the following sub-steps:
S11. Obtain Chinese culture and tourism text data.
S12. Input the Chinese culture and tourism text data into the ChineseBert module to obtain a word embedding vector representation of each character.
S13. Input the Chinese culture and tourism text data into the first CNN module to obtain a radical-level embedding representation.
S14. Concatenate the word embedding vector representation with the radical-level embedding representation to obtain the character vector representation.
Further, S12 specifically comprises:
inputting the Chinese culture and tourism text data into the ChineseBert module, which encodes the input text to obtain feature vectors, and generating the word embedding vector representation of each character from the feature vectors;
wherein the feature vectors comprise token embeddings, position embeddings, and segment embeddings.
Further, in S13, the radical-level embedding representation M2 is given by:
M2 = A1(b1 + C1(x))
where x is the radical-level feature of the Chinese characters, C1(·) is the first CNN module, A1 is the first activation function, and b1 is the bias of the first CNN module.
Further, in S14, the character vector representation Zconcat is given by:
Zconcat = M1 + M2
where M1 is the word embedding vector representation and, consistent with S14, the "+" here denotes concatenation of the two representations.
The beneficial effect of this further solution is that the character vector representation obtained by concatenating the word embedding vector representation and the radical-level embedding representation captures more semantic features, enabling the model to better recognize the Chinese meaning of the text.
Further, in S2, the bidirectional long short-term memory network layer comprises first to twelfth LSTM units, wherein the first to sixth LSTM units process the input character vector representation in the forward direction, and the seventh to twelfth LSTM units process it in the backward direction;
the context representation is obtained as follows:
the outputs of the first to twelfth LSTM units are concatenated to obtain the context representation.
Further, in S2, the context representation H is given by:
H = {h1, ..., hti, ..., hD}
where hti is the concatenation of the outputs of the first to twelfth LSTM units, ti is the concatenation index with ti = 1, ..., D, and D is the dimension of the character vector representation;
each of the first to twelfth LSTM units comprises an input gate it, an output gate ot, and a forget gate ft, given by:
it = σ(Wxi xt + Whi ht-1 + Wci ct-1 + bi)
ft = σ(Wxf xt + Whf ht-1 + Wcf ct-1 + bf)
ct = ft ⊙ ct-1 + it ⊙ tanh(Wxc xt + Whc ht-1 + bc)
ot = σ(Wxo xt + Who ht-1 + Wco ct + bo)
ht = ot ⊙ tanh(ct)
where σ(·) is the element-wise sigmoid function, tanh(·) is the hyperbolic tangent function, ⊙ denotes element-wise multiplication, Wxi, Whi, Wci, Wxf, Whf, Wcf, Wxc, Whc, Wxo, Who, and Wco are weight parameters, bi, bf, bc, and bo are bias parameters, ct is the memory cell, and ht is the output.
Further, in S3, the CNN layer is provided with a second CNN module, and the multi-scale local context feature fusion representation M3 is given by:
M3 = A2(b2 + C2(H))
where H is the context representation, C2(·) is the second CNN module, A2 is the second activation function, and b2 is the bias of the second CNN module.
The beneficial effect of this further solution is that inputting the context representation into the second CNN module strengthens semantic correlations and generates the multi-scale local context feature fusion representation.
The beneficial effects of the invention are as follows: the named entity recognition method for Chinese culture and tourism provided by the invention addresses the lack of research attention to named entity recognition for Chinese tourism by building a network specifically for Chinese culture and tourism text data. In the character embedding layer, the first CNN module learns a radical-level embedding representation tailored to Chinese, yielding a character vector representation beneficial to Chinese recognition; in the CNN layer, the second CNN module learns a multi-scale local context feature fusion representation, strengthening semantic correlations and further improving the feature representation beneficial to Chinese recognition.
Description of Drawings
Fig. 1 is a flow chart of the named entity recognition method for Chinese culture and tourism of the present invention.
Fig. 2 is a schematic diagram of the overall network structure of the present invention.
Fig. 3 is a schematic structural diagram of the ChineseBert module of the present invention.
Fig. 4 is a schematic structural diagram of the first CNN module of the present invention.
Fig. 5 is a schematic structural diagram of the second CNN module of the present invention.
Detailed Description
Specific embodiments of the present invention are described below so that those skilled in the art can understand the invention, but it should be clear that the invention is not limited to the scope of the specific embodiments. For those of ordinary skill in the art, all changes within the spirit and scope of the invention as defined and determined by the appended claims are obvious, and all inventions and creations using the concept of the present invention fall within its protection.
As shown in Fig. 1, in one embodiment of the present invention, a named entity recognition method for Chinese culture and tourism comprises the following steps:
S1. Obtain Chinese culture and tourism text data and input it into a character embedding layer to obtain a character vector representation.
S2. Input the character vector representation into a bidirectional long short-term memory network layer to obtain a context representation.
S3. Input the context representation into a CNN layer to obtain a multi-scale local context feature fusion representation.
S4. Input the multi-scale local context feature fusion representation into a CRF layer, which performs sequence labeling to complete named entity recognition for Chinese culture and tourism.
In this embodiment, the invention provides a named entity recognition method for Chinese culture and tourism based on the fused representation of radical-level features and multi-scale local context features, targeting the characteristics of Chinese characters and applicable to culture and tourism data; the specific network structure is shown in Fig. 2.
In S1, the character embedding layer comprises a ChineseBert module and a first CNN module in parallel;
S1 comprises the following sub-steps:
S11. Obtain Chinese culture and tourism text data.
S12. Input the Chinese culture and tourism text data into the ChineseBert module to obtain a word embedding vector representation of each character.
S13. Input the Chinese culture and tourism text data into the first CNN module to obtain a radical-level embedding representation.
S14. Concatenate the word embedding vector representation with the radical-level embedding representation to obtain the character vector representation.
In this embodiment, the structure of the ChineseBert module is shown in Fig. 3. The ChineseBert module is a model pre-trained on Chinese corpora, designed specifically for processing Chinese text data.
S12 specifically comprises:
inputting the Chinese culture and tourism text data into the ChineseBert module, which encodes the input text to obtain feature vectors, and generating the word embedding vector representation of each character from the feature vectors;
wherein the feature vectors comprise token embeddings, position embeddings, and segment embeddings.
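In BERT-style encoders, the token, position, and segment embeddings named above are summed per character to form the input feature vector. A minimal numpy sketch of that combination follows; the table sizes, random initialisation, and plain summation are illustrative assumptions, not the pre-trained ChineseBert weights:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, max_len, n_segments, d = 100, 16, 2, 8

# Embedding tables (randomly initialised here; in practice these come
# from the pre-trained model).
token_table = rng.normal(size=(vocab_size, d))
position_table = rng.normal(size=(max_len, d))
segment_table = rng.normal(size=(n_segments, d))

def embed(token_ids, segment_ids):
    """Sum token, position, and segment embeddings for each character."""
    t = len(token_ids)
    tok = token_table[token_ids]          # (t, d)
    pos = position_table[np.arange(t)]    # (t, d)
    seg = segment_table[segment_ids]      # (t, d)
    return tok + pos + seg                # (t, d) per-character feature vectors

M1 = embed([5, 17, 42], [0, 0, 0])        # three characters, one segment
print(M1.shape)  # (3, 8)
```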
In S13, the radical-level embedding representation M2 is given by:
M2 = A1(b1 + C1(x))
where x is the radical-level feature of the Chinese characters, C1(·) is the first CNN module, A1 is the first activation function, and b1 is the bias of the first CNN module.
In this embodiment, a CNN is used to compute a radical-level representation of the input Chinese culture and tourism text data, yielding the radical-level embedding representation; the structure of the first CNN module computing the radical representation is shown in Fig. 4.
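The formula M2 = A1(b1 + C1(x)) can be sketched as a 1-D convolution over a character's radical-level features followed by an activation. In the sketch below, the ReLU activation, kernel width, and max-pooling to a fixed-size vector are illustrative assumptions not fixed by the patent:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv1d(x, W, b):
    """Valid 1-D convolution over a radical sequence.
    x: (seq_len, in_dim), W: (k, in_dim, out_dim), b: (out_dim,)."""
    k, _, out_dim = W.shape
    n = x.shape[0] - k + 1
    out = np.empty((n, out_dim))
    for i in range(n):
        out[i] = np.tensordot(x[i:i + k], W, axes=([0, 1], [0, 1])) + b
    return out

rng = np.random.default_rng(1)
radical_dim, out_dim, k = 6, 8, 3
x = rng.normal(size=(5, radical_dim))          # radical-level features of 5 components
W = rng.normal(size=(k, radical_dim, out_dim))
b1 = np.zeros(out_dim)

# M2 = A1(b1 + C1(x)), with ReLU standing in for A1
M2_seq = relu(conv1d(x, W, b1))
M2 = M2_seq.max(axis=0)                         # max-pool over positions -> fixed-size vector
print(M2.shape)  # (8,)
```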
In S14, the character vector representation Zconcat is given by:
Zconcat = M1 + M2
where M1 is the word embedding vector representation and, consistent with S14, the "+" here denotes concatenation of the two representations.
The character vector representation obtained by concatenating the word embedding vector representation and the radical-level embedding representation captures more semantic features, enabling the model to better recognize the Chinese meaning of the text.
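Since S14 defines the combining operation as concatenation, the step can be sketched as follows (the vector values are purely illustrative):

```python
import numpy as np

M1 = np.array([0.1, 0.2, 0.3, 0.4])   # word embedding vector of one character (illustrative)
M2 = np.array([0.5, 0.6])             # radical-level embedding of the same character

# Z_concat = M1 "+" M2, where "+" denotes concatenation per step S14
Z_concat = np.concatenate([M1, M2])
print(Z_concat.shape)  # (6,)
```

The concatenated vector keeps both sources of information side by side, rather than mixing them additively.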
In S2, the bidirectional long short-term memory network layer comprises first to twelfth LSTM units, wherein the first to sixth LSTM units process the input character vector representation in the forward direction, and the seventh to twelfth LSTM units process it in the backward direction;
the context representation is obtained as follows:
the outputs of the first to twelfth LSTM units are concatenated to obtain the context representation.
In this embodiment, the context representation obtained by the bidirectional long short-term memory network layer improves the semantic representation from both the forward and backward directions, enabling better recognition of the semantics of a passage.
In S2, the context representation H is given by:
H = {h1, ..., hti, ..., hD}
where hti is the concatenation of the outputs of the first to twelfth LSTM units, ti is the concatenation index with ti = 1, ..., D, and D is the dimension of the character vector representation;
each of the first to twelfth LSTM units comprises an input gate it, an output gate ot, and a forget gate ft, given by:
it = σ(Wxi xt + Whi ht-1 + Wci ct-1 + bi)
ft = σ(Wxf xt + Whf ht-1 + Wcf ct-1 + bf)
ct = ft ⊙ ct-1 + it ⊙ tanh(Wxc xt + Whc ht-1 + bc)
ot = σ(Wxo xt + Who ht-1 + Wco ct + bo)
ht = ot ⊙ tanh(ct)
where σ(·) is the element-wise sigmoid function, tanh(·) is the hyperbolic tangent function, ⊙ denotes element-wise multiplication, Wxi, Whi, Wci, Wxf, Whf, Wcf, Wxc, Whc, Wxo, Who, and Wco are weight parameters, bi, bf, bc, and bo are bias parameters, ct is the memory cell, and ht is the output.
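The five gate equations above can be implemented directly as one recurrence step. The plain numpy sketch below follows them term by term; the full (rather than diagonal) peephole matrices Wci, Wcf, Wco and the small random initialisation are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the gated recurrence defined by the equations above."""
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ c_prev + p["bi"])
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ c_prev + p["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ c_t + p["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(2)
d_in, d_h = 4, 3
# Weight matrices acting on x have shape (d_h, d_in); the rest are (d_h, d_h).
p = {k: rng.normal(scale=0.1, size=(d_h, d_in if k[1] == "x" else d_h))
     for k in ["Wxi", "Whi", "Wci", "Wxf", "Whf", "Wcf",
               "Wxc", "Whc", "Wxo", "Who", "Wco"]}
p.update({k: np.zeros(d_h) for k in ["bi", "bf", "bc", "bo"]})

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a sequence of 5 character vectors
    h, c = lstm_step(x_t, h, c, p)
print(h.shape)  # (3,)
```

A bidirectional layer runs this recurrence once left to right and once right to left and concatenates the two hidden states per position.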
In S3, the CNN layer is provided with a second CNN module, and the multi-scale local context feature fusion representation M3 is given by:
M3 = A2(b2 + C2(H))
where H is the context representation, C2(·) is the second CNN module, A2 is the second activation function, and b2 is the bias of the second CNN module.
In this embodiment, the structure of the second CNN module is shown in Fig. 5. Inputting the context representation into the second CNN module strengthens semantic correlations and generates the multi-scale local context feature fusion representation.
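One common way to realise a multi-scale module such as C2(·) is to run several convolution branches with different kernel widths over the context representation H and concatenate their outputs. The kernel widths (2, 3, 5), ReLU activation, and output size below are illustrative assumptions, as the patent does not specify them:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv1d_same(H, W, b):
    """'Same'-padded 1-D convolution over the context sequence H: (T, d_in)."""
    k = W.shape[0]
    pad = k // 2
    Hp = np.pad(H, ((pad, k - 1 - pad), (0, 0)))
    return np.stack([np.tensordot(Hp[t:t + k], W, axes=([0, 1], [0, 1])) + b
                     for t in range(H.shape[0])])

rng = np.random.default_rng(3)
T, d_in, d_out = 6, 4, 5
H = rng.normal(size=(T, d_in))             # context representation from the BiLSTM

# M3 = A2(b2 + C2(H)): one branch per kernel width, outputs concatenated
branches = []
for k in (2, 3, 5):                        # multi-scale kernel widths (illustrative)
    W = rng.normal(scale=0.1, size=(k, d_in, d_out))
    b2 = np.zeros(d_out)
    branches.append(relu(conv1d_same(H, W, b2)))
M3 = np.concatenate(branches, axis=1)      # (T, 3 * d_out) fused representation
print(M3.shape)  # (6, 15)
```

Each branch captures local context at a different span, and concatenation fuses them into one per-position feature vector.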
The multi-scale local context feature fusion representation is input into the CRF layer, which completes the sequence labeling task and thereby the named entity recognition for Chinese culture and tourism.
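Sequence labeling with a CRF layer is typically decoded with the Viterbi algorithm over per-position emission scores and tag-transition scores. A minimal sketch follows; the toy BIO tags and hand-set scores are illustrative, and in practice the transition matrix is learned during training:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Most likely tag sequence given per-position and transition scores.
    emissions: (T, n_tags), transitions: (n_tags, n_tags)."""
    T, n = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag i at t-1 then tag j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # follow back-pointers
        best.append(int(back[t][best[-1]]))
    return best[::-1]

# Toy BIO tagging: tags 0=O, 1=B, 2=I; the transition O -> I is penalised
trans = np.zeros((3, 3))
trans[0, 2] = -10.0
emis = np.array([[0.1, 2.0, 0.0],          # position favours B
                 [0.0, 0.1, 1.5],          # position favours I
                 [1.0, 0.2, 0.3]])         # position favours O
print(viterbi_decode(emis, trans))         # [1, 2, 0]
```

The transition scores let the decoder enforce label-sequence constraints (such as I never directly following O) that per-position classification alone cannot.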
The beneficial effects of the invention are as follows: the named entity recognition method for Chinese culture and tourism provided by the invention addresses the lack of research attention to named entity recognition for Chinese tourism by building a network specifically for Chinese culture and tourism text data. In the character embedding layer, the first CNN module learns a radical-level embedding representation tailored to Chinese, yielding a character vector representation beneficial to Chinese recognition; in the CNN layer, the second CNN module learns a multi-scale local context feature fusion representation, strengthening semantic correlations and further improving the feature representation beneficial to Chinese recognition.
In describing the present invention, it should be understood that terms indicating orientation or positional relationships, such as "center", "thickness", "upper", "lower", "horizontal", "top", "bottom", "inner", "outer", and "radial", are based on the orientations or positional relationships shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the referred devices or elements must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be construed as limiting the invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of technical features; thus, a feature defined by "first", "second", or "third" may explicitly or implicitly include one or more such features.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310560194.XA CN116579343B (en) | 2023-05-17 | 2023-05-17 | A named entity recognition method for Chinese culture and tourism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310560194.XA CN116579343B (en) | 2023-05-17 | 2023-05-17 | A named entity recognition method for Chinese culture and tourism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116579343A true CN116579343A (en) | 2023-08-11 |
CN116579343B CN116579343B (en) | 2024-06-04 |
Family
ID=87543867
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310560194.XA Active CN116579343B (en) | 2023-05-17 | 2023-05-17 | A named entity recognition method for Chinese culture and tourism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116579343B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021114745A1 (en) * | 2019-12-13 | 2021-06-17 | 华南理工大学 | Named entity recognition method employing affix perception for use in social media |
US20210216862A1 (en) * | 2020-01-15 | 2021-07-15 | Beijing Jingdong Shangke Information Technology Co., Ltd. | System and method for semantic analysis of multimedia data using attention-based fusion network |
CN113408289A (en) * | 2021-06-29 | 2021-09-17 | 广东工业大学 | Multi-feature fusion supply chain management entity knowledge extraction method and system |
CN114118099A (en) * | 2021-11-10 | 2022-03-01 | 浙江工业大学 | A Chinese automatic question answering method based on radical features and multi-layer attention mechanism |
CN114781380A (en) * | 2022-03-21 | 2022-07-22 | 哈尔滨工程大学 | Chinese named entity recognition method, equipment and medium fusing multi-granularity information |
CN115455955A (en) * | 2022-10-18 | 2022-12-09 | 昆明理工大学 | Chinese named entity recognition method based on local and global character representation enhancement |
CN115600597A (en) * | 2022-10-18 | 2023-01-13 | 淮阴工学院(Cn) | Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium |
CN115688782A (en) * | 2022-10-26 | 2023-02-03 | 成都理工大学 | Named entity recognition method based on global pointer and countermeasure training |
- 2023-05-17: CN application CN202310560194.XA filed; granted as patent CN116579343B (en), status: Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021114745A1 (en) * | 2019-12-13 | 2021-06-17 | 华南理工大学 | Named entity recognition method employing affix perception for use in social media |
US20210216862A1 (en) * | 2020-01-15 | 2021-07-15 | Beijing Jingdong Shangke Information Technology Co., Ltd. | System and method for semantic analysis of multimedia data using attention-based fusion network |
CN113408289A (en) * | 2021-06-29 | 2021-09-17 | 广东工业大学 | Multi-feature fusion supply chain management entity knowledge extraction method and system |
CN114118099A (en) * | 2021-11-10 | 2022-03-01 | 浙江工业大学 | A Chinese automatic question answering method based on radical features and multi-layer attention mechanism |
CN114781380A (en) * | 2022-03-21 | 2022-07-22 | 哈尔滨工程大学 | Chinese named entity recognition method, equipment and medium fusing multi-granularity information |
CN115455955A (en) * | 2022-10-18 | 2022-12-09 | 昆明理工大学 | Chinese named entity recognition method based on local and global character representation enhancement |
CN115600597A (en) * | 2022-10-18 | 2023-01-13 | 淮阴工学院(Cn) | Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium |
CN115688782A (en) * | 2022-10-26 | 2023-02-03 | 成都理工大学 | Named entity recognition method based on global pointer and countermeasure training |
Non-Patent Citations (1)
Title |
---|
CHANG Yan et al.: "A negative-sample recommendation method combining path semantics and feature extraction", Journal of Chinese Computer Systems (小型微型计算机系统) *
Also Published As
Publication number | Publication date |
---|---|
CN116579343B (en) | 2024-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110765775B (en) | A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences | |
Alwehaibi et al. | Comparison of pre-trained word vectors for arabic text classification using deep learning approach | |
Yu et al. | An attention mechanism and multi-granularity-based Bi-LSTM model for Chinese Q&A system | |
Xu et al. | An overview of deep generative models | |
Mehmood et al. | A precisely xtreme-multi channel hybrid approach for roman urdu sentiment analysis | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN107423284A (en) | Merge the construction method and system of the sentence expression of Chinese language words internal structural information | |
Mahadevkar et al. | Exploring AI-driven approaches for unstructured document analysis and future horizons | |
Patel et al. | Deep learning for natural language processing | |
Alsaaran et al. | Arabic named entity recognition: A BERT-BGRU approach | |
CN112541356A (en) | Method and system for recognizing biomedical named entities | |
CN110427608A (en) | A kind of Chinese word vector table dendrography learning method introducing layering ideophone feature | |
CN114065761A (en) | Recognition method of Chinese nomenclature based on lexical enhancement | |
Salima et al. | An ontology-based approach to enhance explicit aspect extraction in standard Arabic reviews | |
Shu et al. | Investigating lstm with k-max pooling for text classification | |
CN112784602A (en) | News emotion entity extraction method based on remote supervision | |
CN109190112B (en) | Patent classification method, system and storage medium based on dual-channel feature fusion | |
Chandrasekaran et al. | Sarcasm Identification in text with deep learning models and Glove word embedding | |
CN116579343B (en) | A named entity recognition method for Chinese culture and tourism | |
CN117688152A (en) | A data legal question and answer system, electronic device and platform | |
CN114818711B (en) | Multi-information fusion named entity recognition method based on neural network | |
Netisopakul et al. | A survey of Thai knowledge extraction for the semantic web research and tools | |
CN116976355A (en) | Image-text mode-oriented self-adaptive Mongolian emotion analysis method | |
Tachicart et al. | Effective techniques in lexicon creation: Moroccan arabic focus | |
Sun et al. | Chinese microblog sentiment classification based on deep belief nets with extended multi-modality features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |