CN113935324B - Cross-border national culture entity identification method and device based on word set feature weighting - Google Patents
- Publication number
- CN113935324B (application CN202111068293A)
- Authority
- CN
- China
- Prior art keywords
- cross
- word
- word set
- border
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F40/279 — Recognition of textual entities
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295 — Named entity recognition
- G06F40/216 — Parsing using statistical methods
- G06F40/242 — Dictionaries
- G06F40/30 — Semantic analysis
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Learning methods
Abstract
The invention relates to a cross-border national culture entity recognition method and device based on word-set feature weighting, belonging to the technical field of natural language processing. Targeting the characteristics of cross-border national culture entities, the method comprises four parts: cross-border national culture entity data labeling and preprocessing; cross-border national culture text feature representation fused with word-set feature information; a cross-border national culture entity recognition model based on word-set feature weighting; and cross-border national culture entity recognition. A cross-border national culture entity recognition device built from these four functional modules performs entity recognition on input sentences.
Description
Technical Field
The invention relates to a cross-border ethnic culture entity recognition method and device based on word set characteristic weighting, and belongs to the technical field of natural language processing.
Background
Information extraction comprises entity recognition, relation extraction and event extraction. Entity recognition is a basic task of information extraction: it must determine entity boundaries and classify entities into predefined entity types. Mining cross-border national culture entities helps expand the domain knowledge graph and supports information retrieval. Using entity recognition technology to automatically label entities related to cross-border national culture from the Internet shortens the time researchers spend manually extracting and processing information. Integrating lexical features into the entity recognition model addresses the problem of ambiguous entity boundaries in cross-border national culture text: fusing word-set features into the model achieves good results, alleviates the problem of fuzzy domain-word boundaries, and enriches the representation of text semantic information. Cross-border national culture entities are usually formed by combining domain vocabulary describing national cultural characteristics, and a large number of domain words exist in cross-border national culture data; for example, the domain word "Sang Kan" is another name for the "Water-Splashing Festival" rather than for "Mayi", and such domain words belong to festival-type entities. An entity recognition method fused with word-set information can recognize such entities well.
Disclosure of Invention
The invention provides a cross-border national culture entity recognition method and device based on word-set feature weighting, aiming to improve recognition of cross-border national culture entities with fuzzy boundaries and to enhance the representation of cross-border national culture text by fusing word-set information.
The technical scheme of the invention is as follows: in a first aspect, a cross-border national culture entity recognition method based on word-set feature weighting comprises the following specific steps:
step1, marking and preprocessing the cross-border national culture entity data: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
Due to the lack of entity data sets in the cross-border national culture field, six entity types (including diet, festival and custom) are defined by analyzing the large number of domain entities in cross-border national culture data, and 15,717 cross-border national culture sentences with entity labels are manually annotated. This data set provides good support for entity recognition model training.
Step2, cross-border national culture text feature representation of the feature information of the merged word set: acquiring a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for acquiring word set information and integrating the word set information into character vector representation;
A cross-border national culture entity is usually a combination of domain words describing national cultural characteristics, such as the "Meng Yong soil pot" in dietary culture. Because word sets contain word-boundary and word-meaning information, corresponding rules are formulated to match a cross-border national culture domain dictionary and obtain four word sets; a word-set feature weighting method and positional encoding are proposed to obtain word-set information and enhance the semantic information of cross-border national culture features.
Step3, training a cross-border national culture entity recognition model based on word set characteristic weighting; extracting the characteristics of the context of the cross-border national sentences by using the thought of a bidirectional gating circulation unit, and performing entity recognition model training based on word set characteristic weighting by adopting optimal entity label probability calculation;
To let the model obtain contextual semantic information of the cross-border national culture text (for example, the vector representation of the sentence "Dai grass grilled fish is a special food" needs to be associated with the context word "citronella"), and to address the word-dependence problem of combined features, the idea of a bidirectional gated recurrent unit is adopted to extract features of the sentence context, and the entity recognition model based on word-set feature weighting is trained with optimal entity-label probability calculation.
Step4, cross-border ethnic culture entity recognition: by using the trained cross-border national culture entity recognition model, the cross-border national culture entity recognition is carried out after data preprocessing is carried out on the input text.
As a preferable scheme of the invention, the Step1 comprises the following specific steps:
Step1.1, obtain cross-border national culture data from cross-border national culture websites and preprocess it (deduplication, special-character filtering). Every cross-border national culture sentence is annotated with its entity labels; for example, the sentence "The Dai nationality has many unique glutinous rice products: such as fragrant bamboo rice, glutinous rice, rice cake, etc." is manually annotated with the entities "bamboo rice-diet culture, glutinous rice-diet culture, metrorrhagia-diet culture, thousand-layer rice cake-diet culture". In this way, 15,717 cross-border national culture sentences with entity labels were manually annotated; the domain entity types comprise position, festival, diet, custom, literature-and-art, and architecture. The entity-type analysis is shown in Table 1:
TABLE 1 Cross-border ethnic culture entity type analysis
There are several specifications for character-level entity labels, such as "BIO" and "BMESO". Because most entities in the cross-border national culture field are composed of combined features, each cross-border national culture sentence is separated from its labels by character segmentation, and each character is labeled with the "BMESO" scheme, where B marks the entity start position, M an entity-internal position, E the entity end position, S a single-character entity, and O a non-entity character. For example, in a sentence beginning with the Dai folk-custom entity "赕佛" (Dan Fo), the corresponding label sequence is "B-XS E-XS O O O O O O O O O", where B-XS is the start label and E-XS the end label of a custom-type entity, and O is the non-entity label. The defined cross-border national culture entity label format is shown in Table 2:
TABLE 2 Cross-border ethnic culture entity label format
| Entity name | Entity type | Entity tag |
| --- | --- | --- |
| Ruili | Position | B-WZ/E-WZ |
| Water-Splashing Festival | Festival culture | B-JR/M-JR/E-JR |
| Canarium album (Chinese olive) | Diet culture | B-YS/M-YS/E-YS |
| 赕佛 (Dan Fo) | Custom culture | B-XS/E-XS |
| Hand-waving dance | Literature-and-art culture | B-WY/M-WY/E-WY |
| Soil palm house | Architecture culture | B-JZ/M-JZ/E-JZ |
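The BMESO segmentation described above can be sketched as follows, a minimal Python illustration in which `bmeso_tags` is a hypothetical helper and the type codes follow Table 2:

```python
def bmeso_tags(entity_chars, type_code):
    """Emit the BMESO label sequence for one entity span."""
    n = len(entity_chars)
    if n == 1:
        return ["S-" + type_code]            # single-character entity
    tags = ["B-" + type_code]                # entity start position
    tags += ["M-" + type_code] * (n - 2)     # entity-internal positions
    tags.append("E-" + type_code)            # entity end position
    return tags

# Festival entity "Water-Splashing Festival" (three characters) -> B/M/E
print(bmeso_tags(["po", "shui", "jie"], "JR"))   # ['B-JR', 'M-JR', 'E-JR']
```

Non-entity characters in the sentence simply receive the label "O".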
Step1.2, deduplicate and filter special characters from the cross-border national culture data and construct a cross-border national culture domain dictionary, which is later used to obtain word-set information and to enhance sentence semantic information. The domain dictionary is trained and constructed from domain words combined with cross-border national culture data collected from the web, and contains words related to festivals, architecture, customs, diet, positions and literature-and-art in cross-border national culture, such as "Nolat" (literature-and-art), "Bei river county" (position), "south-oriented palace" (architecture), "curry crab mango scented rice" (diet), "bathing-Buddha ceremony" (custom), and "Ji Xia Jie" (festival).
Step1.3, use a pre-trained language model to produce character vector representations of the cross-border national culture text: the characters are specially tokenized and fed into a Transformer Encoder layer to obtain a vector for each character of the input text. For example, after the three Embedding components are added position-wise, the text "Dai peacock dance" (傣族孔雀舞) is represented as E = {c_[CLS], c_傣, c_族, c_孔, c_雀, c_舞, c_[SEP]}, where c_[CLS] and c_[SEP] are special token vectors. The cross-border national culture sentence is regarded as a character sequence S = {c_1, c_2, …, c_n} ∈ V_c, where V_c is a character-level vocabulary and c_i is the i-th character of the sentence S of length n. The pre-trained language model computes a vector representation for each character c_i of the cross-border national culture entity:

Q = c_i × W^Q, K = c_i × W^K, V = c_i × W^V,

Attention(Q, K, V) = Softmax(QK^T / √d_k) V,

g_i = Attention(Q, K, V).

where W^Q, W^K, W^V are weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation. The final output of the Transformer Encoder is obtained through a series of normalization and linear layers; by applying this process to every character in the text, character vectors for the cross-border national culture text are generated dynamically.
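The per-character attention step above can be sketched numerically as follows (a toy NumPy sketch of scaled dot-product attention over a 7-token character sequence; the dimensions and random weights are illustrative stand-ins, not the patent's trained parameters):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise Softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 7, 16                                 # [CLS] + 5 characters + [SEP]
C = rng.normal(size=(n, d))                  # character embeddings c_i
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
G = attention(C @ W_Q, C @ W_K, C @ W_V)     # g_i for every character
print(G.shape)                               # (7, 16)
```

Each row of G is one character vector g_i conditioned on the whole input sequence.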
As a preferable scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, cross-border national culture domain word-set matching: word sets are obtained by matching the characters of the cross-border national culture text against the domain dictionary, and four word sets are formed according to the character positions. The domain dictionary contains word-boundary information and cross-border national culture semantic information, and character matching retains the boundary and semantic information of the matched words. A character c_i is matched against the domain dictionary to obtain words, which are divided into four word-set types according to the position of the character within the matched word: the character at the head of a word (B), inside a word (M), at the tail of a word (E), or as a single-character word (S). For example, for the dietary entity "crisp beef jerky", the word sets matched for the character "cattle" are B = {beef, beef jerky}, M = {crisp beef jerky, crisp beef}, E = {crisp beef}, S = {cattle}; for the dietary entity "pineapple purple rice", the word sets matched for the character "rice" are B = {rice}, M = {pineapple purple rice, purple rice}, E = {purple rice}, S = {rice}.
For a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word sets matched for character c_i are:

B(c_i) = {w | w = c_i c_{i+1} … c_k ∈ V_w, i < k ≤ n}
M(c_i) = {w | w = c_j … c_i … c_k ∈ V_w, 1 ≤ j < i < k ≤ n}
E(c_i) = {w | w = c_j … c_i ∈ V_w, 1 ≤ j < i}
S(c_i) = {c_i | c_i ∈ V_w}

where V_w is the pre-constructed domain dictionary, w is a word in the domain dictionary, i is the position of the character, j and k are positions on the two sides of the character, and n is the number of characters in the sentence.
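The four matching rules above can be sketched as a brute-force dictionary match (a minimal Python sketch; the sentence and lexicon are toy stand-ins for a real sentence and the domain dictionary V_w):

```python
def match_word_sets(sentence, lexicon):
    """For each character c_i, collect dictionary words grouped by the
    character's position in the matched word: B(egin), M(iddle), E(nd), S(ingle)."""
    n = len(sentence)
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in range(n)]
    for j in range(n):                        # word start
        for k in range(j, n):                 # word end (inclusive)
            w = sentence[j:k + 1]
            if w not in lexicon:
                continue
            if j == k:
                sets[j]["S"].add(w)           # single-character word
            else:
                sets[j]["B"].add(w)           # c_j begins w
                sets[k]["E"].add(w)           # c_k ends w
                for i in range(j + 1, k):
                    sets[i]["M"].add(w)       # interior characters
    return sets

sets = match_word_sets("abc", {"ab", "abc", "b", "bc"})
print(sets[1])   # {'B': {'bc'}, 'M': {'abc'}, 'E': {'ab'}, 'S': {'b'}}
```

In practice a trie over V_w would avoid scanning all substrings, but the output word sets are the same.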
Step2.2, obtaining word-set vectors: the frequency of each word is counted over the data set; because word frequency reflects the importance of a word, a weighting method assigns each of the four types of word vectors its corresponding frequency. The frequencies of the words matched by the characters of the cross-border national culture text are fused into the word vectors, and the weighted word vectors within each type are combined to obtain the word-set vector of that type:

v_i(L) = (1/Z) Σ_{w ∈ L} z(w) e(w),  Z = Σ_{w ∈ L} z(w)

where z(w_i) is the frequency of word w_i counted in the data set, e(w_i) is the word-vector representation of w_i with dimension d_w = 50, L is one of the four types {B, M, E, S}, and v_i(L) is a word-set vector of dimension 1 × d_w.
Step2.3, word-set feature weighting to obtain the importance of each word-set vector: the word-set vectors v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)} are computed per type, so only the weights of word vectors within each type have been considered. To fully account for the relative importance of the four types of word-set vectors, a word-set feature weighting method computes the importance among the word-set vectors, so that the more important word-set vectors receive more weight. Using the word-set vectors v_i(L) obtained in Step2.2, a weight matrix W_v is trained by the neural network, and the final weight vector is output by the Softmax function:

V_i = W_v [v_i(B); v_i(M); v_i(E); v_i(S)] + b_v
α_i = Softmax(V_i)

where W_v is a training parameter of dimension 1 × d_w with d_w = 50, b_v is a bias of dimension 1 × 4, and Softmax is the normalization operation. The result is a weight vector α_i of dimension 1 × 4 with values in the range (0, 1).
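Steps 2.2 and 2.3 can be sketched together as follows (a NumPy sketch under the assumption that each set vector is a frequency-weighted average of its word vectors; the toy frequencies, embeddings, and learned parameters W_v and b_v are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d_w = 50                                           # word-vector dimension

def set_vector(words, freq, emb):
    """Step2.2: frequency-weighted combination of the word vectors in one set."""
    if not words:
        return np.zeros(d_w)                       # empty set -> zero vector
    z = np.array([freq[w] for w in words], dtype=float)
    E = np.stack([emb[w] for w in words])
    return (z / z.sum()) @ E                       # 1 x d_w set vector

def set_weights(set_vecs, W_v, b_v):
    """Step2.3: Softmax over learned scores -> importance weights alpha_i."""
    scores = np.array([v @ W_v for v in set_vecs]) + b_v
    e = np.exp(scores - scores.max())
    return e / e.sum()

words = {"B": ["beef", "beef jerky"], "M": ["crisp beef jerky"], "E": [], "S": ["cattle"]}
freq = {"beef": 5, "beef jerky": 2, "crisp beef jerky": 1, "cattle": 3}
emb = {w: rng.normal(size=d_w) for w in freq}
vecs = [set_vector(words[L], freq, emb) for L in ("B", "M", "E", "S")]
alpha = set_weights(vecs, rng.normal(size=d_w), rng.normal(size=4))
```

The four entries of `alpha` are positive and sum to one, matching the (0, 1) range stated above.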
Step2.4, positional encoding to enhance position information: character positions in the cross-border national culture text carry word-boundary information, and the matched words differ according to the position of the character, so a positional code is added to each word-set vector. The four types of word-set vectors are distinguished by the character position; the four position types are represented as one-hot vectors, and the word-set vectors fused with positional codes are:

v_i(B) = p_i(B) W_L + v_i(B)
v_i(M) = p_i(M) W_L + v_i(M)
v_i(E) = p_i(E) W_L + v_i(E)
v_i(S) = p_i(S) W_L + v_i(S)

where p_i(B) = [1,0,0,0], p_i(M) = [0,1,0,0], p_i(E) = [0,0,1,0], p_i(S) = [0,0,0,1], and W_L is a training parameter of dimension 4 × d_w with d_w = 50.
Step2.5, fusing word-set information into the character representation: to retain as much domain-dictionary information as possible, each character vector is combined with the character's four weighted word-set vectors into one feature vector, which together form the final representation of the character:

e_i(B, M, E, S) = [α_i1 v_i(B); α_i2 v_i(M); α_i3 v_i(E); α_i4 v_i(S)],
x_i = [g_i; e_i(B, M, E, S)].

where [α_i1, α_i2, α_i3, α_i4] = α_i is the weight vector, e_i(B, M, E, S) is the concatenation of the four weighted types, x_i is the feature vector fused with word-set information, and g_i is the character vector from Step1.3.
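Steps 2.4 and 2.5 can be sketched as follows (a NumPy sketch; the small dimensions and random W_L are illustrative, and the inputs g_i, v_i(L) and alpha_i stand for the quantities defined above):

```python
import numpy as np

rng = np.random.default_rng(2)
d_w, d_c = 6, 8                                # toy word / character dimensions
W_L = rng.normal(size=(4, d_w))                # position-code projection (4 x d_w)
P = {"B": 0, "M": 1, "E": 2, "S": 3}           # one-hot index per position type

def fuse(g_i, set_vecs, alpha):
    """Add position codes p_i(L) W_L, weight by alpha_i, concatenate with g_i."""
    parts = [g_i]
    for a, (L, v) in zip(alpha, set_vecs.items()):
        v = v + W_L[P[L]]                      # v_i(L) = p_i(L) W_L + v_i(L)
        parts.append(a * v)                    # alpha_iL * v_i(L)
    return np.concatenate(parts)               # x_i = [g_i; e_i(B,M,E,S)]

g_i = rng.normal(size=d_c)
set_vecs = {L: rng.normal(size=d_w) for L in "BMES"}
alpha = np.full(4, 0.25)                       # uniform weights for the demo
x_i = fuse(g_i, set_vecs, alpha)
print(x_i.shape)                               # (32,) = 8 + 4 * 6
```

Multiplying a one-hot p_i(L) by W_L just selects row L of W_L, which is what `W_L[P[L]]` does directly.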
As a preferable scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, to address the combined-feature word-dependence problem in the cross-border national culture text, the feature vector x_i fused with word-set information from Step2.5 is input to the reset gate and the update gate of a bidirectional gated recurrent unit (GRU). The reset gate controls how much past information is forgotten: it decides which past content is discarded and which is retained and combined with the input of the current time step. When r_i approaches 0, the previous hidden state is ignored and only the current input is used. The update gate decides how much information is passed to the next state, which allows the model to copy information from the previous state and reduces the risk of vanishing gradients. The reset gate and update gate are computed as:

r_i = σ(W_r · [x_i, h_{i-1}])
u_i = σ(W_u · [x_i, h_{i-1}])

where σ is the sigmoid activation function, x_i is the feature vector fused with word information, h_{i-1} is the hidden state at the previous time step, r_i is the reset gate, u_i is the update gate, and W_r, W_u are training parameters.

The new hidden state h_i is computed from the previous hidden state h_{i-1} and the current input x_i:

h̃_i = tanh(W_h · [x_i, r_i ⊙ h_{i-1}])
h_i = (1 − u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i

where W_h is a training parameter, tanh(·) is the activation function, and ⊙ denotes element-wise multiplication. The feature vector h_i produced by the bidirectional GRU encoding layer captures the long-term dependencies of the contextual information in the cross-border national culture text.
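One time step of the gated recurrent unit described above can be sketched as follows (a minimal NumPy sketch of the standard GRU equations; the weight shapes and random inputs are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_r, W_u, W_h):
    """r = sigma(W_r [x, h]), u = sigma(W_u [x, h]),
    candidate = tanh(W_h [x, r * h]), h = (1 - u) * h + u * candidate."""
    xh = np.concatenate([x, h_prev])
    r = sigmoid(W_r @ xh)                      # reset gate
    u = sigmoid(W_u @ xh)                      # update gate
    cand = np.tanh(W_h @ np.concatenate([x, r * h_prev]))
    return (1.0 - u) * h_prev + u * cand       # new hidden state h_i

rng = np.random.default_rng(3)
d_x, d_h = 10, 4
W_r, W_u, W_h = (rng.normal(size=(d_h, d_x + d_h)) for _ in range(3))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):            # run 5 characters through the cell
    h = gru_step(x, h, W_r, W_u, W_h)
print(h.shape)                                 # (4,)
```

A bidirectional encoder runs one such cell left-to-right and another right-to-left and concatenates the two hidden states per character.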
Step3.2, considering the dependencies among cross-border national culture entity labels avoids erroneous predictions: for example, the label sequence of the entity "Water-Splashing Festival" is "B-JR M-JR E-JR", and during training the unreasonable case of a diet-internal label "M-YS" following "B-JR" must be excluded. Optimal label probability calculation is therefore applied to the feature vectors, and the entity labels are predicted by the cross-border national culture entity recognition model:

P_i = W_p h_i + b_p,
s(X, y) = Σ_{i=1}^{n} (T_{y_{i-1}, y_i} + P_{i, y_i}),

where W_p, b_p are the parameters of the score matrix P, T is the transition matrix, and h_i is the output vector of Step3.1.
A self-attention mechanism is used to weight the extracted feature vectors by importance, enhancing useful features and suppressing less useful ones. For the feature vector h_i output by the bidirectional GRU encoder, the corresponding weights are computed with self-attention:

Q = h_i × W^Q, K = h_i × W^K, V = h_i × W^V,
head_i = Attention(Q, K, V) = Softmax(QK^T / √d_k) V.

where W^Q, W^K, W^V are weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation.

The self-attention mechanism reflects the relevance and importance of the feature vectors for cross-border national culture entity recognition: because different feature vectors influence entity recognition differently, each feature vector is assigned a corresponding weight, producing the final output vector head_i. Self-attention further sharpens the distinction of importance among the components of the feature vector, which benefits cross-border national culture entity recognition.
In a second aspect, an embodiment of the present invention further provides a cross-border ethnic cultural entity recognition apparatus weighted based on word set features, which includes modules for performing the method of the first aspect.
The beneficial effects of the invention are:
1. The invention fuses word-set information into the entity recognition model. The word sets obtained by matching characters against the domain dictionary contain entity boundary information, and using them enhances the semantic information of the cross-border national culture text, so the model achieves better results on cross-border national culture entity recognition.
2. The invention obtains the relative importance of the word-set vectors through word-set feature weighting, and uses positional encoding to strengthen the position information of the words matched by each character, making the word-set features and vectors richer. Fusing the word-set features into the character representation alleviates the entity-boundary ambiguity that causes recognition errors in purely character-based entity recognition.
Drawings
FIG. 1 is a word set information diagram based on word set feature weighting in the present invention;
FIG. 2 is a diagram illustrating exemplary word frequency statistics in the present invention;
FIG. 3 is a cross-border national culture entity recognition frame diagram based on word set feature weighting in the present invention;
fig. 4 is an overall flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1 to 4, in a first aspect, a method for recognizing a cross-border national culture entity based on word set feature weighting specifically includes the following steps:
step1, marking and preprocessing the cross-border national culture entity data: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
Due to the lack of entity data sets in the cross-border national culture field, six entity types (including diet, festival and custom) are defined by analyzing the large number of domain entities in cross-border national culture data, and 15,717 cross-border national culture sentences with entity labels are manually annotated. This data set provides good support for entity recognition model training.
Step2, cross-border national culture text characteristic representation of the characteristic information of the merged word set: obtaining a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for obtaining word set information and integrating the word set information into character vector representation;
A cross-border national culture entity is usually a combination of domain words describing national cultural characteristics, such as the "Meng Yong soil pot" in dietary culture. Because word sets contain word-boundary and word-meaning information, the invention formulates corresponding rules to match a cross-border national culture domain dictionary and obtain four word sets, and proposes a word-set feature weighting method and positional encoding to obtain word-set information and enhance the semantic information of cross-border national culture features.
Step3, training a cross-border national culture entity recognition model based on word set characteristic weighting; extracting the characteristics of the context of the cross-border national sentences by using the thought of a bidirectional gating circulation unit, and performing entity recognition model training based on word set characteristic weighting by adopting optimal entity label probability calculation;
in order to enable the model to obtain the contextual semantic information of the cross-border national culture text (for example, the vector representation of the specialty food 'Dai grass grilled fish' in a sentence needs to be associated with the context word 'citronella grass'), and aiming at the word dependence problem of combined features, the idea of a bidirectional gated recurrent unit is integrated into the invention to extract features of the cross-border national sentence context, and the entity recognition model based on word set feature weighting is trained by adopting optimal entity label probability calculation.
Step4, recognizing cross-border ethnic culture entities: by using the trained cross-border national culture entity recognition model, the cross-border national culture entity recognition is carried out after data preprocessing is carried out on the input text.
As a preferable scheme of the invention, the Step1 comprises the following specific steps:
step1.1, acquiring related cross-border national culture data on a cross-border national culture website, and manually marking 15717 cross-border national culture sentences with entity labels, wherein entity types are defined as 6 types: location, holiday culture, diet culture, custom culture, literary and artistic culture, and architectural culture; segmenting characters in the cross-border national culture sentence and corresponding labels, so that each character corresponds to one label, and the format of the corresponding entity label is shown in table 3:
TABLE 3 Cross-border ethnic culture entity label format
Entity name | Entity type | Entity label
---|---|---
Ruili | Location | B-WZ/E-WZ
Water-Splashing Festival | Festival culture | B-JR/M-JR/E-JR
Canarium album | Diet culture | B-YS/M-YS/E-YS
Wing 36181 | Custom culture | B-XS/E-XS
Hand dancing | Literary and artistic culture | B-WY/M-WY/E-WY
Soil palm room | Building culture | B-JZ/M-JZ/E-JZ
Step1.2, sentence semantic information is enhanced by constructing a cross-border national culture domain dictionary. The domain dictionary is trained and constructed from cross-border national culture data acquired on the network combined with domain words, and it contains words related to festivals, buildings, customs, diet, locations and literature and art in cross-border national culture, such as 'Nola dance' (literature and art), 'North river county' (location), 'Southeast Dagong' (building), 'curry crab mango rice' (diet), 'bathing Buddha ceremony' (custom) and 'Xia Jie' (festival).
Step1.3, a pre-training language model is adopted to perform character vector representation on the cross-border national culture text: the characters are specially processed and then input into a Transformer Encoder layer to obtain the vector representation of each character of the input text. For example, after the bitwise addition of the three Embedding components, the text 'Dai peacock dance' (傣族孔雀舞) is represented as E = {c_[CLS], c_傣, c_族, c_孔, c_雀, c_舞, c_[SEP]}, wherein c_[CLS] and c_[SEP] are special token vectors of the text. The final output of the Transformer Encoder is obtained through a series of normalization and linear processing. A cross-border national culture sentence is regarded as a character sequence S = {c_1, c_2, …, c_n} ∈ V_c, wherein V_c is a character-level vocabulary and c_i represents the i-th character in a sentence S of length n. The pre-training language model performs vector representation on each character c_i of the cross-border national culture entity:

Q = c_i × W_Q, K = c_i × W_K, V = c_i × W_V,

g_i = Attention(Q, K, V) = Softmax(QKᵀ/√d_k)V.

wherein W_Q, W_K, W_V represent weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation. By repeating this process for each character in the text, the dynamic generation of character vectors in the cross-border national culture text is realized.
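As a concrete illustration of the character-level self-attention above, the following Python sketch computes g_i = Attention(Q, K, V) for a toy character sequence; the dimensions, random weights and inputs are illustrative assumptions, not the pre-trained parameters of the actual model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(C, W_Q, W_K, W_V):
    """One self-attention head over character embeddings C (n x d)."""
    Q, K, V = C @ W_Q, C @ W_K, C @ W_V
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # n x n attention weights, rows sum to 1
    return A @ V                          # contextual character vectors g_i

rng = np.random.default_rng(0)
n, d = 5, 8                               # 5 characters, 8-dim toy embeddings
C = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
G = self_attention(C, W_Q, W_K, W_V)
print(G.shape)  # (5, 8)
```

In the actual model each g_i would come from the full Transformer Encoder (multi-head attention plus normalization and linear layers); this sketch shows only the single-head core.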
As a preferable scheme of the invention, the Step2 comprises the following specific steps:
step2.1, matching word sets in the cross-border national culture field: a word set is obtained by matching cross-border national culture characters against the domain dictionary, and four word sets are formed according to character position. The domain dictionary contains word boundary information and cross-border national culture text semantic information, and character matching preserves the boundary and semantic information of the matched words. Character c_i is matched with the domain dictionary to obtain different words, which are divided into four word set types according to the position of the character in the matched word: the character at the head of the word (B), the character inside the word (M), the character at the tail of the word (E), and a single character (S).
For a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word set matching rules for a character c_i are as follows:

B(c_i) = {w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n},
M(c_i) = {w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n},
E(c_i) = {w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i},
S(c_i) = {c_i | c_i ∈ V_w}.

wherein V_w represents the pre-constructed domain dictionary, w_{j,k} represents a word existing in the domain dictionary spanning positions j to k, i represents the position of the character, j, k represent positions on both sides of the character, and n represents the number of characters in the sentence.
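The four matching rules above can be sketched as a brute-force dictionary scan in Python; the toy dictionary entries below are hypothetical examples, not the actual domain dictionary:

```python
def match_word_sets(sentence, dictionary):
    """For each character c_i, collect dictionary words by the position
    c_i occupies in the match: Begin, Middle, End, or Single."""
    n = len(sentence)
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in range(n)]
    for j in range(n):
        for k in range(j, n):
            w = sentence[j:k + 1]
            if w not in dictionary:
                continue
            if j == k:
                sets[j]["S"].add(w)        # single-character word
            else:
                sets[j]["B"].add(w)        # c_j begins the word
                sets[k]["E"].add(w)        # c_k ends the word
                for m in range(j + 1, k):
                    sets[m]["M"].add(w)    # interior characters
    return sets

# toy domain dictionary (hypothetical entries)
dic = {"傣族", "孔雀舞", "孔雀", "舞"}
ws = match_word_sets("傣族孔雀舞", dic)
print(sorted(ws[2]["B"]))  # ['孔雀', '孔雀舞'] — '孔' begins two matched words
```

An aho-corasick automaton would replace the O(n²) scan for a large dictionary, but the per-character B/M/E/S bookkeeping is the same.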
Step2.2, acquiring word set vectors: as shown in fig. 2, the word frequency of each matched word is counted; because word frequency can represent the importance degree of a word, the frequencies are merged into the word vectors, and the word vectors within each of the four types are combined by a frequency-weighted sum to obtain the word set vector representation of each type:

v_i(L) = Σ_{w∈L} z(w)·e(w).

wherein z(w_i) is the word frequency of word w_i counted in the data set, e(w_i) is the word vector representation of w_i with dimension d_w = 50, L represents one of the four types {B, M, E, S}, and v_i(L) is a word set vector of dimension 1 × d_w.
Step2.3, word set feature weighting to obtain the importance degree among the word set vectors: the word set vector v_i(L) is obtained by weighting words within a single type only. In order to fully consider the importance degree among the four types of word set vectors, a word set feature weighting method is used: a weight matrix W_v is obtained through neural network training, and the final weight vector is then output through the Softmax function:

V_i = W_v[v_i(B); v_i(M); v_i(E); v_i(S)] + b_v,

α_i = Softmax(V_i).

wherein W_v is a training parameter of dimension 1 × d_w with d_w = 50, b_v is an offset of dimension 1 × 4, and the Softmax function is the normalization operation. Finally, a weight vector α_i of dimension 1 × 4 with value range (0, 1) is obtained.
Step2.4, position coding enhances position information: the character positions in the cross-border national culture text contain word boundary information, and the words matched differ according to the position of the character; therefore, position codes are added to the word set vectors so that the four types of word set vectors are distinguished according to character position. The four position types are vectorized, and the word set vectors fused with position codes are expressed as follows:
v i (B)=p i (B)W L +v i (B)
v i (M)=p i (M)W L +v i (M)
v i (E)=p i (E)W L +v i (E)
v i (S)=p i (S)W L +v i (S)
wherein p_i(B) = [1,0,0,0], p_i(M) = [0,1,0,0], p_i(E) = [0,0,1,0], p_i(S) = [0,0,0,1], and W_L is a 4 × d_w training parameter with d_w = 50.
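The position coding of Step2.4 reduces to adding one row of the trainable matrix W_L, selected by the one-hot vector p_i(L); a minimal sketch with random stand-in values:

```python
import numpy as np

d_w = 50
rng = np.random.default_rng(2)
W_L = rng.normal(size=(4, d_w))            # trainable position-type embedding matrix
p = {"B": np.array([1, 0, 0, 0]),
     "M": np.array([0, 1, 0, 0]),
     "E": np.array([0, 0, 1, 0]),
     "S": np.array([0, 0, 0, 1])}

def add_position_code(v_L, L):
    """v_i(L) <- p_i(L) W_L + v_i(L): the one-hot product selects the
    row of W_L for type L and adds it to the word set vector."""
    return p[L] @ W_L + v_L

v_B = rng.normal(size=d_w)
out = add_position_code(v_B, "B")
assert np.allclose(out, W_L[0] + v_B)      # one-hot @ W_L picks row 0
```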
Step2.5, word set information is merged into the character vector representation: in order to retain as much domain dictionary information as possible, the four weighted word set vectors corresponding to a character are combined into a feature vector, which is concatenated with the character vector to form the final representation of the character:
e i (B,M,E,S)=[α i1 v i (B);α i2 v i (M);α i3 v i (E);α i4 v i (S)],
x i =[g i ;e i (B,M,E,S)].
wherein [α_i1, α_i2, α_i3, α_i4] = α_i is the weight vector, e_i(B, M, E, S) represents the feature vector spliced from the four types, x_i represents the feature vector merged with word set information, and g_i is the character vector.
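Step2.5 is a pair of concatenations; the sketch below assumes a 768-dimensional character vector (a typical pre-trained encoder size — an assumption, since the patent does not state g_i's dimension) and toy values for everything else:

```python
import numpy as np

d_c, d_w = 768, 50
rng = np.random.default_rng(3)
g_i = rng.normal(size=d_c)                 # character vector from the encoder
v = {L: rng.normal(size=d_w) for L in ("B", "M", "E", "S")}
alpha = np.array([0.4, 0.1, 0.3, 0.2])     # weights from the Softmax step (toy)

# e_i(B,M,E,S): weighted set vectors spliced in B, M, E, S order
e_i = np.concatenate([alpha[k] * v[L] for k, L in enumerate(("B", "M", "E", "S"))])
x_i = np.concatenate([g_i, e_i])           # final character representation
print(x_i.shape)  # (968,)
```

The result is 768 + 4·50 = 968 dimensions per character, which is what the downstream GRU layer consumes.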
As a preferable scheme of the invention, the Step3 comprises the following specific steps:
step3.1, a bidirectional GRU is utilized to extract features from the cross-border national culture vector representation merged with word set information. The feature vector x_i merged with word set information is input into the reset gate and the update gate. The reset gate determines how much information needs to be forgotten and how much is retained to be combined with the input of the current time step; when r approaches 0, the previous hidden state is ignored and only the current input is used. The update gate decides how much information to pass to the next state, which allows the model to copy information from the previous state and reduces the risk of gradient vanishing. The reset gate and the update gate are expressed as follows:
r_i = σ(W_r · [x_i, h_{i-1}]),
u_i = σ(W_u · [x_i, h_{i-1}]).

wherein σ is the sigmoid activation function, x_i is the character vector merged with word information, h_{i-1} is the hidden state at the previous moment, r_i is the reset gate, u_i is the update gate, and W_r, W_u are training parameters.
In a bidirectional GRU, the new hidden state h_i is calculated from the previous hidden state h_{i-1} and the current input x_i:

h̃_i = tanh(W_h · [r_i ⊙ h_{i-1}, x_i]),
h_i = (1 − u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i.

wherein W_h is a training parameter and tanh(·) is the activation function. The feature vector h_i obtained from the bidirectional GRU coding layer captures the long-term dependency relationship between the contextual information in the cross-border national culture text.
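A single GRU step as described above can be sketched as follows; the dimensions and random parameters are illustrative, and a bidirectional encoder would run a second pass right-to-left and concatenate the two hidden states:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_i, h_prev, W_r, W_u, W_h):
    """One GRU step over the concatenation [x_i, h_prev]."""
    xh = np.concatenate([x_i, h_prev])
    r = sigmoid(W_r @ xh)                          # reset gate
    u = sigmoid(W_u @ xh)                          # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([x_i, r * h_prev]))  # candidate state
    return (1 - u) * h_prev + u * h_tilde          # new hidden state

d_x, d_h = 10, 6
rng = np.random.default_rng(4)
W_r, W_u, W_h = (rng.normal(size=(d_h, d_x + d_h), scale=0.1) for _ in range(3))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):                # run left-to-right over 5 characters
    h = gru_cell(x, h, W_r, W_u, W_h)
print(h.shape)  # (6,)
```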
Step3.2, a self-attention mechanism extracts the importance degree of adjacent feature vectors, enhancing useful features and weakening less useful ones. For the feature vector h_i output by the bidirectional GRU coding layer, the feature vector weights are calculated using the self-attention mechanism:

Q = h_i × W_Q, K = h_i × W_K, V = h_i × W_V,

head_i = Attention(Q, K, V) = Softmax(QKᵀ/√d_k)V.

wherein W_Q, W_K, W_V represent weight parameters, d_k = 50 is the dimension of the input feature vector, and Softmax is the normalization operation.
As a preferable scheme of the invention, the Step4 comprises the following specific steps:
step4.1, through the idea of global optimization, a globally optimal tag sequence is obtained by considering the dependency relationship among tags, which prevents error conditions such as the unreasonable case in which a 'diet' tag immediately follows a 'festival' tag.
For a character sequence s = {c_1, c_2, …, c_n} ∈ V_c in the cross-border national culture text, the probability of the corresponding predicted tag sequence y = {y_1, y_2, …, y_n} is calculated from the score

P_i = W_p · head_i + b_p,

score(s, y) = Σ_{i=1}^{n} (T_{y_{i-1}, y_i} + P_{i, y_i}).

wherein W_p, b_p are the parameters of the score matrix P, T is the transition matrix, and head_i is the output vector of Step3.2. In the final decoding stage of tag prediction, the Viterbi algorithm is adopted to predict the globally optimal tag sequence.
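The Viterbi decoding mentioned above can be sketched over toy emission scores P and transition matrix T; the scores are invented for illustration:

```python
import numpy as np

def viterbi(P, T):
    """P: n x L emission scores, T: L x L transition scores
    (T[a, b] = score of moving from tag a to tag b).
    Returns the highest-scoring tag sequence."""
    n, L = P.shape
    score = P[0].copy()
    back = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + T + P[i][None, :]   # L x L candidate scores
        back[i] = cand.argmax(axis=0)               # best previous tag per tag
        score = cand.max(axis=0)
    tags = [int(score.argmax())]                    # backtrace from the best end tag
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]

# toy scores: 3 characters, 2 tags; the transition matrix forbids tag 0 -> 0
P = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.2]])
T = np.array([[-10.0, 0.0], [0.0, 0.0]])
print(viterbi(P, T))  # [0, 1, 0]
```

Note the forbidden 0→0 transition forces the decoder away from the greedy per-position choice, which is exactly the effect the global optimization seeks.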
In order to illustrate the effect of the invention, the following comparative experiments are carried out; the experimental data adopted are all from the manually labeled cross-border national culture data set.
The evaluation indexes used are Precision (P), Recall (R) and the F1 value. They are calculated as follows:

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R).

wherein TP is the number of correctly identified entities, FP the number of spurious entities predicted, and FN the number of entities missed.
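The standard precision/recall/F1 computation can be sketched as follows; the entity counts are hypothetical:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from entity-level counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# e.g. 90 correctly predicted entities, 10 spurious, 5 missed
p, r, f1 = prf(90, 10, 5)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9 0.9474 0.9231
```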
In order to verify the effect of the cross-border national culture entity recognition model based on word set feature weighting, the following comparative test is designed for analysis. The method is compared with the Bi-LSTM+CRF, Lattice-LSTM, LR-CNN, FLAT and SoftLexicon(LSTM) entity recognition methods; the specific experimental results are shown in Table 4.
TABLE 4 comparative experiments of different methods
Name of method | P(%) | R(%) | F1(%)
---|---|---|---
Bi-LSTM+CRF | 83.59 | 91.52 | 87.38
Lattice-LSTM | 89.08 | 92.52 | 90.76
LR-CNN | 92.81 | 90.15 | 91.46
FLAT | 92.76 | 95.05 | 93.89
SoftLexicon(LSTM) | 90.68 | 93.39 | 92.01
The method of the invention | 95.56 | 94.01 | 94.72
Experiments show that, compared with the Bi-LSTM+CRF model, the method utilizes word set information to enhance the contextual semantic information of the text; compared with the Lattice-LSTM, LR-CNN, FLAT and SoftLexicon(LSTM) models, the method merges the cross-border national culture domain dictionary and adopts position coding to enhance word set position information, so that the word sets matched through characters are more complete.
Table 5 compares the effects on the experimental results when the model based on word set feature weighting is trained with position coding alone, word set feature weighting alone, or both merged together.
TABLE 5 influence of word set feature weighting and position coding on the model
Merged information | P(%) | R(%) | F1(%)
---|---|---|---
Merging position coding | 94.72 | 93.25 | 93.98
Merging word set feature weighting | 94.15 | 92.39 | 93.26
Merging position coding + word set feature weighting | 95.56 | 94.01 | 94.72
The experimental results show that merging different coding information affects the results. When only position coding is merged into the model, the F1 value is 0.74% lower than when both position coding and word set feature weighting are merged, verifying that word set feature weighting helps distinguish the importance degree among the four word set vectors; when only word set feature weighting is merged, the F1 value is 1.46% lower, showing that position coding enhances word set position information. When position coding and word set feature weighting are added simultaneously, the word set information can be acquired more fully, further improving the accuracy of cross-border national culture entity recognition.
The following is an embodiment of the system of the present invention, and an embodiment of the present invention further provides a cross-border national cultural entity recognition apparatus based on word set feature weighting, which includes an integration module for performing the method of the first aspect. The method specifically comprises the following steps:
the cross-border ethnic culture data preprocessing module: the method is used for labeling and preprocessing the data of the cross-border national culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
the cross-border national culture text feature representation module integrated with the word set feature information comprises: the method comprises the steps of obtaining a word set through cross-border national culture field dictionary matching, providing a word set characteristic weighting method and position information codes for obtaining word set information, and integrating the word set information into character vector representation;
a cross-border national culture entity recognition model training module based on word set characteristic weighting; the system is used for extracting the characteristics of the context of the cross-border national sentences by using the thought of a bidirectional gating circulation unit and performing entity recognition model training based on word set characteristic weighting by adopting optimal entity label probability calculation;
cross-border ethnic culture entity identification module: the cross-border national culture entity recognition method is used for performing cross-border national culture entity recognition after data preprocessing is performed on input texts by using a trained cross-border national culture entity recognition model.
In one possible implementation, the cross-border ethnic culture entity identification module is further configured to: and deploying the trained model to a local server side, converting the model into an application interface by the aid of a Sanic technology, directly calling the model through a webpage side, and outputting a predicted entity to a front-end interface for display.
In one possible implementation, the cross-border ethnic culture data preprocessing module is further configured to:
acquiring cross-border national culture data through a cross-border national culture website, performing de-duplication and special character filtering pretreatment on the data, and manually marking 15717 cross-border national culture sentences with entity labels, wherein the entities in the fields comprise positions, festivals, diet, customs, literature and buildings;
performing duplicate removal and special character filtering on the cross-border national culture data to construct a cross-border national culture field dictionary so as to obtain word set information later, training the cross-border national culture data acquired on the network by combining field words to obtain word vectors and construct a field dictionary, wherein the field dictionary contains words related to festivals, buildings, customs, diet, positions and literary and artistic activities in the cross-border national culture;
performing character vector representation on the cross-border national culture text by adopting a pre-training language model, processing the characters, and then inputting the characters into a Transformer Encoder layer to obtain vector representation of each character of the input text; and a final output of the Transformer Encoder is obtained through a series of normalization and linear processing, so that the dynamic generation of character vectors in the cross-border national culture text is realized.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (4)
1. The cross-border ethnic culture entity recognition method based on word set feature weighting is characterized by comprising the following steps of: the method comprises the following specific steps:
step1, marking and preprocessing the cross-border national culture entity data: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
step2, cross-border national culture text characteristic representation of the characteristic information of the merged word set: acquiring a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for acquiring word set information and integrating the word set information into character vector representation;
step3, training a cross-border ethnic culture entity recognition model based on word set characteristic weighting; extracting the characteristics of the context of the cross-border national sentences by using a bidirectional gating circulation unit, and performing entity recognition model training based on word set characteristic weighting by adopting optimal entity label probability calculation;
step4, recognizing cross-border ethnic culture entities: performing data preprocessing on the input text and performing cross-border national culture entity recognition by using a trained cross-border national culture entity recognition model;
the specific steps of Step2 are as follows:
step2.1, the word set is that all matched words are obtained from a dictionary through cross-border national culture characters, four word sets are formed according to the positions of the characters, and the words are divided into four word sets according to different positions of the characters in the matched words: the character is located at the head part B of the word, the character is located in the interior M of the word, the character is located at the tail part E of the word, and a single character is marked by four labels;
for a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word set matching rules for a character c_i are as follows:

B(c_i) = {w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n}, M(c_i) = {w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n}, E(c_i) = {w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i}, S(c_i) = {c_i | c_i ∈ V_w},

wherein V_w represents a pre-constructed domain dictionary, w_{j,k} represents a word existing in the domain dictionary spanning positions j to k, i represents the position of the character, j, k represent positions on both sides of the character, and n represents the number of characters in the sentence; V_c is a character-level vocabulary;
step2.2, vectorizing feature representation of the word sets: the word frequency of each word in the data set is counted; because the word frequency represents the importance degree of the word, the frequencies are assigned to the four types of word vectors by a weighting method:

v_i(L) = Σ_{w∈L} z(w)·e(w),

wherein z(w_i) is the word frequency of word w_i counted in the data set, e(w_i) is the word vector representation of w_i, L represents one of the four word set types B, M, E, S, and v_i(L) is a word set vector;
step2.3, in order to fully consider the importance degree among the four word set vectors, acquiring the importance degree among the word set vectors by using a word set characteristic weighting method, so that the important word set vectors can acquire more weights; using the word set vector v obtained by Step2.2 i (L)={v i (B),v i (M),v i (E),v i (S) } obtaining a weight matrix W through neural network training v The final weight vector is then output by the Softmax function:
V i =W v [v i (B);v i (M);v i (E);v i (S)]+b v ,α i =Softmax(V i ).
wherein W_v is a training parameter, b_v is the offset parameter of the neural network, the Softmax function is the normalization operation, and finally a weight vector α_i with value range (0, 1) is obtained;
Step2.4, word set information is merged into the character vector representation: in order to retain domain dictionary information, each character vector and four types of word set vectors corresponding to the character are combined into a feature vector, and the feature vector jointly form the final representation of the character:
e i (B,M,E,S)=[α i1 v i (B);α i2 v i (M);α i3 v i (E);α i4 v i (S)],
x i =[g i ;e i (B,M,E,S)].
wherein v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)}, [α_i1, α_i2, α_i3, α_i4] = α_i is the weight vector obtained in Step2.3, e_i(B, M, E, S) represents the feature vector spliced from the four types, x_i represents the feature vector merged with word set information, and g_i is the character vector in the cross-border national culture text.
2. The method for recognizing the trans-border national culture entity based on the word set feature weighting as claimed in claim 1, wherein: the specific steps of Step1 are as follows:
step1.1, acquiring cross-border national culture data through a cross-border national culture website, performing de-duplication and special character filtering pretreatment on the data, and then manually marking 15717 cross-border national culture sentences with entity labels, wherein the entities in the fields comprise positions, festivals, diet, customs, literature and buildings;
step1.2, performing duplicate removal and special character filtering on the cross-border national culture data to construct a cross-border national culture field dictionary so as to obtain word set information later, training the cross-border national culture data obtained on the network by combining field words to obtain word vectors and construct a field dictionary, wherein the field dictionary contains words related to festivals, buildings, customs, diet, positions and literature and art in the cross-border national culture;
step1.3, performing character vector representation on the cross-border national culture text by adopting a pre-training language model, processing characters, and then inputting the characters into a Transformer Encoder layer to obtain vector representation of each character of the input text; and a final output of the Transformer Encoder is obtained through a series of normalization and linear processing, so that the dynamic generation of character vectors in the cross-border national culture text is realized.
3. The cross-border ethnic culture entity recognition method based on word set feature weighting as claimed in claim 1, wherein: the specific steps of Step3 are as follows:
step3.1, aiming at the word dependence problem of combined features in the cross-border national culture text, the cross-border national culture text feature vector x_i merged with word set information is input into the reset gate and the update gate of the gated recurrent unit respectively to extract context-dependent feature information:

r_i = σ(W_r · [x_i, h_{i-1}]), u_i = σ(W_u · [x_i, h_{i-1}]), h̃_i = tanh(W_h · [r_i ⊙ h_{i-1}, x_i]), h_i = (1 − u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i,

wherein σ is the sigmoid activation function, x_i is the character vector merged with word information, h_{i-1} is the hidden state at the previous moment, r_i is the reset gate, u_i is the update gate, W_r, W_u, W_h are training parameters, and tanh(·) is the activation function; the long-term dependency feature vector h_i of the cross-border national culture text context is obtained;
And Step3.2, performing optimal label probability calculation on the feature vectors, and predicting the entity labels through a cross-border ethnic culture entity recognition model.
4. A cross-border national culture entity recognition device based on word set feature weighting is characterized by comprising the following modules:
the cross-border national culture data preprocessing module: the method is used for labeling and preprocessing the data of the cross-border national culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
the cross-border national culture text feature representation module integrated with the word set feature information comprises: the method comprises the steps of obtaining a word set through cross-border national culture field dictionary matching, providing a word set characteristic weighting method and position information codes for obtaining word set information, and integrating the word set information into character vector representation;
the cross-border national culture entity recognition model training module based on word set feature weighting comprises: the system is used for extracting the characteristics of the context of the cross-border national sentences by using the thought of a bidirectional gating circulation unit and performing entity recognition model training based on word set characteristic weighting by adopting optimal entity label probability calculation;
the cross-border ethnic culture entity recognition module: the cross-border national culture entity recognition method is used for performing data preprocessing on input texts and then performing cross-border national culture entity recognition by using a trained cross-border national culture entity recognition model;
the specific steps of the cross-border national culture text feature representation of the character information of the word set are as follows:
step2.1, the word set is that all matched words are obtained from a dictionary through cross-border national culture characters, four word sets are formed according to the positions of the characters, and the words are divided into four word sets according to different positions of the characters in the matched words: the character is located at the head part B of the word, the character is located in the interior M of the word, the character is located at the tail part E of the word, and a single character is marked by four labels;
for a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word set matching rules for a character c_i are as follows:

B(c_i) = {w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n}, M(c_i) = {w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n}, E(c_i) = {w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i}, S(c_i) = {c_i | c_i ∈ V_w},
step2.2, performing vectorized feature representation on the word sets: the word frequency of each word in the data set is counted; since the word frequency represents the importance degree of the word, the frequencies are assigned to the four types of word vectors by a weighting method:

v_i(L) = Σ_{w∈L} z(w)·e(w),

wherein z(w_i) is the word frequency of word w_i counted in the data set, e(w_i) is the word vector representation of w_i, L represents one of the four word set types B, M, E, S, and v_i(L) is a word set vector;
step2.3, in order to fully consider the importance degree among the four word set vectors, acquiring the importance degree among the word set vectors by using a word set characteristic weighting method, so that the important word set vectors can acquire more weights; using the word set vector v obtained by Step2.2 i (L)={v i (B),v i (M),v i (E),v i (S) } obtaining a weight matrix W through neural network training v The final weight vector is then output by the Softmax function:
V i =W v [v i (B);v i (M);v i (E);v i (S)]+b v ,α i =Softmax(V i ).
wherein W_v is a training parameter, b_v is the offset parameter of the neural network, the Softmax function is the normalization operation, and finally a weight vector α_i with value range (0, 1) is obtained;
Step2.4, word set information is merged into the character vector representation: in order to retain domain dictionary information, each character vector and four types of word set vectors corresponding to the character are combined into a feature vector, and the feature vector jointly form the final representation of the character:
e i (B,M,E,S)=[α i1 v i (B);α i2 v i (M);α i3 v i (E);α i4 v i (S)],
x i =[g i ;e i (B,M,E,S)].
wherein v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)}, [α_i1, α_i2, α_i3, α_i4] = α_i is the weight vector obtained in Step2.3, e_i(B, M, E, S) represents the feature vector spliced from the four types, x_i represents the feature vector merged with word set information, and g_i is the character vector in the cross-border national culture text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111068293.3A CN113935324B (en) | 2021-09-13 | 2021-09-13 | Cross-border national culture entity identification method and device based on word set feature weighting |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113935324A (en) | 2022-01-14 |
CN113935324B (en) | 2022-10-28 |
Family
ID=79275641
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111068293.3A Active CN113935324B (en) | 2021-09-13 | 2021-09-13 | Cross-border national culture entity identification method and device based on word set feature weighting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113935324B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114580422B (en) * | 2022-03-14 | 2022-12-13 | 昆明理工大学 | Named entity identification method combining two-stage classification of neighbor analysis |
CN114970537B (en) * | 2022-06-27 | 2024-04-23 | 昆明理工大学 | Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101028A (en) * | 2020-08-17 | 2020-12-18 | 淮阴工学院 | Multi-feature bidirectional gating field expert entity extraction method and system |
EP3767516A1 (en) * | 2019-07-18 | 2021-01-20 | Ricoh Company, Ltd. | Named entity recognition method, apparatus, and computer-readable recording medium |
CN113033206A (en) * | 2021-04-01 | 2021-06-25 | 重庆交通大学 | Bridge detection field text entity identification method based on machine reading understanding |
CN113128232A (en) * | 2021-05-11 | 2021-07-16 | 济南大学 | Named entity recognition method based on ALBERT and multi-word information embedding |
Non-Patent Citations (1)
Title |
---|
A Novel Dual Pointer Approach for Entity Mention Extraction; LIU Jie et al.; Chinese Journal of Electronics; 2021-01-31; full text * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN110377903B (en) | Sentence-level entity and relation combined extraction method | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN106383816B (en) | The recognition methods of Chinese minority area place name based on deep learning | |
Silberer et al. | Visually grounded meaning representations | |
CN109684642B (en) | Abstract extraction method combining page parsing rule and NLP text vectorization | |
CN110222163A (en) | A kind of intelligent answer method and system merging CNN and two-way LSTM | |
CN107909115B (en) | Image Chinese subtitle generating method | |
CN112257449B (en) | Named entity recognition method and device, computer equipment and storage medium | |
CN113935324B (en) | Cross-border national culture entity identification method and device based on word set feature weighting | |
CN114036933B (en) | Information extraction method based on legal documents | |
CN110765769B (en) | Clause feature-based entity attribute dependency emotion analysis method | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN110555084A (en) | remote supervision relation classification method based on PCNN and multi-layer attention | |
CN110852089B (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN114818717B (en) | Chinese named entity recognition method and system integrating vocabulary and syntax information | |
CN110851593B (en) | Complex value word vector construction method based on position and semantics | |
CN110569506A (en) | Medical named entity recognition method based on medical dictionary | |
CN110263174A (en) | - subject categories the analysis method based on focus | |
CN116842168B (en) | Cross-domain problem processing method and device, electronic equipment and storage medium | |
Chauhan et al. | Enhanced unsupervised neural machine translation by cross lingual sense embedding and filtered back-translation for morphological and endangered Indic languages | |
CN117094325B (en) | Named entity identification method in rice pest field | |
CN116757195B (en) | Implicit emotion recognition method based on prompt learning | |
CN117851591A (en) | Multi-label long text classification method based on BIGBIRD and graph annotation meaning network | |
CN112434512A (en) | New word determining method and device in combination with context |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||