CN113935324A - Cross-border national culture entity identification method and device based on word set feature weighting - Google Patents

Cross-border national culture entity identification method and device based on word set feature weighting

Info

Publication number
CN113935324A
CN113935324A (application number CN202111068293.3A)
Authority
CN
China
Prior art keywords
cross
word
border
word set
national culture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111068293.3A
Other languages
Chinese (zh)
Other versions
CN113935324B (en)
Inventor
毛存礼
杨振平
余正涛
高盛祥
黄于欣
郭军军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202111068293.3A priority Critical patent/CN113935324B/en
Publication of CN113935324A publication Critical patent/CN113935324A/en
Application granted granted Critical
Publication of CN113935324B publication Critical patent/CN113935324B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a cross-border ethnic culture entity recognition method and device based on word set feature weighting, and belongs to the technical field of natural language processing. Targeting the characteristics of cross-border national culture entities, the method comprises four parts: cross-border national culture entity data labeling and preprocessing, cross-border national culture text feature representation fused with word set feature information, a cross-border national culture entity recognition model based on word set feature weighting, and cross-border national culture entity recognition. The cross-border national culture entity recognition device based on word set feature weighting is built from these four functional modules and performs entity recognition on input sentences.

Description

Cross-border national culture entity identification method and device based on word set feature weighting
Technical Field
The invention relates to a cross-border ethnic culture entity recognition method and device based on word set characteristic weighting, and belongs to the technical field of natural language processing.
Background
Information extraction comprises entity recognition, relation extraction and event extraction. Entity recognition is a basic task of information extraction: it determines entity boundaries and classifies the entities into predefined types. Mining cross-border ethnic culture entities helps expand the domain knowledge graph and supports information retrieval. Using entity recognition technology to automatically label entities related to cross-border national culture from the Internet shortens the time researchers spend manually extracting and processing information. Integrating lexical features into the entity recognition model addresses the problem of ambiguous entity boundaries in cross-border national culture text: fusing word set features into the model can achieve a good effect, alleviating fuzzy domain word boundaries and enhancing the representation of text semantic information. Cross-border national culture entities are usually combinations of domain vocabulary describing national culture characteristics, and a large number of domain words exist in cross-border national culture data; for example, 'Sangjing Bimai', a nickname of the Water-Splashing Festival, is a domain word belonging to the festival entity type, and an entity recognition method that fuses word set information can achieve a good effect on such data.
Disclosure of Invention
The invention provides a cross-border national culture entity recognition method and device based on word set feature weighting, aiming to improve the recognition of cross-border national culture entities with fuzzy boundaries and to enhance cross-border national culture text representation by fusing word set information.
The technical scheme of the invention is as follows: in a first aspect, a method for recognizing a cross-border national culture entity based on word set feature weighting comprises the following specific steps:
step1, marking and preprocessing the data of the cross-border ethnic culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
Due to the lack of entity data sets in the cross-border national culture field, six entity types, including diet, festival and custom, are defined according to the characteristics of the large number of domain entities in cross-border national culture data, and 15717 cross-border national culture sentences with entity labels are annotated manually; this data set provides good support for entity recognition model training.
Step2, cross-border national culture text feature representation of the feature information of the merged word set: acquiring a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for acquiring word set information and integrating the word set information into character vector representation;
Cross-border national culture entities are usually combinations of domain vocabulary describing national culture characteristics, such as the 'Meng Yong soil pan' in dietary culture. Because word sets carry word boundary and word meaning information, matching rules against a cross-border national culture domain dictionary are formulated to obtain four word sets, and a word set feature weighting method and position information coding are proposed to acquire the word set information and enhance the semantic information of cross-border national culture features.
Step3, training a cross-border national culture entity recognition model based on word set feature weighting: extracting context features of cross-border national culture sentences with a bidirectional gated recurrent unit, and training the entity recognition model based on word set feature weighting with optimal entity label probability calculation;
To let the model obtain the contextual semantic information of the cross-border national culture text (for example, the vector representation of a sentence describing the Dai specialty food 'grass grilled fish' needs to be associated with the context word 'citronella grass'), and to address the word dependence of combined features, a bidirectional gated recurrent unit is integrated into the invention to extract the context features of cross-border national culture sentences, and the entity recognition model based on word set feature weighting is trained with optimal entity label probability calculation.
Step4, recognizing cross-border ethnic culture entities: by using the trained cross-border national culture entity recognition model, the cross-border national culture entity recognition is carried out after data preprocessing is carried out on the input text.
As a preferable scheme of the invention, the Step1 comprises the following specific steps:
Step1.1, cross-border national culture data are collected from cross-border national culture websites and preprocessed by deduplication and special-character filtering, and every cross-border national culture sentence is annotated with its corresponding entity labels. For example, for the sentence 'The Dai nationality has many glutinous rice products with unique characteristics, such as fragrant bamboo rice, glutinous rice, rice cake, etc.', the entities are manually labeled as 'bamboo rice - diet culture, glutinous rice - diet culture, thousand-layer rice cake - diet culture', etc. In this way, 15717 cross-border national culture sentences with entity labels are manually annotated; the entity types in this domain comprise location, festival, diet, custom, literature and art, and architecture, and the analysis of the entity types is shown in Table 1:
TABLE 1 Cross-border ethnic culture entity type analysis
There are several specifications for character-level entity labels of cross-border national culture sentences, such as 'BIO' and 'BMESO'. Because most entities in the cross-border national culture field are composed of combined features, each cross-border national culture sentence is split into characters and separated from its label, and every character is tagged with the 'BMESO' scheme, where B marks the start of an entity, M a position inside an entity, E the end of an entity, S a single-character entity, and O a non-entity character. For example, for a sentence beginning with the custom entity '赕佛' (a Dai Buddhist offering custom), the corresponding label sequence is 'B-XS E-XS O O O O O O O O O', where B-XS marks the beginning of a custom-type entity, E-XS its end, and O a non-entity character. The defined cross-border ethnic culture entity tag format is shown in Table 2:
TABLE 2 Cross-border ethnic culture entity label format
Entity name | Entity type | Entity label
Ruili | Location | B-WZ/E-WZ
Water-Splashing Festival | Festival culture | B-JR/M-JR/E-JR
Canarium album | Diet culture | B-YS/M-YS/E-YS
赕佛 (Buddhist offering) | Custom culture | B-XS/E-XS
Hand dancing | Literature and art culture | B-WY/M-WY/E-WY
Soil palm room | Building culture | B-JZ/M-JZ/E-JZ
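To make the BMESO scheme above concrete, the following is a minimal sketch of converting entity-span annotations into per-character tags; the example sentence, the span positions and the helper name to_bmeso are illustrative assumptions, while the tag codes follow Table 2.

```python
# Illustrative sketch: convert entity-span annotations into per-character BMESO tags.
def to_bmeso(sentence, entities):
    """entities: list of (start, end, type_code) character spans, end exclusive."""
    tags = ["O"] * len(sentence)
    for start, end, code in entities:
        if end - start == 1:
            tags[start] = f"S-{code}"
        else:
            tags[start] = f"B-{code}"
            for i in range(start + 1, end - 1):
                tags[i] = f"M-{code}"
            tags[end - 1] = f"E-{code}"
    return tags

# Example sentence containing the festival entity "泼水节" (Water-Splashing Festival).
sentence = "泼水节是傣族的节日"
print(to_bmeso(sentence, [(0, 3, "JR")]))
# ['B-JR', 'M-JR', 'E-JR', 'O', 'O', 'O', 'O', 'O', 'O']
```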
Step1.2, the cross-border national culture data are deduplicated and filtered of special characters in order to construct a cross-border national culture domain dictionary, which is later used to obtain word set information and to enhance sentence semantic information. The domain dictionary is trained and built from cross-border national culture data collected from the web combined with domain words; it contains words related to festivals, architecture, customs, diet, locations and literature and art in cross-border national culture, such as the cross-border national culture words 'Nola dance (literature and art), Bei River county (location), Ara Da Gong (architecture), curry crab mango fragrant rice (diet), bathing-Buddha ceremony (custom), summer festival (festival)'.
Step1.3, a pre-trained language model is used for the character vector representation of the cross-border national culture text: after special processing, the characters are fed into a Transformer Encoder layer to obtain the vector representation of every character of the input text. For example, the text '傣族孔雀舞' (Dai peacock dance) is represented, after the element-wise addition of its three Embedding components, as E = {c_[CLS], c_傣, c_族, c_孔, c_雀, c_舞, c_[SEP]}, where c_[CLS] and c_[SEP] are the special token vectors of the text. The final output of the Transformer Encoder is obtained through a series of normalization and linear operations. A cross-border ethnic culture sentence is regarded as a character sequence S = {c_1, c_2, …, c_n} ∈ V_c, where V_c is a character-level vocabulary and c_i is the i-th character of the sentence S of length n. Following the pre-trained language model, each character c_i of the cross-border ethnic culture text is given a vector representation:

Q = c_i·W_Q,  K = c_i·W_K,  V = c_i·W_V,

Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V,

g_i = Attention(Q, K, V),

where W_Q, W_K, W_V are weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation. The final output of the Transformer Encoder is obtained through a series of normalization and linear operations; repeating this process for every character in the text dynamically generates the character vectors of the cross-border national culture text.
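As a sketch of the character-vector computation just described, the scaled dot-product attention can be written directly from the formulas above; the tensor dimensions, the random parameter initialization and the single attention layer are simplifying assumptions (a full pre-trained Transformer Encoder stacks many such layers with normalization and linear projections).

```python
import torch

d_model, d_k = 768, 64                    # assumed character-embedding and projection sizes
W_Q = torch.randn(d_model, d_k)           # weight parameters W_Q, W_K, W_V from the formulas
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

def char_attention(c):
    """c: (n, d_model) embeddings of the n characters of one sentence; returns g_i for each character."""
    Q, K, V = c @ W_Q, c @ W_K, c @ W_V
    scores = Q @ K.T / d_k ** 0.5              # Q·K^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V   # g = Attention(Q, K, V)

chars = torch.randn(7, d_model)   # e.g. [CLS] + the five characters of "傣族孔雀舞" + [SEP]
g = char_attention(chars)         # (7, d_k) contextual character vectors
```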
As a preferable scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, cross-border national culture domain word set matching method: word sets are obtained by matching the characters of a cross-border ethnic culture sentence against the domain dictionary, and four word sets are formed according to the character positions. The domain dictionary carries word boundary information and cross-border national culture text semantic information, and character matching preserves the boundary and semantic information of the matched words. A character c_i matched against the domain dictionary can yield different words, which are divided into four word set types according to the position of the character within the matched word: the character at the head of the word (B), inside the word (M), at the tail of the word (E), or as a single character (S). For example, for the entity 'crisp beef jerky' in diet culture, the word sets matched by the formula below for the character 'beef' (牛) are B = {beef, beef jerky}, M = {crisp beef jerky, crisp beef}, E = {crisp beef}, S = {beef}. Likewise, for the entity 'pineapple purple rice' in diet culture, the word sets matched for the character 'rice' (米) are B = {rice}, M = {pineapple purple rice, purple rice}, E = {purple rice}, S = {rice}.

For a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word sets matched for character c_i are:

B(c_i) = { w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n },
M(c_i) = { w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n },
E(c_i) = { w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i },
S(c_i) = { c_i | c_i ∈ V_w },

where V_w denotes the pre-constructed domain dictionary, w_{j,k} denotes a word in the dictionary spanning character positions j to k, i denotes the position of the character, j, k denote positions on the two sides of the character, and n denotes the number of characters in the sentence.
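A minimal sketch of the four-way matching above, using the crisp-beef-jerky example; the dictionary contents and the maximum word length are illustrative assumptions, and the domain dictionary is modeled as a plain Python set.

```python
# Sketch of Step2.1: match each character against a domain dictionary V_w and collect B/M/E/S sets.
domain_dict = {"牛", "牛肉", "牛肉干", "酥牛", "酥牛肉", "酥牛肉干"}  # illustrative dictionary contents
MAX_WORD_LEN = 6  # assumed upper bound on dictionary word length

def match_word_sets(sentence, i):
    """Return the B, M, E, S word sets for the character at position i (0-based)."""
    sets = {"B": set(), "M": set(), "E": set(), "S": set()}
    n = len(sentence)
    for j in range(max(0, i - MAX_WORD_LEN + 1), i + 1):
        for k in range(i + 1, min(n, j + MAX_WORD_LEN) + 1):
            w = sentence[j:k]           # candidate word covering position i
            if w not in domain_dict:
                continue
            if len(w) == 1:
                sets["S"].add(w)        # single character
            elif j == i:
                sets["B"].add(w)        # character at the head of the word
            elif k - 1 == i:
                sets["E"].add(w)        # character at the tail of the word
            else:
                sets["M"].add(w)        # character inside the word
    return sets

print(match_word_sets("酥牛肉干", 1))   # word sets matched for the character "牛"
```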
Step2.2, obtaining the word set vectors: the frequency of every word in the data set is counted; because word frequency reflects the importance of a word, a weighting method is used to assign the corresponding frequencies to the four types of word vectors. The frequencies of the words matched by the characters of the cross-border national culture text are fused into the word vectors, and the word vectors within each type are combined to obtain the word set vector of that type:

Z = Σ_{w ∈ B∪M∪E∪S} z(w),

v_i(L) = (1/Z) · Σ_{w ∈ L} z(w)·e(w),

where z(w_i) is the frequency of word w_i counted in the data set, e(w_i) is the word vector of w_i with dimension d_w = 50, L denotes one of the four types {B, M, E, S}, and v_i(L) is a word set vector of dimension 1 × d_w.
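The frequency weighting can be sketched as follows; this reading (a frequency-weighted combination normalized by the per-character total Z) is an assumption consistent with the stated 1 × d_w dimension, and the data structures and the zero-vector fallback for empty sets are illustrative.

```python
import numpy as np

d_w = 50  # word-vector dimension stated above

def word_set_vector(words, freq, emb, Z):
    """Frequency-weighted combination of the word vectors in one set (B, M, E or S).
    words: the words of the set; freq: word -> corpus frequency z(w);
    emb: word -> d_w vector e(w); Z: total frequency over B, M, E and S for this character."""
    if not words:
        return np.zeros(d_w)  # empty-set handling is an assumption
    return sum(freq[w] * emb[w] for w in words) / Z
```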
Step2.3, word set feature weighting to obtain the relative importance of the word set vectors: the word set vectors v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)} are each obtained by combining the word vectors within one type, with weights computed only inside that type. To fully account for the relative importance of the four word set vectors, a word set feature weighting method is used so that the more important word set vectors receive larger weights. Using the word set vectors v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)} obtained in Step2.2, a weight matrix W_v is learned by the neural network, and the final weight vector is output by the Softmax function:

V_i = W_v·[v_i(B); v_i(M); v_i(E); v_i(S)] + b_v,

α_i = Softmax(V_i),

where W_v is a training parameter of dimension 1 × d_w with d_w = 50, b_v is a bias of dimension 1 × 4, and the Softmax function is a normalization operation. The result is a weight vector α_i of dimension 1 × 4 with values in (0, 1).
Step2.4, position coding to enhance position information: the character positions in the cross-border national culture text contain word boundary information, and the words matched at different character positions differ, so position codes are added to the word set vectors; the four types of word set vectors are distinguished according to the character position, the four position types are represented as vectors, and the word set vectors fused with position coding are expressed as:

v_i(B) = p_i(B)·W_L + v_i(B),
v_i(M) = p_i(M)·W_L + v_i(M),
v_i(E) = p_i(E)·W_L + v_i(E),
v_i(S) = p_i(S)·W_L + v_i(S),

where p_i(B) = [1,0,0,0], p_i(M) = [0,1,0,0], p_i(E) = [0,0,1,0], p_i(S) = [0,0,0,1], and W_L is a 4 × d_w training parameter with d_w = 50.
Step2.5, fusing the word set information into the character vector representation: in order to retain as much domain dictionary information as possible, each character vector and the four word set vectors corresponding to that character are combined into one feature vector, which together form the final representation of the character:

e_i(B, M, E, S) = [α_i1·v_i(B); α_i2·v_i(M); α_i3·v_i(E); α_i4·v_i(S)],

x_i = [g_i; e_i(B, M, E, S)],

where [α_i1, α_i2, α_i3, α_i4] = α_i is the weight vector, e_i(B, M, E, S) is the concatenation of the four weighted word set vectors, x_i is the feature vector fused with word set information, and g_i is the character vector from Step1.3.
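Steps 2.3 to 2.5 can be sketched together as one fusion routine; d_w = 50 follows the text, while the character-vector size, the random parameter initialization and the variable names are assumptions.

```python
import torch

d_w, d_char = 50, 768                   # d_w from the text; character-vector size is assumed
W_v = torch.randn(1, d_w)               # 1 x d_w training parameter of Step2.3
b_v = torch.randn(1, 4)                 # 1 x 4 bias
W_L = torch.randn(4, d_w)               # 4 x d_w position-coding parameter of Step2.4
P = torch.eye(4)                        # one-hot position codes p_i(B) .. p_i(S)

def fuse(char_vec, set_vecs):
    """char_vec: (d_char,) character vector g_i; set_vecs: (4, d_w) rows v_i(B), v_i(M), v_i(E), v_i(S)."""
    set_vecs = P @ W_L + set_vecs                     # Step2.4: add position coding
    V_i = (set_vecs @ W_v.T).T + b_v                  # Step2.3: V_i = W_v[...] + b_v, shape (1, 4)
    alpha = torch.softmax(V_i, dim=-1).squeeze(0)     # weight vector alpha_i
    e_i = (alpha[:, None] * set_vecs).reshape(-1)     # Step2.5: weighted set vectors concatenated
    return torch.cat([char_vec, e_i])                 # x_i = [g_i ; e_i(B, M, E, S)]

x_i = fuse(torch.randn(d_char), torch.randn(4, d_w))  # (768 + 200,) fused feature vector
```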
As a preferable scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, to address the dependence of combined feature words in the cross-border national culture text, the feature vector x_i fused with word set information from Step2.5 is fed into the reset gate and the update gate of a bidirectional gated recurrent unit (GRU). The reset gate controls how much past information is discarded, i.e. how much of the previous content is forgotten and how much is retained and combined with the input of the current time step; when r approaches 0, the previous hidden state is ignored and only the current input is used for resetting. The update gate decides how much information is passed to the next state, which allows the model to copy information from the previous state and reduces the risk of vanishing gradients. The reset gate and update gate are computed as:

r_i = σ(W_r·[x_i, h_{i-1}]),

u_i = σ(W_u·[x_i, h_{i-1}]),

where σ is the sigmoid activation function, x_i is the feature vector fused with word set information, h_{i-1} is the hidden state of the previous time step, r_i is the reset gate, u_i is the update gate, and W_r, W_u are training parameters.

The new hidden state h_i is computed from the previous hidden state h_{i-1} and the current input x_i:

h̃_i = tanh(W_h·[r_i ⊙ h_{i-1}, x_i]),

h_i = (1 - u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i,

where W_h is a training parameter, ⊙ denotes element-wise multiplication, and tanh(·) is the activation function. The feature vector h_i obtained from the bidirectional GRU encoding layer captures the long-range dependencies between the context information in the cross-border ethnic culture text.
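The context encoding over the fused vectors x_i could be realized with PyTorch's built-in bidirectional GRU, as sketched below; the hidden size and batch layout are assumptions.

```python
import torch
import torch.nn as nn

d_x, d_hidden = 968, 128                      # size of x_i and GRU hidden size (assumed)
bigru = nn.GRU(input_size=d_x, hidden_size=d_hidden,
               bidirectional=True, batch_first=True)

x = torch.randn(1, 20, d_x)                   # one sentence of 20 fused character vectors
h, _ = bigru(x)                               # h: (1, 20, 2 * d_hidden) contextual features h_i
```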
Step3.2, the dependencies between cross-border national culture entity labels are considered to avoid error cases in cross-border national culture entity recognition, for example the unreasonable case in which the Water-Splashing Festival entity should be labeled 'B-JR M-JR E-JR' but the diet-type internal label 'M-YS' appears after 'B-JR' during training. Optimal label probability calculation is therefore performed on the feature vectors, and the entity labels are predicted by the cross-border national culture entity recognition model:

P_i = W_p·h_i + b_p,

score(S, y) = Σ_i ( P_{i, y_i} + T_{y_{i-1}, y_i} ),   p(y | S) = exp(score(S, y)) / Σ_{y'} exp(score(S, y')),

where W_p, b_p are the parameters used to compute the score matrix P, T is a transition matrix, and h_i is the output vector of Step3.1.
A self-attention mechanism is used to extract the importance of adjacent feature vectors, enhancing useful features and suppressing less useful ones. For the feature vector h_i produced by the bidirectional GRU encoder, the corresponding weights are computed with self-attention:

Q = h_i·W_Q,  K = h_i·W_K,  V = h_i·W_V,

Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V,

head_i = Attention(Q, K, V),

where W_Q, W_K, W_V are weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation.

The self-attention mechanism reflects the relevance and relative importance of the feature vectors and supports cross-border national culture entity recognition: because different feature vectors influence entity recognition differently, each is given a weight according to its influence, and the final output vector head_i is obtained. Self-attention further sharpens the distinction in importance among the components of the feature vectors, which benefits cross-border national culture entity recognition.
In a second aspect, an embodiment of the present invention further provides a cross-border ethnic cultural entity recognition apparatus weighted based on word set features, which includes modules for performing the method of the first aspect.
The invention has the beneficial effects that:
1. The invention fuses word set information into the entity recognition model; the word sets obtained by matching characters against the domain dictionary contain entity boundary information, and they are used to enhance the semantic information of cross-border national culture text, so the model achieves a better effect on cross-border national culture entity recognition.
2. The method obtains the relative importance of the word set vectors through word set feature weighting and uses position coding to enhance the position information of the word sets matched by the characters, making the word set features and vectors richer. Fusing the word set features into the character representation alleviates the entity boundary ambiguity that causes recognition errors in purely character-based entity recognition.
Drawings
FIG. 1 is a word set information diagram based on word set feature weighting in the present invention;
FIG. 2 is a diagram illustrating exemplary word frequency statistics in the present invention;
FIG. 3 is a cross-border national culture entity recognition frame diagram based on word set feature weighting in the present invention;
fig. 4 is an overall flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1 to 4, in a first aspect, a method for recognizing a cross-border national culture entity based on word set feature weighting specifically includes the following steps:
step1, marking and preprocessing the data of the cross-border ethnic culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
Due to the lack of entity data sets in the cross-border national culture field, six entity types, including diet, festival and custom, are defined according to the characteristics of the large number of domain entities in cross-border national culture data, and 15717 cross-border national culture sentences with entity labels are annotated manually; this data set provides good support for entity recognition model training.
Step2, cross-border national culture text feature representation of the feature information of the merged word set: acquiring a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for acquiring word set information and integrating the word set information into character vector representation;
Cross-border national culture entities are usually combinations of domain vocabulary describing national culture characteristics, such as the 'Meng Yong soil pan' in dietary culture. Because word sets carry word boundary and word meaning information, matching rules against a cross-border national culture domain dictionary are formulated to obtain four word sets, and a word set feature weighting method and position information coding are proposed to acquire the word set information and enhance the semantic information of cross-border national culture features.
Step3, training a cross-border national culture entity recognition model based on word set characteristic weighting; extracting the characteristics of the context of the cross-border national sentences by using the thought of a bidirectional gating circulation unit, and performing entity recognition model training based on word set characteristic weighting by adopting optimal entity label probability calculation;
To let the model obtain the contextual semantic information of the cross-border national culture text (for example, the vector representation of a sentence describing the Dai specialty food 'grass grilled fish' needs to be associated with the context word 'citronella grass'), and to address the word dependence of combined features, a bidirectional gated recurrent unit is integrated into the invention to extract the context features of cross-border national culture sentences, and the entity recognition model based on word set feature weighting is trained with optimal entity label probability calculation.
Step4, recognizing cross-border ethnic culture entities: by using the trained cross-border national culture entity recognition model, the cross-border national culture entity recognition is carried out after data preprocessing is carried out on the input text.
As a preferable scheme of the invention, the Step1 comprises the following specific steps:
step1.1, acquiring related cross-border national culture data on a cross-border national culture website, and manually marking 15717 cross-border national culture sentences with entity labels, wherein entity types are defined as 6 types: location, holiday culture, diet culture, custom culture, literary and artistic culture, and architectural culture; segmenting characters and corresponding labels in the cross-border national culture sentences to enable each character to correspond to one label, wherein the format of the corresponding entity label is shown in a table 3:
TABLE 3 Cross-border ethnic culture entity label format
Entity name | Entity type | Entity label
Ruili | Location | B-WZ/E-WZ
Water-Splashing Festival | Festival culture | B-JR/M-JR/E-JR
Canarium album | Diet culture | B-YS/M-YS/E-YS
赕佛 (Buddhist offering) | Custom culture | B-XS/E-XS
Hand dancing | Literature and art culture | B-WY/M-WY/E-WY
Soil palm room | Building culture | B-JZ/M-JZ/E-JZ
Step1.2, sentence semantic information is enhanced by constructing a cross-border national culture domain dictionary. The domain dictionary is trained and built from cross-border national culture data collected from the web combined with domain words; it contains words related to festivals, architecture, customs, diet, locations and literature and art in cross-border national culture, such as the cross-border national culture words 'Nola dance (literature and art), North River county (location), southeast Dagong (architecture), curry crab mango rice (diet), bathing-Buddha ceremony (custom), summer festival (festival)'.
Step1.3, a pre-trained language model is used for the character vector representation of the cross-border national culture text: after special processing, the characters are fed into a Transformer Encoder layer to obtain the vector representation of every character of the input text. For example, the text '傣族孔雀舞' (Dai peacock dance) is represented, after the element-wise addition of its three Embedding components, as E = {c_[CLS], c_傣, c_族, c_孔, c_雀, c_舞, c_[SEP]}, where c_[CLS] and c_[SEP] are the special token vectors of the text. The final output of the Transformer Encoder is obtained through a series of normalization and linear operations. A cross-border ethnic culture sentence is regarded as a character sequence S = {c_1, c_2, …, c_n} ∈ V_c, where V_c is a character-level vocabulary and c_i is the i-th character of the sentence S of length n. Following the pre-trained language model, each character c_i of the cross-border ethnic culture text is given a vector representation:

Q = c_i·W_Q,  K = c_i·W_K,  V = c_i·W_V,

Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V,

g_i = Attention(Q, K, V),

where W_Q, W_K, W_V are weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation. The final output of the Transformer Encoder is obtained through a series of normalization and linear operations; repeating this process for every character in the text dynamically generates the character vectors of the cross-border national culture text.
As a preferable scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, cross-border national culture domain word set matching: word sets are obtained by matching the characters of a cross-border ethnic culture sentence against the domain dictionary, and four word sets are formed according to the character positions. The domain dictionary carries word boundary information and cross-border national culture text semantic information, and character matching preserves the boundary and semantic information of the matched words. A character c_i matched against the domain dictionary can yield different words, which are divided into four word set types according to the position of the character within the matched word: the character at the head of the word (B), inside the word (M), at the tail of the word (E), or as a single character (S).

For a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word sets matched for character c_i are:

B(c_i) = { w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n },
M(c_i) = { w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n },
E(c_i) = { w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i },
S(c_i) = { c_i | c_i ∈ V_w },

where V_w denotes the pre-constructed domain dictionary, w_{j,k} denotes a word in the dictionary spanning character positions j to k, i denotes the position of the character, j, k denote positions on the two sides of the character, and n denotes the number of characters in the sentence.
Step2.2, obtaining the word set vectors: as shown in Fig. 2, the frequency of each matched word is counted; because word frequency reflects the importance of a word, the matched word frequencies are fused into the word vectors, and a weighting method assigns the corresponding frequencies to the four types of word vectors. The frequencies of the words matched by the characters of the cross-border national culture text are fused into the word vectors, and the word vectors within each type are combined to obtain the word set vector of that type:

Z = Σ_{w ∈ B∪M∪E∪S} z(w),

v_i(L) = (1/Z) · Σ_{w ∈ L} z(w)·e(w),

where z(w_i) is the frequency of word w_i counted in the data set, e(w_i) is the word vector of w_i with dimension d_w = 50, L denotes one of the four types {B, M, E, S}, and v_i(L) is a word set vector of dimension 1 × d_w.
Step2.3, word set feature weighting to obtain the relative importance of the word set vectors: each word set vector v_i(L) is obtained by combining the word vectors within one type, with weights computed only inside that type. To fully account for the relative importance of the four word set vectors, a word set feature weighting method is used: a weight matrix W_v is learned by the neural network, and the final weight vector is output by the Softmax function:

V_i = W_v·[v_i(B); v_i(M); v_i(E); v_i(S)] + b_v,

α_i = Softmax(V_i),

where W_v is a training parameter of dimension 1 × d_w with d_w = 50, b_v is a bias of dimension 1 × 4, and the Softmax function is a normalization operation. The result is a weight vector α_i of dimension 1 × 4 with values in (0, 1).
Step2.4, position coding to enhance position information: the character positions in the cross-border national culture text contain word boundary information, and the words matched at different character positions differ, so position codes are added to the word set vectors; the four types of word set vectors are distinguished according to the character position, the four position types are represented as vectors, and the word set vectors fused with position coding are expressed as:

v_i(B) = p_i(B)·W_L + v_i(B),
v_i(M) = p_i(M)·W_L + v_i(M),
v_i(E) = p_i(E)·W_L + v_i(E),
v_i(S) = p_i(S)·W_L + v_i(S),

where p_i(B) = [1,0,0,0], p_i(M) = [0,1,0,0], p_i(E) = [0,0,1,0], p_i(S) = [0,0,0,1], and W_L is a 4 × d_w training parameter with d_w = 50.
Step2.5, fusing the word set information into the character vector representation: in order to retain as much domain dictionary information as possible, each character vector and the four word set vectors corresponding to that character are combined into one feature vector, which together form the final representation of the character:

e_i(B, M, E, S) = [α_i1·v_i(B); α_i2·v_i(M); α_i3·v_i(E); α_i4·v_i(S)],

x_i = [g_i; e_i(B, M, E, S)],

where [α_i1, α_i2, α_i3, α_i4] = α_i is the weight vector, e_i(B, M, E, S) is the concatenation of the four weighted word set vectors, x_i is the feature vector fused with word set information, and g_i is the character vector.
As a preferable scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, a bidirectional GRU is used to extract features from the vector representation of the cross-border national culture text fused with word set information. The feature vector x_i fused with word set information is fed into the reset gate and the update gate; the reset gate controls how much past information is discarded, i.e. how much of the previous content is forgotten and how much is retained and combined with the input of the current time step. When r approaches 0, the previous hidden state is ignored and only the current input is used for resetting. The update gate decides how much information is passed to the next state, which allows the model to copy information from the previous state and reduces the risk of vanishing gradients. The reset gate and update gate are computed as:

r_i = σ(W_r·[x_i, h_{i-1}]),

u_i = σ(W_u·[x_i, h_{i-1}]),

where σ is the sigmoid activation function, x_i is the feature vector fused with word set information, h_{i-1} is the hidden state of the previous time step, r_i is the reset gate, u_i is the update gate, and W_r, W_u are training parameters.

In the bidirectional GRU, the new hidden state h_i is computed from the previous hidden state h_{i-1} and the current input x_i:

h̃_i = tanh(W_h·[r_i ⊙ h_{i-1}, x_i]),

h_i = (1 - u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i,

where W_h is a training parameter, ⊙ denotes element-wise multiplication, and tanh(·) is the activation function. The feature vector h_i obtained from the bidirectional GRU encoding layer captures the long-range dependencies between the context information in the cross-border ethnic culture text.
Step3.2, a self-attention mechanism extracts the importance of adjacent feature vectors, enhancing useful features and suppressing less useful ones. For the feature vector h_i produced by the bidirectional GRU encoder, the feature vector weights are computed with self-attention:

Q = h_i·W_Q,  K = h_i·W_K,  V = h_i·W_V,

Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V,

head_i = Attention(Q, K, V),

where W_Q, W_K, W_V are weight parameters, d_k = 50 is the dimension of the input feature vector, and Softmax is the normalization operation.
As a preferable scheme of the invention, the Step4 comprises the following specific steps:
Step4.1, following the idea of global optimization, a globally optimal tag sequence is obtained by considering the dependencies between tags, which prevents error cases such as the unreasonable situation where a 'diet' tag follows a 'festival' tag.
For the characters s = {c_1, c_2, …, c_n} ∈ V_c of a cross-border ethnic culture text, the probability of the corresponding predicted tag sequence y = {y_1, y_2, …, y_n} is computed as:

P_i = W_p·head_i + b_p,

score(s, y) = Σ_i ( P_{i, y_i} + T_{y_{i-1}, y_i} ),   p(y | s) = exp(score(s, y)) / Σ_{y'} exp(score(s, y')),

where W_p, b_p are the parameters used to compute the score matrix P, T is a transition matrix, and head_i is the output vector of Step3.2. In the final decoding stage of label prediction, the Viterbi algorithm is used to predict the globally optimal label sequence.
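A compact sketch of the Viterbi decoding step over the emission scores P and transition matrix T; the label indexing and a separately supplied T are assumptions (a real CRF layer learns T jointly during training).

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (n, L) scores P_i; transitions: (L, L) matrix T[prev, curr].
    Returns the globally optimal label-index sequence."""
    n, L = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        total = score[:, None] + transitions + emissions[i][None, :]  # (L, L) candidate scores
        backptr[i] = total.argmax(axis=0)                             # best previous label per current label
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]
```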
To illustrate the effect of the invention, the following comparative experiments were carried out; the experimental data all come from the manually labeled cross-border national culture data set.
The model is evaluated by Precision, Recall and the F1 value, computed as follows:

Precision = TP / (TP + FP),

Recall = TP / (TP + FN),

F1 = 2 × Precision × Recall / (Precision + Recall),

where TP, FP and FN denote the numbers of correctly recognized entities, incorrectly recognized entities and missed entities, respectively.
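A sketch of computing the three metrics at the entity level; comparing predicted and gold entity spans as sets is a common convention and an assumption here, since the text only gives the formulas.

```python
def precision_recall_f1(gold_spans, pred_spans):
    """gold_spans / pred_spans: sets of (start, end, entity_type) tuples."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```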
In order to verify the effect of the cross-border national culture entity recognition model based on word set feature weighting, the following comparative test is designed for analysis. Compared with Bi-LSTM, Lattice-LSTM, LR-CNN, FLAT and SoftLexicon (LSTM) entity identification methods, the specific experimental results are shown in Table 4.
TABLE 4 comparative experiments of different methods
Method | P(%) | R(%) | F1(%)
Bi-LSTM+CRF | 83.59 | 91.52 | 87.38
Lattice-LSTM | 89.08 | 92.52 | 90.76
LR-CNN | 92.81 | 90.15 | 91.46
FLAT | 92.76 | 95.05 | 93.89
SoftLexicon(LSTM) | 90.68 | 93.39 | 92.01
The method of the invention | 95.56 | 94.01 | 94.72
The experiments show that, compared with the Bi-LSTM+CRF model, the method uses word set information to enhance the contextual semantic information of the text; compared with the Lattice-LSTM, LR-CNN, FLAT and SoftLexicon (LSTM) models, the method incorporates the cross-border national culture domain dictionary and adopts position coding to enhance the word set position information, so the word sets matched through the characters are more complete.
Table 5 compares the effect on the experimental results of fusing position coding only, word set feature weighting only, and both position coding and word set feature weighting during training of the cross-border national culture entity recognition model.
TABLE 5 influence of word set feature weighting and position coding on the model
Configuration | P(%) | R(%) | F1(%)
Fusing position coding | 94.72 | 93.25 | 93.98
Word set feature weighting | 94.15 | 92.39 | 93.26
Fusing position coding + word set feature weighting | 95.56 | 94.01 | 94.72
The experimental results show that the fused coding information affects the outcome: when only position coding is fused into the model, the F1 value is 1.46% lower than when both position coding and word set feature weighting are fused, which verifies that word set feature weighting helps distinguish the importance of the four word set vectors; when only word set feature weighting is fused into the model, the F1 value is 0.74% lower than the combined setting, which indicates that position coding enhances the word set position information. Adding position coding and word set feature weighting together captures the word set information more fully and further improves the accuracy of cross-border national culture entity recognition.
The following is an embodiment of the device of the present invention. An embodiment of the present invention further provides a cross-border national culture entity recognition apparatus based on word set feature weighting, which includes modules for performing the method of the first aspect. The apparatus specifically comprises the following modules:
the cross-border national culture data preprocessing module: the method is used for labeling and preprocessing the data of the cross-border national culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
the cross-border national culture text feature representation module integrated with the word set feature information comprises: the method comprises the steps of obtaining a word set through cross-border national culture field dictionary matching, providing a word set characteristic weighting method and position information codes for obtaining word set information, and integrating the word set information into character vector representation;
a cross-border national culture entity recognition model training module based on word set feature weighting: used for extracting context features of cross-border national culture sentences with a bidirectional gated recurrent unit and training the entity recognition model based on word set feature weighting with optimal entity label probability calculation;
the cross-border ethnic culture entity recognition module: the cross-border national culture entity recognition method is used for performing cross-border national culture entity recognition after data preprocessing is performed on input texts by using a trained cross-border national culture entity recognition model.
In one possible implementation, the cross-border ethnic culture entity identification module is further configured to: deploy the trained model on a local server, expose the model as an application interface with the Sanic framework so that it can be called directly from a web page, and return the predicted entities to the front-end interface for display.
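A minimal sketch of exposing the trained recognizer through Sanic as the module description suggests; the endpoint path, the request format and the recognize placeholder are assumptions, not part of the patent.

```python
from sanic import Sanic
from sanic.response import json as json_response

app = Sanic("cross_border_ner")

def recognize(text):
    # Placeholder for the trained cross-border national culture entity recognition model.
    return [{"entity": "Water-Splashing Festival", "type": "JR"}]

@app.post("/ner")
async def ner(request):
    text = request.json.get("text", "")
    return json_response({"entities": recognize(text)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```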
In one possible implementation, the cross-border ethnic culture data preprocessing module is further configured to:
acquiring cross-border national culture data through a cross-border national culture website, performing de-duplication and special character filtering pretreatment on the data, and manually marking 15717 cross-border national culture sentences with entity labels, wherein the entities in the fields comprise positions, festivals, diet, customs, literature and buildings;
performing duplicate removal and special character filtering on the cross-border national culture data to construct a cross-border national culture field dictionary so as to obtain word set information later, training the cross-border national culture data acquired on the network by combining field words to obtain word vectors and construct a field dictionary, wherein the field dictionary contains words related to festivals, buildings, customs, diet, positions and literary and artistic activities in the cross-border national culture;
performing character vector representation on the cross-border national culture text by adopting a pre-training language model, processing the characters, and then inputting the characters into a Transformer Encoder layer to obtain vector representation of each character of the input text; and a final output of the Transformer Encoder is obtained through a series of normalization and linear processing, so that the dynamic generation of character vectors in the cross-border national culture text is realized.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. The cross-border ethnic culture entity recognition method based on word set feature weighting is characterized by comprising the following steps of: the method comprises the following specific steps:
step1, marking and preprocessing the data of the cross-border ethnic culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
step2, cross-border national culture text feature representation of the feature information of the merged word set: acquiring a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for acquiring word set information and integrating the word set information into character vector representation;
step3, training a cross-border national culture entity recognition model based on word set feature weighting: extracting context features of cross-border national culture sentences with a bidirectional gated recurrent unit, and training the entity recognition model based on word set feature weighting with optimal entity label probability calculation;
step4, recognizing cross-border ethnic culture entities: by using the trained cross-border national culture entity recognition model, the cross-border national culture entity recognition is carried out after data preprocessing is carried out on the input text.
2. The method for recognizing the trans-border national culture entity based on the word set feature weighting as claimed in claim 1, wherein: the specific steps of Step1 are as follows:
step1.1, acquiring cross-border national culture data through a cross-border national culture website, performing de-duplication and special character filtering pretreatment on the data, and then manually marking 15717 cross-border national culture sentences with entity labels, wherein the entities in the fields comprise positions, festivals, diet, customs, literature and buildings;
step1.2, performing duplicate removal and special character filtering on the cross-border national culture data to construct a cross-border national culture field dictionary so as to obtain word set information later, training the cross-border national culture data obtained on the network by combining field words to obtain word vectors and construct a field dictionary, wherein the field dictionary contains words related to festivals, buildings, customs, diet, positions and literature and art in the cross-border national culture;
step1.3, performing character vector representation on the cross-border national culture text by adopting a pre-training language model, processing the characters, and then inputting the characters into a Transformer Encoder layer to obtain vector representation of each character of the input text; and a final output of the Transformer Encoder is obtained through a series of normalization and linear processing, so that the dynamic generation of character vectors in the cross-border national culture text is realized.
3. The method for recognizing the trans-border national culture entity based on the word set feature weighting as claimed in claim 1, wherein: the specific steps of Step2 are as follows:
step2.1, the word set is that all possibly matched words are obtained from a dictionary through cross-border national culture characters, four word sets are formed according to the positions of the characters, and the words are divided into four word sets according to different positions of the characters in the matched words: the character is located at the head part (B) of the word, the character is located in the interior (M) of the word, the character is located at the tail part (E) of the word, and a single character (S) is marked by four labels;
for a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word set matching rules for character c_i are:

B(c_i) = { w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n },
M(c_i) = { w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n },
E(c_i) = { w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i },
S(c_i) = { c_i | c_i ∈ V_w },

wherein V_w represents a pre-constructed domain dictionary, w_{j,k} represents a word in the domain dictionary spanning positions j to k, i represents the position of a character, j, k represent positions on the two sides of the character, and n represents the number of characters in the sentence;
step2.2, performing vectorized feature representation of the word sets: the frequency of each word in the data set is counted, and because word frequency represents the importance of a word, a weighting method assigns the frequencies to the four types of word vectors:

v_i(L) = (1/Z) · Σ_{w ∈ L} z(w)·e(w),   with Z = Σ_{w ∈ B∪M∪E∪S} z(w),

wherein z(w_i) is the frequency of word w_i counted in the data set, e(w_i) is the word vector of w_i, L represents one of the four word set types B, M, E, S, and v_i(L) is a word set vector;
step2.3, in order to fully consider the relative importance of the four word set vectors, a word set feature weighting method is used to obtain the importance among the word set vectors, so that important word set vectors receive larger weights; using the word set vectors v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)} obtained in Step2.2, a weight matrix W_v is obtained through neural network training, and the final weight vector is output by the Softmax function:

V_i = W_v·[v_i(B); v_i(M); v_i(E); v_i(S)] + b_v,   α_i = Softmax(V_i),

wherein W_v is a training parameter, b_v is the bias parameter of the neural network, the Softmax function is a normalization operation, and the result is a weight vector α_i with values in (0, 1);
Step2.4, word set information is merged into the character vector representation: in order to retain as much domain dictionary information as possible, each character vector and the four types of word set vectors corresponding to the character are combined into a feature vector, and the feature vector together form the final representation of the character:
ei(B,M,E,S)=[αi1vi(B);αi2vi(M);αi3vi(E);αi4vi(S)],
xi=[gi;ei(B,M,E,S)].
wherein v isi(L)={vi(B),vi(M),vi(E),vi(S)},[αi1i2i3i4]=αiIs the weight vector obtained in Step2.3, ei(B, M, W, S) represents the feature vector for four types of stitching, xiFeature vectors, g, representing information of the merged set of wordsiThe character vectors are character vectors in cross-border national culture texts.
4. The method for recognizing the trans-border national culture entity based on the word set feature weighting as claimed in claim 1, wherein: the specific steps of Step3 are as follows:
step3.1, to address the dependence of combined feature words in the cross-border national culture text, the cross-border national culture text feature vector x_i fused with word set information is fed into the reset gate and the update gate of the gated recurrent unit to extract the context-dependent feature information:

r_i = σ(W_r·[x_i, h_{i-1}]),   u_i = σ(W_u·[x_i, h_{i-1}]),

h̃_i = tanh(W_h·[r_i ⊙ h_{i-1}, x_i]),   h_i = (1 - u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i,

wherein σ is the sigmoid activation function, x_i is the feature vector fused with word set information, h_{i-1} is the hidden state of the previous time step, r_i is the reset gate, u_i is the update gate, W_r, W_u, W_h are training parameters, ⊙ denotes element-wise multiplication, and tanh(·) is the activation function; the long-range context-dependent feature vector h_i of the cross-border national culture text is thereby obtained;
And Step3.2, performing optimal label probability calculation on the feature vectors, and predicting the entity labels through a cross-border national culture entity recognition model.
5. Device for recognizing a cross-border cultural entity weighted based on word set features, characterized in that it comprises means for carrying out the method according to any one of claims 1 to 4.
CN202111068293.3A 2021-09-13 2021-09-13 Cross-border national culture entity identification method and device based on word set feature weighting Active CN113935324B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111068293.3A CN113935324B (en) 2021-09-13 2021-09-13 Cross-border national culture entity identification method and device based on word set feature weighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111068293.3A CN113935324B (en) 2021-09-13 2021-09-13 Cross-border national culture entity identification method and device based on word set feature weighting

Publications (2)

Publication Number Publication Date
CN113935324A (en) 2022-01-14
CN113935324B CN113935324B (en) 2022-10-28

Family

ID=79275641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111068293.3A Active CN113935324B (en) 2021-09-13 2021-09-13 Cross-border national culture entity identification method and device based on word set feature weighting

Country Status (1)

Country Link
CN (1) CN113935324B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN114970537A (en) * 2022-06-27 2022-08-30 昆明理工大学 Cross-border ethnic culture entity relationship extraction method and device based on multilayer labeling strategy

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101028A (en) * 2020-08-17 2020-12-18 淮阴工学院 Multi-feature bidirectional gating field expert entity extraction method and system
EP3767516A1 (en) * 2019-07-18 2021-01-20 Ricoh Company, Ltd. Named entity recognition method, apparatus, and computer-readable recording medium
CN113033206A (en) * 2021-04-01 2021-06-25 重庆交通大学 Bridge detection field text entity identification method based on machine reading understanding
CN113128232A (en) * 2021-05-11 2021-07-16 济南大学 Named entity recognition method based on ALBERT and multi-word information embedding

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3767516A1 (en) * 2019-07-18 2021-01-20 Ricoh Company, Ltd. Named entity recognition method, apparatus, and computer-readable recording medium
CN112101028A (en) * 2020-08-17 2020-12-18 淮阴工学院 Multi-feature bidirectional gating field expert entity extraction method and system
CN113033206A (en) * 2021-04-01 2021-06-25 重庆交通大学 Bridge detection field text entity identification method based on machine reading understanding
CN113128232A (en) * 2021-05-11 2021-07-16 济南大学 Named entity recognition method based on ALBERT and multi-word information embedding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU JIE等: "A Novel Dual Pointer Approach for Entity Mention Extraction", 《CHINESE JOURNAL OF ELECTRONICS》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN114970537A (en) * 2022-06-27 2022-08-30 昆明理工大学 Cross-border ethnic culture entity relationship extraction method and device based on multilayer labeling strategy
CN114970537B (en) * 2022-06-27 2024-04-23 昆明理工大学 Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy

Also Published As

Publication number Publication date
CN113935324B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN109960800B (en) Weak supervision text classification method and device based on active learning
CN110377903B (en) Sentence-level entity and relation combined extraction method
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
Silberer et al. Visually grounded meaning representations
CN106383816B (en) The recognition methods of Chinese minority area place name based on deep learning
CN109684642B (en) Abstract extraction method combining page parsing rule and NLP text vectorization
CN112257449B (en) Named entity recognition method and device, computer equipment and storage medium
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN107909115B (en) Image Chinese subtitle generating method
Tran et al. Understanding what the users say in chatbots: A case study for the Vietnamese language
CN113935324B (en) Cross-border national culture entity identification method and device based on word set feature weighting
WO2021042516A1 (en) Named-entity recognition method and device, and computer readable storage medium
CN110765769B (en) Clause feature-based entity attribute dependency emotion analysis method
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110555084A (en) remote supervision relation classification method based on PCNN and multi-layer attention
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN110569506A (en) Medical named entity recognition method based on medical dictionary
CN108509423A (en) A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM
CN111666752A (en) Circuit teaching material entity relation extraction method based on keyword attention mechanism
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
Wang et al. Sex trafficking detection with ordinal regression neural networks
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant