CN113935324A - Cross-border national culture entity identification method and device based on word set feature weighting - Google Patents

Cross-border national culture entity identification method and device based on word set feature weighting

Info

Publication number
CN113935324A
CN113935324A (application number CN202111068293.3A)
Authority
CN
China
Prior art keywords
cross
word
border
word set
national culture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111068293.3A
Other languages
Chinese (zh)
Other versions
CN113935324B (en)
Inventor
毛存礼
杨振平
余正涛
高盛祥
黄于欣
郭军军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202111068293.3A priority Critical patent/CN113935324B/en
Publication of CN113935324A publication Critical patent/CN113935324A/en
Application granted granted Critical
Publication of CN113935324B publication Critical patent/CN113935324B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a cross-border ethnic culture entity recognition method and device based on word set feature weighting, and belongs to the technical field of natural language processing. Targeting the characteristics of cross-border national culture entities, the method comprises four parts: cross-border national culture entity data labeling and preprocessing, cross-border national culture text feature representation fused with word set feature information, a cross-border national culture entity recognition model based on word set feature weighting, and cross-border national culture entity recognition. The cross-border national culture entity recognition device based on word set feature weighting is built from these four functional modules and performs entity recognition on input sentences.

Description

Cross-border national culture entity identification method and device based on word set feature weighting
Technical Field
The invention relates to a cross-border ethnic culture entity recognition method and device based on word set characteristic weighting, and belongs to the technical field of natural language processing.
Background
Information extraction comprises entity recognition, relation extraction and event extraction. Entity recognition is a basic task of information extraction: it determines entity boundaries and classifies the entities into predefined types. Mining cross-border ethnic culture entities helps expand the domain knowledge graph and supports information retrieval. Using entity recognition technology to automatically label entities related to cross-border national culture from the Internet shortens the time researchers spend manually extracting and processing information. Integrating lexical features into the entity recognition model addresses the problem of ambiguous entity boundaries in cross-border national culture text: fusing word set features into the model can achieve a good effect, alleviating fuzzy domain word boundaries and enhancing the representation of text semantic information. Cross-border national culture entities are usually combinations of domain vocabulary describing national culture characteristics, and a large number of domain words exist in cross-border national culture data; for example, 'Sangjing Bimai', a nickname of the Water-Splashing Festival, is a domain word belonging to the festival entity type, and an entity recognition method that fuses word set information can achieve a good effect on such data.
Disclosure of Invention
The invention provides a cross-border national culture entity recognition method and device based on word set feature weighting, aiming to improve the recognition of cross-border national culture entities with fuzzy boundaries and to enhance cross-border national culture text representation by fusing word set information.
The technical scheme of the invention is as follows: in a first aspect, a method for recognizing a cross-border national culture entity based on word set feature weighting comprises the following specific steps:
step1, marking and preprocessing the data of the cross-border ethnic culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
Due to the lack of entity data sets in the cross-border national culture field, six entity types, including diet, festival and custom, are defined according to the characteristics of the large number of domain entities in cross-border national culture data, and 15717 cross-border national culture sentences with entity labels are annotated manually; this data set provides good support for entity recognition model training.
Step2, cross-border national culture text feature representation of the feature information of the merged word set: acquiring a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for acquiring word set information and integrating the word set information into character vector representation;
Cross-border national culture entities are usually combinations of domain vocabulary describing national culture characteristics, such as the 'Meng Yong soil pan' in dietary culture. Because word sets carry word boundary and word meaning information, matching rules against a cross-border national culture domain dictionary are formulated to obtain four word sets, and a word set feature weighting method and position information coding are proposed to acquire the word set information and enhance the semantic information of cross-border national culture features.
Step3, training a cross-border national culture entity recognition model based on word set feature weighting: extracting context features of cross-border national culture sentences with a bidirectional gated recurrent unit, and training the entity recognition model based on word set feature weighting with optimal entity label probability calculation;
To let the model obtain the contextual semantic information of the cross-border national culture text (for example, the vector representation of a sentence describing the Dai specialty food 'grass grilled fish' needs to be associated with the context word 'citronella grass'), and to address the word dependence of combined features, a bidirectional gated recurrent unit is integrated into the invention to extract the context features of cross-border national culture sentences, and the entity recognition model based on word set feature weighting is trained with optimal entity label probability calculation.
Step4, recognizing cross-border ethnic culture entities: by using the trained cross-border national culture entity recognition model, the cross-border national culture entity recognition is carried out after data preprocessing is carried out on the input text.
As a preferable scheme of the invention, the Step1 comprises the following specific steps:
Step1.1, cross-border national culture data are collected from cross-border national culture websites and preprocessed by deduplication and special-character filtering, and every cross-border national culture sentence is annotated with its corresponding entity labels. For example, for the sentence 'The Dai nationality has many glutinous rice products with unique characteristics, such as fragrant bamboo rice, glutinous rice, rice cake, etc.', the entities are manually labeled as 'bamboo rice - diet culture, glutinous rice - diet culture, thousand-layer rice cake - diet culture', etc. In this way, 15717 cross-border national culture sentences with entity labels are manually annotated; the entity types in this domain comprise location, festival, diet, custom, literature and art, and architecture, and the analysis of the entity types is shown in Table 1:
TABLE 1 Cross-border ethnic culture entity type analysis
There are several specifications for character-level entity labels of cross-border national culture sentences, such as 'BIO' and 'BMESO'. Because most entities in the cross-border national culture field are composed of combined features, each cross-border national culture sentence is split into characters and separated from its label, and every character is tagged with the 'BMESO' scheme, where B marks the start of an entity, M a position inside an entity, E the end of an entity, S a single-character entity, and O a non-entity character. For example, for a sentence beginning with the custom entity '赕佛' (a Dai Buddhist offering custom), the corresponding label sequence is 'B-XS E-XS O O O O O O O O O', where B-XS marks the beginning of a custom-type entity, E-XS its end, and O a non-entity character. The defined cross-border ethnic culture entity tag format is shown in Table 2:
TABLE 2 Cross-border ethnic culture entity label format
Entity name | Entity type | Entity label
Ruili | Location | B-WZ/E-WZ
Water-Splashing Festival | Festival culture | B-JR/M-JR/E-JR
Canarium album | Diet culture | B-YS/M-YS/E-YS
赕佛 (Buddhist offering) | Custom culture | B-XS/E-XS
Hand dancing | Literature and art culture | B-WY/M-WY/E-WY
Soil palm room | Building culture | B-JZ/M-JZ/E-JZ
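To make the BMESO scheme above concrete, the following is a minimal sketch of converting entity-span annotations into per-character tags; the example sentence, the span positions and the helper name to_bmeso are illustrative assumptions, while the tag codes follow Table 2.

```python
# Illustrative sketch: convert entity-span annotations into per-character BMESO tags.
def to_bmeso(sentence, entities):
    """entities: list of (start, end, type_code) character spans, end exclusive."""
    tags = ["O"] * len(sentence)
    for start, end, code in entities:
        if end - start == 1:
            tags[start] = f"S-{code}"
        else:
            tags[start] = f"B-{code}"
            for i in range(start + 1, end - 1):
                tags[i] = f"M-{code}"
            tags[end - 1] = f"E-{code}"
    return tags

# Example sentence containing the festival entity "泼水节" (Water-Splashing Festival).
sentence = "泼水节是傣族的节日"
print(to_bmeso(sentence, [(0, 3, "JR")]))
# ['B-JR', 'M-JR', 'E-JR', 'O', 'O', 'O', 'O', 'O', 'O']
```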
Step1.2, the cross-border national culture data are deduplicated and filtered of special characters in order to construct a cross-border national culture domain dictionary, which is later used to obtain word set information and to enhance sentence semantic information. The domain dictionary is trained and built from cross-border national culture data collected from the web combined with domain words; it contains words related to festivals, architecture, customs, diet, locations and literature and art in cross-border national culture, such as the cross-border national culture words 'Nola dance (literature and art), Bei River county (location), Ara Da Gong (architecture), curry crab mango fragrant rice (diet), bathing-Buddha ceremony (custom), summer festival (festival)'.
Step1.3, a pre-trained language model is used for the character vector representation of the cross-border national culture text: after special processing, the characters are fed into a Transformer Encoder layer to obtain the vector representation of every character of the input text. For example, the text '傣族孔雀舞' (Dai peacock dance) is represented, after the element-wise addition of its three Embedding components, as E = {c_[CLS], c_傣, c_族, c_孔, c_雀, c_舞, c_[SEP]}, where c_[CLS] and c_[SEP] are the special token vectors of the text. The final output of the Transformer Encoder is obtained through a series of normalization and linear operations. A cross-border ethnic culture sentence is regarded as a character sequence S = {c_1, c_2, …, c_n} ∈ V_c, where V_c is a character-level vocabulary and c_i is the i-th character of the sentence S of length n. Following the pre-trained language model, each character c_i of the cross-border ethnic culture text is given a vector representation:

Q = c_i·W_Q,  K = c_i·W_K,  V = c_i·W_V,

Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V,

g_i = Attention(Q, K, V),

where W_Q, W_K, W_V are weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation. The final output of the Transformer Encoder is obtained through a series of normalization and linear operations; repeating this process for every character in the text dynamically generates the character vectors of the cross-border national culture text.
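As a sketch of the character-vector computation just described, the scaled dot-product attention can be written directly from the formulas above; the tensor dimensions, the random parameter initialization and the single attention layer are simplifying assumptions (a full pre-trained Transformer Encoder stacks many such layers with normalization and linear projections).

```python
import torch

d_model, d_k = 768, 64                    # assumed character-embedding and projection sizes
W_Q = torch.randn(d_model, d_k)           # weight parameters W_Q, W_K, W_V from the formulas
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

def char_attention(c):
    """c: (n, d_model) embeddings of the n characters of one sentence; returns g_i for each character."""
    Q, K, V = c @ W_Q, c @ W_K, c @ W_V
    scores = Q @ K.T / d_k ** 0.5              # Q·K^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V   # g = Attention(Q, K, V)

chars = torch.randn(7, d_model)   # e.g. [CLS] + the five characters of "傣族孔雀舞" + [SEP]
g = char_attention(chars)         # (7, d_k) contextual character vectors
```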
As a preferable scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, cross-border national culture domain word set matching method: word sets are obtained by matching the characters of a cross-border ethnic culture sentence against the domain dictionary, and four word sets are formed according to the character positions. The domain dictionary carries word boundary information and cross-border national culture text semantic information, and character matching preserves the boundary and semantic information of the matched words. A character c_i matched against the domain dictionary can yield different words, which are divided into four word set types according to the position of the character within the matched word: the character at the head of the word (B), inside the word (M), at the tail of the word (E), or as a single character (S). For example, for the entity 'crisp beef jerky' in diet culture, the word sets matched by the formula below for the character 'beef' (牛) are B = {beef, beef jerky}, M = {crisp beef jerky, crisp beef}, E = {crisp beef}, S = {beef}. Likewise, for the entity 'pineapple purple rice' in diet culture, the word sets matched for the character 'rice' (米) are B = {rice}, M = {pineapple purple rice, purple rice}, E = {purple rice}, S = {rice}.

For a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word sets matched for character c_i are:

B(c_i) = { w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n },
M(c_i) = { w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n },
E(c_i) = { w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i },
S(c_i) = { c_i | c_i ∈ V_w },

where V_w denotes the pre-constructed domain dictionary, w_{j,k} denotes a word in the dictionary spanning character positions j to k, i denotes the position of the character, j, k denote positions on the two sides of the character, and n denotes the number of characters in the sentence.
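A minimal sketch of the four-way matching above, using the crisp-beef-jerky example; the dictionary contents and the maximum word length are illustrative assumptions, and the domain dictionary is modeled as a plain Python set.

```python
# Sketch of Step2.1: match each character against a domain dictionary V_w and collect B/M/E/S sets.
domain_dict = {"牛", "牛肉", "牛肉干", "酥牛", "酥牛肉", "酥牛肉干"}  # illustrative dictionary contents
MAX_WORD_LEN = 6  # assumed upper bound on dictionary word length

def match_word_sets(sentence, i):
    """Return the B, M, E, S word sets for the character at position i (0-based)."""
    sets = {"B": set(), "M": set(), "E": set(), "S": set()}
    n = len(sentence)
    for j in range(max(0, i - MAX_WORD_LEN + 1), i + 1):
        for k in range(i + 1, min(n, j + MAX_WORD_LEN) + 1):
            w = sentence[j:k]           # candidate word covering position i
            if w not in domain_dict:
                continue
            if len(w) == 1:
                sets["S"].add(w)        # single character
            elif j == i:
                sets["B"].add(w)        # character at the head of the word
            elif k - 1 == i:
                sets["E"].add(w)        # character at the tail of the word
            else:
                sets["M"].add(w)        # character inside the word
    return sets

print(match_word_sets("酥牛肉干", 1))   # word sets matched for the character "牛"
```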
Step2.2, obtaining the word set vectors: the frequency of every word in the data set is counted; because word frequency reflects the importance of a word, a weighting method is used to assign the corresponding frequencies to the four types of word vectors. The frequencies of the words matched by the characters of the cross-border national culture text are fused into the word vectors, and the word vectors within each type are combined to obtain the word set vector of that type:

Z = Σ_{w ∈ B∪M∪E∪S} z(w),

v_i(L) = (1/Z) · Σ_{w ∈ L} z(w)·e(w),

where z(w_i) is the frequency of word w_i counted in the data set, e(w_i) is the word vector of w_i with dimension d_w = 50, L denotes one of the four types {B, M, E, S}, and v_i(L) is a word set vector of dimension 1 × d_w.
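The frequency weighting can be sketched as follows; this reading (a frequency-weighted combination normalized by the per-character total Z) is an assumption consistent with the stated 1 × d_w dimension, and the data structures and the zero-vector fallback for empty sets are illustrative.

```python
import numpy as np

d_w = 50  # word-vector dimension stated above

def word_set_vector(words, freq, emb, Z):
    """Frequency-weighted combination of the word vectors in one set (B, M, E or S).
    words: the words of the set; freq: word -> corpus frequency z(w);
    emb: word -> d_w vector e(w); Z: total frequency over B, M, E and S for this character."""
    if not words:
        return np.zeros(d_w)  # empty-set handling is an assumption
    return sum(freq[w] * emb[w] for w in words) / Z
```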
Step2.3, word set feature weighting to obtain the relative importance of the word set vectors: the word set vectors v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)} are each obtained by combining the word vectors within one type, with weights computed only inside that type. To fully account for the relative importance of the four word set vectors, a word set feature weighting method is used so that the more important word set vectors receive larger weights. Using the word set vectors v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)} obtained in Step2.2, a weight matrix W_v is learned by the neural network, and the final weight vector is output by the Softmax function:

V_i = W_v·[v_i(B); v_i(M); v_i(E); v_i(S)] + b_v,

α_i = Softmax(V_i),

where W_v is a training parameter of dimension 1 × d_w with d_w = 50, b_v is a bias of dimension 1 × 4, and the Softmax function is a normalization operation. The result is a weight vector α_i of dimension 1 × 4 with values in (0, 1).
Step2.4, position coding to enhance position information: the character positions in the cross-border national culture text contain word boundary information, and the words matched at different character positions differ, so position codes are added to the word set vectors; the four types of word set vectors are distinguished according to the character position, the four position types are represented as vectors, and the word set vectors fused with position coding are expressed as:

v_i(B) = p_i(B)·W_L + v_i(B),
v_i(M) = p_i(M)·W_L + v_i(M),
v_i(E) = p_i(E)·W_L + v_i(E),
v_i(S) = p_i(S)·W_L + v_i(S),

where p_i(B) = [1,0,0,0], p_i(M) = [0,1,0,0], p_i(E) = [0,0,1,0], p_i(S) = [0,0,0,1], and W_L is a 4 × d_w training parameter with d_w = 50.
Step2.5, fusing the word set information into the character vector representation: in order to retain as much domain dictionary information as possible, each character vector and the four word set vectors corresponding to that character are combined into one feature vector, which together form the final representation of the character:

e_i(B, M, E, S) = [α_i1·v_i(B); α_i2·v_i(M); α_i3·v_i(E); α_i4·v_i(S)],

x_i = [g_i; e_i(B, M, E, S)],

where [α_i1, α_i2, α_i3, α_i4] = α_i is the weight vector, e_i(B, M, E, S) is the concatenation of the four weighted word set vectors, x_i is the feature vector fused with word set information, and g_i is the character vector from Step1.3.
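Steps 2.3 to 2.5 can be sketched together as one fusion routine; d_w = 50 follows the text, while the character-vector size, the random parameter initialization and the variable names are assumptions.

```python
import torch

d_w, d_char = 50, 768                   # d_w from the text; character-vector size is assumed
W_v = torch.randn(1, d_w)               # 1 x d_w training parameter of Step2.3
b_v = torch.randn(1, 4)                 # 1 x 4 bias
W_L = torch.randn(4, d_w)               # 4 x d_w position-coding parameter of Step2.4
P = torch.eye(4)                        # one-hot position codes p_i(B) .. p_i(S)

def fuse(char_vec, set_vecs):
    """char_vec: (d_char,) character vector g_i; set_vecs: (4, d_w) rows v_i(B), v_i(M), v_i(E), v_i(S)."""
    set_vecs = P @ W_L + set_vecs                     # Step2.4: add position coding
    V_i = (set_vecs @ W_v.T).T + b_v                  # Step2.3: V_i = W_v[...] + b_v, shape (1, 4)
    alpha = torch.softmax(V_i, dim=-1).squeeze(0)     # weight vector alpha_i
    e_i = (alpha[:, None] * set_vecs).reshape(-1)     # Step2.5: weighted set vectors concatenated
    return torch.cat([char_vec, e_i])                 # x_i = [g_i ; e_i(B, M, E, S)]

x_i = fuse(torch.randn(d_char), torch.randn(4, d_w))  # (768 + 200,) fused feature vector
```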
As a preferable scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, to address the dependence of combined feature words in the cross-border national culture text, the feature vector x_i fused with word set information from Step2.5 is fed into the reset gate and the update gate of a bidirectional gated recurrent unit (GRU). The reset gate controls how much past information is discarded, i.e. how much of the previous content is forgotten and how much is retained and combined with the input of the current time step; when r approaches 0, the previous hidden state is ignored and only the current input is used for resetting. The update gate decides how much information is passed to the next state, which allows the model to copy information from the previous state and reduces the risk of vanishing gradients. The reset gate and update gate are computed as:

r_i = σ(W_r·[x_i, h_{i-1}]),

u_i = σ(W_u·[x_i, h_{i-1}]),

where σ is the sigmoid activation function, x_i is the feature vector fused with word set information, h_{i-1} is the hidden state of the previous time step, r_i is the reset gate, u_i is the update gate, and W_r, W_u are training parameters.

The new hidden state h_i is computed from the previous hidden state h_{i-1} and the current input x_i:

h̃_i = tanh(W_h·[r_i ⊙ h_{i-1}, x_i]),

h_i = (1 - u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i,

where W_h is a training parameter, ⊙ denotes element-wise multiplication, and tanh(·) is the activation function. The feature vector h_i obtained from the bidirectional GRU encoding layer captures the long-range dependencies between the context information in the cross-border ethnic culture text.
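The context encoding over the fused vectors x_i could be realized with PyTorch's built-in bidirectional GRU, as sketched below; the hidden size and batch layout are assumptions.

```python
import torch
import torch.nn as nn

d_x, d_hidden = 968, 128                      # size of x_i and GRU hidden size (assumed)
bigru = nn.GRU(input_size=d_x, hidden_size=d_hidden,
               bidirectional=True, batch_first=True)

x = torch.randn(1, 20, d_x)                   # one sentence of 20 fused character vectors
h, _ = bigru(x)                               # h: (1, 20, 2 * d_hidden) contextual features h_i
```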
Step3.2, the dependencies between cross-border national culture entity labels are considered to avoid error cases in cross-border national culture entity recognition, for example the unreasonable case in which the Water-Splashing Festival entity should be labeled 'B-JR M-JR E-JR' but the diet-type internal label 'M-YS' appears after 'B-JR' during training. Optimal label probability calculation is therefore performed on the feature vectors, and the entity labels are predicted by the cross-border national culture entity recognition model:

P_i = W_p·h_i + b_p,

score(S, y) = Σ_i ( P_{i, y_i} + T_{y_{i-1}, y_i} ),   p(y | S) = exp(score(S, y)) / Σ_{y'} exp(score(S, y')),

where W_p, b_p are the parameters used to compute the score matrix P, T is a transition matrix, and h_i is the output vector of Step3.1.
A self-attention mechanism is used to extract the importance of adjacent feature vectors, enhancing useful features and suppressing less useful ones. For the feature vector h_i produced by the bidirectional GRU encoder, the corresponding weights are computed with self-attention:

Q = h_i·W_Q,  K = h_i·W_K,  V = h_i·W_V,

Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V,

head_i = Attention(Q, K, V),

where W_Q, W_K, W_V are weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation.

The self-attention mechanism reflects the relevance and relative importance of the feature vectors and supports cross-border national culture entity recognition: because different feature vectors influence entity recognition differently, each is given a weight according to its influence, and the final output vector head_i is obtained. Self-attention further sharpens the distinction in importance among the components of the feature vectors, which benefits cross-border national culture entity recognition.
In a second aspect, an embodiment of the present invention further provides a cross-border ethnic cultural entity recognition apparatus weighted based on word set features, which includes modules for performing the method of the first aspect.
The invention has the beneficial effects that:
1. The invention fuses word set information into the entity recognition model; the word sets obtained by matching characters against the domain dictionary contain entity boundary information, and they are used to enhance the semantic information of cross-border national culture text, so the model achieves a better effect on cross-border national culture entity recognition.
2. The method obtains the relative importance of the word set vectors through word set feature weighting and uses position coding to enhance the position information of the word sets matched by the characters, making the word set features and vectors richer. Fusing the word set features into the character representation alleviates the entity boundary ambiguity that causes recognition errors in purely character-based entity recognition.
Drawings
FIG. 1 is a word set information diagram based on word set feature weighting in the present invention;
FIG. 2 is a diagram illustrating exemplary word frequency statistics in the present invention;
FIG. 3 is a cross-border national culture entity recognition frame diagram based on word set feature weighting in the present invention;
fig. 4 is an overall flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1 to 4, in a first aspect, a method for recognizing a cross-border national culture entity based on word set feature weighting specifically includes the following steps:
step1, marking and preprocessing the data of the cross-border ethnic culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
Due to the lack of entity data sets in the cross-border national culture field, six entity types, including diet, festival and custom, are defined according to the characteristics of the large number of domain entities in cross-border national culture data, and 15717 cross-border national culture sentences with entity labels are annotated manually; this data set provides good support for entity recognition model training.
Step2, cross-border national culture text feature representation of the feature information of the merged word set: acquiring a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for acquiring word set information and integrating the word set information into character vector representation;
Cross-border national culture entities are usually combinations of domain vocabulary describing national culture characteristics, such as the 'Meng Yong soil pan' in dietary culture. Because word sets carry word boundary and word meaning information, matching rules against a cross-border national culture domain dictionary are formulated to obtain four word sets, and a word set feature weighting method and position information coding are proposed to acquire the word set information and enhance the semantic information of cross-border national culture features.
Step3, training a cross-border national culture entity recognition model based on word set characteristic weighting; extracting the characteristics of the context of the cross-border national sentences by using the thought of a bidirectional gating circulation unit, and performing entity recognition model training based on word set characteristic weighting by adopting optimal entity label probability calculation;
To let the model obtain the contextual semantic information of the cross-border national culture text (for example, the vector representation of a sentence describing the Dai specialty food 'grass grilled fish' needs to be associated with the context word 'citronella grass'), and to address the word dependence of combined features, a bidirectional gated recurrent unit is integrated into the invention to extract the context features of cross-border national culture sentences, and the entity recognition model based on word set feature weighting is trained with optimal entity label probability calculation.
Step4, recognizing cross-border ethnic culture entities: by using the trained cross-border national culture entity recognition model, the cross-border national culture entity recognition is carried out after data preprocessing is carried out on the input text.
As a preferable scheme of the invention, the Step1 comprises the following specific steps:
step1.1, acquiring related cross-border national culture data on a cross-border national culture website, and manually marking 15717 cross-border national culture sentences with entity labels, wherein entity types are defined as 6 types: location, holiday culture, diet culture, custom culture, literary and artistic culture, and architectural culture; segmenting characters and corresponding labels in the cross-border national culture sentences to enable each character to correspond to one label, wherein the format of the corresponding entity label is shown in a table 3:
TABLE 3 Cross-border ethnic culture entity label format
Entity name | Entity type | Entity label
Ruili | Location | B-WZ/E-WZ
Water-Splashing Festival | Festival culture | B-JR/M-JR/E-JR
Canarium album | Diet culture | B-YS/M-YS/E-YS
赕佛 (Buddhist offering) | Custom culture | B-XS/E-XS
Hand dancing | Literature and art culture | B-WY/M-WY/E-WY
Soil palm room | Building culture | B-JZ/M-JZ/E-JZ
Step1.2, sentence semantic information is enhanced by constructing a cross-border national culture domain dictionary. The domain dictionary is trained and built from cross-border national culture data collected from the web combined with domain words; it contains words related to festivals, architecture, customs, diet, locations and literature and art in cross-border national culture, such as the cross-border national culture words 'Nola dance (literature and art), North River county (location), southeast Dagong (architecture), curry crab mango rice (diet), bathing-Buddha ceremony (custom), summer festival (festival)'.
Step1.3, a pre-trained language model is used for the character vector representation of the cross-border national culture text: after special processing, the characters are fed into a Transformer Encoder layer to obtain the vector representation of every character of the input text. For example, the text '傣族孔雀舞' (Dai peacock dance) is represented, after the element-wise addition of its three Embedding components, as E = {c_[CLS], c_傣, c_族, c_孔, c_雀, c_舞, c_[SEP]}, where c_[CLS] and c_[SEP] are the special token vectors of the text. The final output of the Transformer Encoder is obtained through a series of normalization and linear operations. A cross-border ethnic culture sentence is regarded as a character sequence S = {c_1, c_2, …, c_n} ∈ V_c, where V_c is a character-level vocabulary and c_i is the i-th character of the sentence S of length n. Following the pre-trained language model, each character c_i of the cross-border ethnic culture text is given a vector representation:

Q = c_i·W_Q,  K = c_i·W_K,  V = c_i·W_V,

Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V,

g_i = Attention(Q, K, V),

where W_Q, W_K, W_V are weight parameters, d_k is the dimension of the input feature vector, and Softmax is the normalization operation. The final output of the Transformer Encoder is obtained through a series of normalization and linear operations; repeating this process for every character in the text dynamically generates the character vectors of the cross-border national culture text.
As a preferable scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, cross-border national culture domain word set matching: word sets are obtained by matching the characters of a cross-border ethnic culture sentence against the domain dictionary, and four word sets are formed according to the character positions. The domain dictionary carries word boundary information and cross-border national culture text semantic information, and character matching preserves the boundary and semantic information of the matched words. A character c_i matched against the domain dictionary can yield different words, which are divided into four word set types according to the position of the character within the matched word: the character at the head of the word (B), inside the word (M), at the tail of the word (E), or as a single character (S).

For a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word sets matched for character c_i are:

B(c_i) = { w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n },
M(c_i) = { w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n },
E(c_i) = { w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i },
S(c_i) = { c_i | c_i ∈ V_w },

where V_w denotes the pre-constructed domain dictionary, w_{j,k} denotes a word in the dictionary spanning character positions j to k, i denotes the position of the character, j, k denote positions on the two sides of the character, and n denotes the number of characters in the sentence.
Step2.2, obtaining the word set vectors: as shown in Fig. 2, the frequency of each matched word is counted; because word frequency reflects the importance of a word, the matched word frequencies are fused into the word vectors, and a weighting method assigns the corresponding frequencies to the four types of word vectors. The frequencies of the words matched by the characters of the cross-border national culture text are fused into the word vectors, and the word vectors within each type are combined to obtain the word set vector of that type:

Z = Σ_{w ∈ B∪M∪E∪S} z(w),

v_i(L) = (1/Z) · Σ_{w ∈ L} z(w)·e(w),

where z(w_i) is the frequency of word w_i counted in the data set, e(w_i) is the word vector of w_i with dimension d_w = 50, L denotes one of the four types {B, M, E, S}, and v_i(L) is a word set vector of dimension 1 × d_w.
Step2.3, word set feature weighting to obtain the relative importance of the word set vectors: each word set vector v_i(L) is obtained by combining the word vectors within one type, with weights computed only inside that type. To fully account for the relative importance of the four word set vectors, a word set feature weighting method is used: a weight matrix W_v is learned by the neural network, and the final weight vector is output by the Softmax function:

V_i = W_v·[v_i(B); v_i(M); v_i(E); v_i(S)] + b_v,

α_i = Softmax(V_i),

where W_v is a training parameter of dimension 1 × d_w with d_w = 50, b_v is a bias of dimension 1 × 4, and the Softmax function is a normalization operation. The result is a weight vector α_i of dimension 1 × 4 with values in (0, 1).
Step2.4, position coding to enhance position information: the character positions in the cross-border national culture text contain word boundary information, and the words matched at different character positions differ, so position codes are added to the word set vectors; the four types of word set vectors are distinguished according to the character position, the four position types are represented as vectors, and the word set vectors fused with position coding are expressed as:

v_i(B) = p_i(B)·W_L + v_i(B),
v_i(M) = p_i(M)·W_L + v_i(M),
v_i(E) = p_i(E)·W_L + v_i(E),
v_i(S) = p_i(S)·W_L + v_i(S),

where p_i(B) = [1,0,0,0], p_i(M) = [0,1,0,0], p_i(E) = [0,0,1,0], p_i(S) = [0,0,0,1], and W_L is a 4 × d_w training parameter with d_w = 50.
Step2.5, fusing the word set information into the character vector representation: in order to retain as much domain dictionary information as possible, each character vector and the four word set vectors corresponding to that character are combined into one feature vector, which together form the final representation of the character:

e_i(B, M, E, S) = [α_i1·v_i(B); α_i2·v_i(M); α_i3·v_i(E); α_i4·v_i(S)],

x_i = [g_i; e_i(B, M, E, S)],

where [α_i1, α_i2, α_i3, α_i4] = α_i is the weight vector, e_i(B, M, E, S) is the concatenation of the four weighted word set vectors, x_i is the feature vector fused with word set information, and g_i is the character vector.
As a preferable scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, a bidirectional GRU is used to extract features from the vector representation of the cross-border national culture text fused with word set information. The feature vector x_i fused with word set information is fed into the reset gate and the update gate; the reset gate controls how much past information is discarded, i.e. how much of the previous content is forgotten and how much is retained and combined with the input of the current time step. When r approaches 0, the previous hidden state is ignored and only the current input is used for resetting. The update gate decides how much information is passed to the next state, which allows the model to copy information from the previous state and reduces the risk of vanishing gradients. The reset gate and update gate are computed as:

r_i = σ(W_r·[x_i, h_{i-1}]),

u_i = σ(W_u·[x_i, h_{i-1}]),

where σ is the sigmoid activation function, x_i is the feature vector fused with word set information, h_{i-1} is the hidden state of the previous time step, r_i is the reset gate, u_i is the update gate, and W_r, W_u are training parameters.

In the bidirectional GRU, the new hidden state h_i is computed from the previous hidden state h_{i-1} and the current input x_i:

h̃_i = tanh(W_h·[r_i ⊙ h_{i-1}, x_i]),

h_i = (1 - u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i,

where W_h is a training parameter, ⊙ denotes element-wise multiplication, and tanh(·) is the activation function. The feature vector h_i obtained from the bidirectional GRU encoding layer captures the long-range dependencies between the context information in the cross-border ethnic culture text.
Step3.2, a self-attention mechanism extracts the importance of adjacent feature vectors, enhancing useful features and suppressing less useful ones. For the feature vector h_i produced by the bidirectional GRU encoder, the feature vector weights are computed with self-attention:

Q = h_i·W_Q,  K = h_i·W_K,  V = h_i·W_V,

Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V,

head_i = Attention(Q, K, V),

where W_Q, W_K, W_V are weight parameters, d_k = 50 is the dimension of the input feature vector, and Softmax is the normalization operation.
As a preferable scheme of the invention, the Step4 comprises the following specific steps:
Step4.1, following the idea of global optimization, a globally optimal tag sequence is obtained by considering the dependencies between tags, which prevents error cases such as the unreasonable situation where a 'diet' tag follows a 'festival' tag.
For the characters s = {c_1, c_2, …, c_n} ∈ V_c of a cross-border ethnic culture text, the probability of the corresponding predicted tag sequence y = {y_1, y_2, …, y_n} is computed as:

P_i = W_p·head_i + b_p,

score(s, y) = Σ_i ( P_{i, y_i} + T_{y_{i-1}, y_i} ),   p(y | s) = exp(score(s, y)) / Σ_{y'} exp(score(s, y')),

where W_p, b_p are the parameters used to compute the score matrix P, T is a transition matrix, and head_i is the output vector of Step3.2. In the final decoding stage of label prediction, the Viterbi algorithm is used to predict the globally optimal label sequence.
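A compact sketch of the Viterbi decoding step over the emission scores P and transition matrix T; the label indexing and a separately supplied T are assumptions (a real CRF layer learns T jointly during training).

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (n, L) scores P_i; transitions: (L, L) matrix T[prev, curr].
    Returns the globally optimal label-index sequence."""
    n, L = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        total = score[:, None] + transitions + emissions[i][None, :]  # (L, L) candidate scores
        backptr[i] = total.argmax(axis=0)                             # best previous label per current label
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]
```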
To illustrate the effect of the invention, the following comparative experiments were carried out; the experimental data all come from the manually labeled cross-border national culture data set.
The model is evaluated by Precision, Recall and the F1 value, computed as follows:

Precision = TP / (TP + FP),

Recall = TP / (TP + FN),

F1 = 2 × Precision × Recall / (Precision + Recall),

where TP, FP and FN denote the numbers of correctly recognized entities, incorrectly recognized entities and missed entities, respectively.
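A sketch of computing the three metrics at the entity level; comparing predicted and gold entity spans as sets is a common convention and an assumption here, since the text only gives the formulas.

```python
def precision_recall_f1(gold_spans, pred_spans):
    """gold_spans / pred_spans: sets of (start, end, entity_type) tuples."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```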
In order to verify the effect of the cross-border national culture entity recognition model based on word set feature weighting, the following comparative test is designed for analysis. Compared with Bi-LSTM, Lattice-LSTM, LR-CNN, FLAT and SoftLexicon (LSTM) entity identification methods, the specific experimental results are shown in Table 4.
TABLE 4 comparative experiments of different methods
Method | P(%) | R(%) | F1(%)
Bi-LSTM+CRF | 83.59 | 91.52 | 87.38
Lattice-LSTM | 89.08 | 92.52 | 90.76
LR-CNN | 92.81 | 90.15 | 91.46
FLAT | 92.76 | 95.05 | 93.89
SoftLexicon(LSTM) | 90.68 | 93.39 | 92.01
The method of the invention | 95.56 | 94.01 | 94.72
The experiments show that, compared with the Bi-LSTM+CRF model, the method uses word set information to enhance the contextual semantic information of the text; compared with the Lattice-LSTM, LR-CNN, FLAT and SoftLexicon (LSTM) models, the method incorporates the cross-border national culture domain dictionary and adopts position coding to enhance the word set position information, so the word sets matched through the characters are more complete.
Table 5 compares the effect on the experimental results of fusing position coding only, word set feature weighting only, and both position coding and word set feature weighting during training of the cross-border national culture entity recognition model.
TABLE 5 influence of word set feature weighting and position coding on the model
Configuration | P(%) | R(%) | F1(%)
Fusing position coding | 94.72 | 93.25 | 93.98
Word set feature weighting | 94.15 | 92.39 | 93.26
Fusing position coding + word set feature weighting | 95.56 | 94.01 | 94.72
The experimental results show that the fused coding information affects the outcome: when only position coding is fused into the model, the F1 value is 1.46% lower than when both position coding and word set feature weighting are fused, which verifies that word set feature weighting helps distinguish the importance of the four word set vectors; when only word set feature weighting is fused into the model, the F1 value is 0.74% lower than the combined setting, which indicates that position coding enhances the word set position information. Adding position coding and word set feature weighting together captures the word set information more fully and further improves the accuracy of cross-border national culture entity recognition.
The following is an embodiment of the device of the present invention. An embodiment of the present invention further provides a cross-border national culture entity recognition apparatus based on word set feature weighting, which includes modules for performing the method of the first aspect. The apparatus specifically comprises the following modules:
the cross-border national culture data preprocessing module: the method is used for labeling and preprocessing the data of the cross-border national culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
the cross-border national culture text feature representation module integrated with the word set feature information comprises: the method comprises the steps of obtaining a word set through cross-border national culture field dictionary matching, providing a word set characteristic weighting method and position information codes for obtaining word set information, and integrating the word set information into character vector representation;
a cross-border national culture entity recognition model training module based on word set feature weighting: used for extracting context features of cross-border national culture sentences with a bidirectional gated recurrent unit and training the entity recognition model based on word set feature weighting with optimal entity label probability calculation;
the cross-border ethnic culture entity recognition module: the cross-border national culture entity recognition method is used for performing cross-border national culture entity recognition after data preprocessing is performed on input texts by using a trained cross-border national culture entity recognition model.
In one possible implementation, the cross-border ethnic culture entity identification module is further configured to: deploy the trained model on a local server, expose the model as an application interface with the Sanic framework so that it can be called directly from a web page, and return the predicted entities to the front-end interface for display.
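A minimal sketch of exposing the trained recognizer through Sanic as the module description suggests; the endpoint path, the request format and the recognize placeholder are assumptions, not part of the patent.

```python
from sanic import Sanic
from sanic.response import json as json_response

app = Sanic("cross_border_ner")

def recognize(text):
    # Placeholder for the trained cross-border national culture entity recognition model.
    return [{"entity": "Water-Splashing Festival", "type": "JR"}]

@app.post("/ner")
async def ner(request):
    text = request.json.get("text", "")
    return json_response({"entities": recognize(text)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```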
In one possible implementation, the cross-border ethnic culture data preprocessing module is further configured to:
acquiring cross-border national culture data through a cross-border national culture website, performing de-duplication and special character filtering pretreatment on the data, and manually marking 15717 cross-border national culture sentences with entity labels, wherein the entities in the fields comprise positions, festivals, diet, customs, literature and buildings;
performing duplicate removal and special character filtering on the cross-border national culture data to construct a cross-border national culture field dictionary so as to obtain word set information later, training the cross-border national culture data acquired on the network by combining field words to obtain word vectors and construct a field dictionary, wherein the field dictionary contains words related to festivals, buildings, customs, diet, positions and literary and artistic activities in the cross-border national culture;
performing character vector representation on the cross-border national culture text by adopting a pre-training language model, processing the characters, and then inputting the characters into a Transformer Encoder layer to obtain vector representation of each character of the input text; and a final output of the Transformer Encoder is obtained through a series of normalization and linear processing, so that the dynamic generation of character vectors in the cross-border national culture text is realized.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. The cross-border ethnic culture entity recognition method based on word set feature weighting is characterized by comprising the following steps of: the method comprises the following specific steps:
step1, marking and preprocessing the data of the cross-border ethnic culture entity: performing character filtering on an input cross-border national culture sentence, segmenting the sentence into characters and performing character vector representation;
step2, cross-border national culture text feature representation of the feature information of the merged word set: acquiring a word set through cross-border national culture field dictionary matching, and providing a word set characteristic weighting method and position information codes for acquiring word set information and integrating the word set information into character vector representation;
step3, training a cross-border national culture entity recognition model based on word set feature weighting: extracting context features of cross-border national culture sentences with a bidirectional gated recurrent unit, and training the entity recognition model based on word set feature weighting with optimal entity label probability calculation;
step4, recognizing cross-border ethnic culture entities: by using the trained cross-border national culture entity recognition model, the cross-border national culture entity recognition is carried out after data preprocessing is carried out on the input text.
2. The method for recognizing the trans-border national culture entity based on the word set feature weighting as claimed in claim 1, wherein: the specific steps of Step1 are as follows:
step1.1, acquiring cross-border national culture data through a cross-border national culture website, performing de-duplication and special character filtering pretreatment on the data, and then manually marking 15717 cross-border national culture sentences with entity labels, wherein the entities in the fields comprise positions, festivals, diet, customs, literature and buildings;
step1.2, performing duplicate removal and special character filtering on the cross-border national culture data to construct a cross-border national culture field dictionary so as to obtain word set information later, training the cross-border national culture data obtained on the network by combining field words to obtain word vectors and construct a field dictionary, wherein the field dictionary contains words related to festivals, buildings, customs, diet, positions and literature and art in the cross-border national culture;
step1.3, performing character vector representation on the cross-border national culture text by adopting a pre-training language model, processing the characters, and then inputting the characters into a Transformer Encoder layer to obtain vector representation of each character of the input text; and a final output of the Transformer Encoder is obtained through a series of normalization and linear processing, so that the dynamic generation of character vectors in the cross-border national culture text is realized.
3. The method for recognizing the trans-border national culture entity based on the word set feature weighting as claimed in claim 1, wherein: the specific steps of Step2 are as follows:
step2.1, the word set is that all possibly matched words are obtained from a dictionary through cross-border national culture characters, four word sets are formed according to the positions of the characters, and the words are divided into four word sets according to different positions of the characters in the matched words: the character is located at the head part (B) of the word, the character is located in the interior (M) of the word, the character is located at the tail part (E) of the word, and a single character (S) is marked by four labels;
for a cross-border national culture sentence S = {c_1, c_2, …, c_n} ∈ V_c, the four position-type word set matching rules for character c_i are:

B(c_i) = { w_{i,k} | w_{i,k} ∈ V_w, i < k ≤ n },
M(c_i) = { w_{j,k} | w_{j,k} ∈ V_w, 1 ≤ j < i < k ≤ n },
E(c_i) = { w_{j,i} | w_{j,i} ∈ V_w, 1 ≤ j < i },
S(c_i) = { c_i | c_i ∈ V_w },

wherein V_w represents a pre-constructed domain dictionary, w_{j,k} represents a word in the domain dictionary spanning positions j to k, i represents the position of a character, j, k represent positions on the two sides of the character, and n represents the number of characters in the sentence;
step2.2, performing vectorized feature representation of the word sets: the frequency of each word in the data set is counted, and because word frequency represents the importance of a word, a weighting method assigns the frequencies to the four types of word vectors:

v_i(L) = (1/Z) · Σ_{w ∈ L} z(w)·e(w),   with Z = Σ_{w ∈ B∪M∪E∪S} z(w),

wherein z(w_i) is the frequency of word w_i counted in the data set, e(w_i) is the word vector of w_i, L represents one of the four word set types B, M, E, S, and v_i(L) is a word set vector;
step2.3, in order to fully consider the relative importance of the four word set vectors, a word set feature weighting method is used to obtain the importance among the word set vectors, so that important word set vectors receive larger weights; using the word set vectors v_i(L) = {v_i(B), v_i(M), v_i(E), v_i(S)} obtained in Step2.2, a weight matrix W_v is obtained through neural network training, and the final weight vector is output by the Softmax function:

V_i = W_v·[v_i(B); v_i(M); v_i(E); v_i(S)] + b_v,   α_i = Softmax(V_i),

wherein W_v is a training parameter, b_v is the bias parameter of the neural network, the Softmax function is a normalization operation, and the result is a weight vector α_i with values in (0, 1);
Step2.4, word set information is merged into the character vector representation: in order to retain as much domain dictionary information as possible, each character vector and the four types of word set vectors corresponding to the character are combined into a feature vector, and the feature vector together form the final representation of the character:
ei(B,M,E,S)=[αi1vi(B);αi2vi(M);αi3vi(E);αi4vi(S)],
xi=[gi;ei(B,M,E,S)].
wherein v isi(L)={vi(B),vi(M),vi(E),vi(S)},[αi1i2i3i4]=αiIs the weight vector obtained in Step2.3, ei(B, M, W, S) represents the feature vector for four types of stitching, xiFeature vectors, g, representing information of the merged set of wordsiThe character vectors are character vectors in cross-border national culture texts.
4. The method for recognizing the trans-border national culture entity based on the word set feature weighting as claimed in claim 1, wherein: the specific steps of Step3 are as follows:
step3.1, to address the dependence of combined feature words in the cross-border national culture text, the cross-border national culture text feature vector x_i fused with word set information is fed into the reset gate and the update gate of the gated recurrent unit to extract the context-dependent feature information:

r_i = σ(W_r·[x_i, h_{i-1}]),   u_i = σ(W_u·[x_i, h_{i-1}]),

h̃_i = tanh(W_h·[r_i ⊙ h_{i-1}, x_i]),   h_i = (1 - u_i) ⊙ h_{i-1} + u_i ⊙ h̃_i,

wherein σ is the sigmoid activation function, x_i is the feature vector fused with word set information, h_{i-1} is the hidden state of the previous time step, r_i is the reset gate, u_i is the update gate, W_r, W_u, W_h are training parameters, ⊙ denotes element-wise multiplication, and tanh(·) is the activation function; the long-range context-dependent feature vector h_i of the cross-border national culture text is thereby obtained;
And Step3.2, performing optimal label probability calculation on the feature vectors, and predicting the entity labels through a cross-border national culture entity recognition model.
5. Device for recognizing a cross-border cultural entity weighted based on word set features, characterized in that it comprises means for carrying out the method according to any one of claims 1 to 4.
CN202111068293.3A 2021-09-13 2021-09-13 Cross-border national culture entity identification method and device based on word set feature weighting Active CN113935324B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111068293.3A CN113935324B (en) 2021-09-13 2021-09-13 Cross-border national culture entity identification method and device based on word set feature weighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111068293.3A CN113935324B (en) 2021-09-13 2021-09-13 Cross-border national culture entity identification method and device based on word set feature weighting

Publications (2)

Publication Number Publication Date
CN113935324A (en) 2022-01-14
CN113935324B CN113935324B (en) 2022-10-28

Family

ID=79275641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111068293.3A Active CN113935324B (en) 2021-09-13 2021-09-13 Cross-border national culture entity identification method and device based on word set feature weighting

Country Status (1)

Country Link
CN (1) CN113935324B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN114970537A (en) * 2022-06-27 2022-08-30 昆明理工大学 Cross-border ethnic culture entity relationship extraction method and device based on multilayer labeling strategy

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101028A (en) * 2020-08-17 2020-12-18 淮阴工学院 Multi-feature bidirectional gating field expert entity extraction method and system
EP3767516A1 (en) * 2019-07-18 2021-01-20 Ricoh Company, Ltd. Named entity recognition method, apparatus, and computer-readable recording medium
CN113033206A (en) * 2021-04-01 2021-06-25 重庆交通大学 Bridge detection field text entity identification method based on machine reading understanding
CN113128232A (en) * 2021-05-11 2021-07-16 济南大学 Named entity recognition method based on ALBERT and multi-word information embedding

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3767516A1 (en) * 2019-07-18 2021-01-20 Ricoh Company, Ltd. Named entity recognition method, apparatus, and computer-readable recording medium
CN112101028A (en) * 2020-08-17 2020-12-18 淮阴工学院 Multi-feature bidirectional gating field expert entity extraction method and system
CN113033206A (en) * 2021-04-01 2021-06-25 重庆交通大学 Bridge detection field text entity identification method based on machine reading understanding
CN113128232A (en) * 2021-05-11 2021-07-16 济南大学 Named entity recognition method based on ALBERT and multi-word information embedding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU JIE等: "A Novel Dual Pointer Approach for Entity Mention Extraction", 《CHINESE JOURNAL OF ELECTRONICS》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN114970537A (en) * 2022-06-27 2022-08-30 昆明理工大学 Cross-border ethnic culture entity relationship extraction method and device based on multilayer labeling strategy
CN114970537B (en) * 2022-06-27 2024-04-23 昆明理工大学 Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy

Also Published As

Publication number Publication date
CN113935324B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN109960800B (en) Weak supervision text classification method and device based on active learning
CN110377903B (en) Sentence-level entity and relation combined extraction method
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
Silberer et al. Visually grounded meaning representations
CN106383816B (en) The recognition methods of Chinese minority area place name based on deep learning
CN109684642B (en) Abstract extraction method combining page parsing rule and NLP text vectorization
CN112257449B (en) Named entity recognition method and device, computer equipment and storage medium
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN107909115B (en) Image Chinese subtitle generating method
Tran et al. Understanding what the users say in chatbots: A case study for the Vietnamese language
CN113935324B (en) Cross-border national culture entity identification method and device based on word set feature weighting
WO2021042516A1 (en) Named-entity recognition method and device, and computer readable storage medium
CN110765769B (en) Clause feature-based entity attribute dependency emotion analysis method
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110555084A (en) remote supervision relation classification method based on PCNN and multi-layer attention
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN110569506A (en) Medical named entity recognition method based on medical dictionary
CN108509423A (en) A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM
CN111666752A (en) Circuit teaching material entity relation extraction method based on keyword attention mechanism
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
Wang et al. Sex trafficking detection with ordinal regression neural networks
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant