CN114781380A - Chinese named entity recognition method, equipment and medium fusing multi-granularity information - Google Patents

Info

Publication number
CN114781380A
CN114781380A (application CN202210277553.6A)
Authority
CN
China
Prior art keywords
characters
word
sequence
named entity
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210277553.6A
Other languages
Chinese (zh)
Inventor
李丽洁
胡双阳
韩启龙
宋洪涛
王也
马志强
张海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202210277553.6A
Publication of CN114781380A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention provides a Chinese named entity recognition method, equipment and medium fusing multi-granularity information. The method comprises the following steps: (1) acquiring a field corpus data set, preprocessing the data set, and dividing it into a training set, a test set and a verification set; (2) extracting and fusing the character, soft-word and radical pre-training vectors of the corpus data preprocessed in step (1); (3) constructing a Chinese named entity recognition model fusing multi-granularity information; (4) inputting the data obtained in step (2) into the model for training; (5) processing and calculating the data to be recognized with the recognition model obtained in step (4) to obtain the named entity recognition result. Aiming at the shortcomings of existing Chinese named entity recognition, the method fuses radical-level information to capture the semantic information inherent within the characters of a sequence, uses an expanded soft-word module to obtain word-level semantic information, and fuses both into the character embedding vector, thereby improving the accuracy of Chinese named entity recognition.

Description

Chinese named entity recognition method, equipment and medium fusing multi-granularity information
Technical Field
The invention belongs to the technical field of named entity recognition, and particularly relates to a method, equipment and medium for Chinese named entity recognition fusing multi-granularity information.
Background
With the continuous development of the economy and of computer technology, vast amounts of text appear on the network at every moment. These texts contain information about many aspects of society, the economy, daily life, and science and technology, but because of their volume and variety, the information they contain is often difficult to use effectively.
Named entity recognition solves this problem to a certain extent. Its core task is: for a given text, identify and extract the named entities in its sentences that carry key information, such as person names, place names, and organization names. Named entity recognition is a fundamental task in the field of natural language processing; many downstream tasks, such as question-answering systems, knowledge graph construction, and information extraction, all depend on it. Because Chinese has no obvious word boundaries such as spaces and its characters are packed closely together, Chinese named entity recognition is generally harder than English named entity recognition. Named entity recognition is usually treated as a sequence labeling problem, and most traditional named entity recognition models are linear statistical models, such as hidden Markov models, maximum-entropy hidden Markov models, conditional random fields, and support vector machines. In recent years, deep learning methods have achieved good results on the named entity recognition task by virtue of their powerful representational ability, and are gradually becoming the mainstream approach. Existing Chinese named entity recognition models fall into character-based models and word-based models. A word-based model first requires a Chinese word segmentation system: the input sequence is segmented into words before being fed to the model. However, because of the complexity of Chinese, segmentation systems cannot avoid segmentation errors, and such errors propagate to the end of the sequence and degrade model performance. Character-based models avoid this problem but find it difficult to utilize the word information in the sequence. A named entity is typically composed of one or more consecutive words, and word boundaries often coincide with entity boundaries, so effectively exploiting the word information in a sequence can greatly improve named entity recognition performance. To incorporate word information into character-based models, many researchers match potential words in the sequence against an external dictionary. Typically, Zhang and Yang proposed the lattice model, which uses a gating mechanism to weight the potential words in the sequence that match a dictionary and incorporate them into a character-based model. Later, Ma et al. proposed a simplified lattice model to address the facts that the directed acyclic graph structure of the lattice model cannot be trained in batches, making it inefficient, and that the lattice model suffers from degradation. The simplified lattice model replaces the gating mechanism of the lattice model with a soft-word strategy and a weight-fusion mechanism, which effectively avoids model degradation, fixes the sentence length, and greatly improves efficiency. Although the simplified lattice model uses word information directly and effectively, the following problems remain. On the one hand, for longer words the soft-word method loses some of the information in the middle group. For example, for the input sequence "中国足球队" (Chinese football team), the dictionary candidate words in which the character "球" (ball) occupies a middle position include "足球队" (football team) and "中国足球队" (Chinese football team).
The soft-word method lumps such words into a single middle group and does not distinguish the specific position each word places the character in, yet this relative position information is very important for recognizing named entities, and the problem becomes more severe as entity length increases. On the other hand, the radical-level semantic information inside the characters of a sequence goes unexplored and unexploited: a pre-trained language model can effectively capture the contextual semantic information of the sequence, but it cannot obtain the inherent semantic information carried by the pictographic structure of a character, which manifests concretely as its radical, its structural composition, and its writing sequence. In summary, the main problems of current research are that models are either susceptible to word segmentation errors or under-utilize word information, that the radical-level semantic information inside characters is not considered, and that the lattice family of models fails to fully exploit the semantic information of the three granularities in a sequence, namely word, character, and radical, so recognition accuracy still has room for improvement.
Disclosure of Invention
The invention aims to solve the problems that traditional Chinese named entity recognition methods find it difficult to fully utilize the information in a sequence and therefore recognize entities poorly, and provides a Chinese named entity recognition method, equipment and medium fusing multi-granularity information.
The invention is realized by the following technical scheme. The invention provides a Chinese named entity recognition method fusing multi-granularity information, which specifically comprises the following steps:
step 1: acquiring a field corpus data set, preprocessing the data set, and dividing the data set into a training set, a test set and a verification set;
step 2: extracting characters, soft words and radical pre-training vectors in the corpus data preprocessed in the step 1 for vector fusion, and constructing a Chinese named entity recognition model fusing multi-granularity information;
step 3: inputting the data obtained in step 2 into the model for training;
step 4: processing and calculating the data to be recognized by using the Chinese named entity recognition model fusing multi-granularity information obtained in step 3 to obtain a named entity recognition result.
Further, the step 1 specifically includes the following steps:
step 1.1: identifying the named entities in sentence-level corpus data and marking them as predefined types, wherein the types comprise person names, place names and organization names;
step 1.2: dividing the labeled result into character-level corpus data using the BMESO tagging scheme, in the form: character + its entity-position tag + the predefined type to which it belongs;
step 1.3: and dividing the preprocessed data set into a training set, a test set and a verification set according to a certain proportion.
Further, the step 2 specifically includes the following steps:
step 2.1: for the characters in the sequence, performing character mapping on the character sequence one by using a pre-training language model, and encoding each character in the input sequence into a low-dimensional dense embedded vector;
step 2.2: for candidate words corresponding to characters in the sequence: establishing a vocabulary lookup tree based on an external dictionary, matching candidate words corresponding to characters in a sentence, establishing an expanded soft word set, and weighting the expanded soft word set corresponding to the characters by using a weight fusion strategy to obtain word-level vectors corresponding to the characters;
step 2.3: for the radical-level features corresponding to characters in the sequence: constructing a radical level feature lookup table for common Chinese characters, expressing features as pre-trained embedded vectors, and extracting radical level feature embedded vectors by using a convolutional neural network;
step 2.4: sequentially splicing characters, soft words and radical level feature vectors;
step 2.5: performing a padding/truncation operation on each sentence in the data set so that it has a fixed length: for sentences longer than the prescribed length, discarding the part beyond that length; for sentences shorter than the prescribed length, padding them up to the prescribed length;
step 2.6: inputting the fixed-length sentences into the model in groups of size Batch_Size, each subsequence in a batch being one sentence;
step 2.7: performing forward LSTM encoding and backward LSTM encoding on the feature vectors in the batch at the hidden layer, and concatenating the forward and backward hidden vectors to obtain the bidirectional feature vectors of the data.
Further, the step 2.2 specifically includes the following steps:
step 2.2.1: traversing an external dictionary and constructing a vocabulary prefix search tree;
step 2.2.2: matching candidate words in the sentence by using a vocabulary searching tree, and constructing a soft word set for the characters according to the positions of the characters in the candidate words;
step 2.2.3: counting the total times of the candidate words appearing in the corpus data and the times of the candidate words appearing in each position of the soft word set to obtain the weight of the candidate words in each position of the soft word set;
step 2.2.4: and weighting the candidate words at all positions corresponding to the characters, and splicing soft word level vectors.
Further, the step 2.3 specifically includes the following steps:
step 2.3.1: constructing a radical-level feature lookup table for common Chinese characters, whose radical-level features comprise: the simplified/traditional radical of the character, the structural composition of the character and the writing sequence of the character, in the form: character-radical-structural composition-writing-order sequence;
step 2.3.2: searching the pre-trained embedded-vector lookup table and representing each radical-level feature corresponding to the character as an embedded vector of dimension d, whereupon the radical-level features corresponding to the character are represented as an embedded-vector matrix;
step 2.3.3: fixing the dimension of the embedding matrix to k × d: for matrices longer than k, performing a truncation operation and keeping the first k features; for matrices shorter than k, randomly initializing padding up to length k;
step 2.3.4: performing x consecutive one-dimensional convolutions and a max-pooling operation on the fixed-dimension radical-level feature embedding matrix to obtain a d-dimensional embedded vector representing the radical-level features corresponding to the character.
Further, the step 3 specifically includes the following steps:
step 3.1: performing iterative update calculation on the bidirectional feature vector in the hidden layer;
step 3.2: inputting the result into a CRF layer, iteratively updating the emission probability and the transition probability, and calculating a maximum score sequence;
step 3.3: and updating and storing the parameters of the trained model.
Further, the step 4 specifically includes the following steps:
step 4.1: taking the Chinese text sequence to be recognized, in units of characters, as the input of the model;
step 4.2: calculating and outputting the entity recognition result.
Further, the corpus data over which the total number of occurrences of the candidate words is counted refers to the training set plus the test set.
The invention provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the above Chinese named entity recognition method fusing multi-granularity information.
The invention provides a computer readable storage medium for storing computer instructions, wherein the computer instructions, when executed by a processor, implement the steps of the Chinese named entity recognition method fusing multi-granularity information.
Compared with the prior art, the invention has the beneficial effect of fully utilizing the potential information in the sequence on the basis of the BiLSTM model: it mines the radical-level semantic information in the sequence and expands the original soft-word method, thereby better meeting the challenge brought by increasing entity length and improving the accuracy of named entity recognition.
Drawings
FIG. 1 is a flow chart of a Chinese named entity recognition method incorporating multi-granularity information;
FIG. 2 is a model framework diagram of a Chinese named entity recognition method incorporating multi-granularity information;
FIG. 3 is a schematic diagram of a method for expanding soft words;
FIG. 4 is a detailed view of the radical-level information of the character "热" (hot);
fig. 5 is a block diagram of radical level feature extraction.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As shown in fig. 1 to 5, the present invention provides a method for identifying a chinese named entity by fusing multi-granularity information, which specifically includes the following steps:
step 1: acquiring a field corpus data set, preprocessing the data set and dividing the data set into a training set, a testing set and a verification set;
the step 1 specifically comprises the following steps:
step 1.1: named entities in the sentence-level corpus data are identified and labeled as predefined entity types, such as: person name, place name, organization name, etc.;
step 1.2: dividing the labeled result into character-level corpus data using the BMESO tagging scheme, in the form: character + its entity-position tag + the predefined type to which it belongs;
step 1.3: the preprocessed data set is divided into a training set, a test set and a verification set at a ratio of 6:2:2.
The character sequence s in the preprocessed data set is as follows:

s = [c_1, c_2, c_3, …, c_n]

where c_i denotes the i-th character in the character sequence, i ∈ [1, n]; c_{i,j} denotes the word consisting of the i-th through j-th characters of the sequence, i, j ∈ [1, n] and i < j (some words may consist of a single character);
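In code, the BMESO tagging of step 1.2 can be sketched as follows (a minimal Python sketch; the example sentence, the span annotation and the type name ORG are hypothetical illustrations, not data from the patent):

```python
def bmeso_tags(sentence, entities):
    """Convert span-level entity annotations to per-character BMESO tags.

    entities: list of (start, end, type) spans, end exclusive.
    """
    tags = ["O"] * len(sentence)            # O = outside any entity
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S-{etype}"      # single-character entity
        else:
            tags[start] = f"B-{etype}"      # begin of entity
            for k in range(start + 1, end - 1):
                tags[k] = f"M-{etype}"      # middle of entity
            tags[end - 1] = f"E-{etype}"    # end of entity
    return tags

# e.g. "哈尔滨工程大学" annotated as one organization entity:
print(bmeso_tags("哈尔滨工程大学", [(0, 7, "ORG")]))
# ['B-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'E-ORG']
```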
step 2: extracting characters, soft words and radical pre-training vectors in the corpus data preprocessed in the step 1 for vector fusion and constructing a Chinese named entity recognition model fusing multi-granularity information;
step 2.1: for the characters in the sequence, character mapping is performed character by character using the pre-trained language model BERT-wwm, and each character c_i of the input sequence is encoded as a 768-dimensional embedded vector, as follows:

x_i^c = e^c(c_i)

where e^c is the embedded-vector lookup table for characters.
As shown in FIG. 2, for the input sequence, each character c_i is first processed with BERT-wwm, and its Token Embedding, Segment Embedding and Position Embedding are combined as the pre-training vector of the character.
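For concreteness, the character-vector extraction of step 2.1 can be sketched with the HuggingFace transformers library as follows. This is a minimal sketch under stated assumptions: the checkpoint name "hfl/chinese-bert-wwm" and the use of the final hidden states as the 768-dimensional character vectors are illustrative assumptions; the patent itself only specifies BERT-wwm and one 768-dimensional vector per character.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the patent only names "BERT-wwm".
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm")

sentence = "中国足球队"
# Chinese BERT tokenizes Chinese text character by character, matching the
# patent's one-character-one-vector mapping.
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state  # (1, n + 2, 768) incl. [CLS]/[SEP]
char_vectors = hidden[0, 1:-1]                 # one 768-dim vector per character
```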
Step 2.2: for candidate words corresponding to characters in the sequence: establishing a vocabulary lookup tree based on an external dictionary, matching candidate words corresponding to characters in a sentence, establishing an expanded soft word set, and then weighting the expanded soft word set corresponding to the characters by using a weight fusion strategy to obtain word-level vectors corresponding to the characters, wherein the method specifically comprises the following steps:
step 2.2.1: traversing an external dictionary, and constructing a vocabulary prefix search tree;
step 2.2.2: using the vocabulary search tree to match the candidate words in a sentence, and constructing an extended soft-word set for each character of the sequence according to the positions the character occupies in the candidate words (comprising the begin position B, the intermediate group (first intermediate position M1, second intermediate position M2, other intermediate positions Mo), the end position E, and the single-character-word position S);
as shown in fig. 3, for the input sequence "chinese football team", all potential words in the sequence are first found: "Chinese football", "Chinese football team", "football", "team". For the character "ball", the words containing it include "team", "football team", "ChinaFootball team, football, ball, and the like (considering the word formation of single character), the word of ball is in the initial position in the word of team, so that the word of team is added to the word set of Begin position in the extended soft word set of ball, and the word of ball is in the word of team1Extended soft word set M of location, hence "ball1The position word set is added with a word ' football team ' and the ' ball ' word is positioned in M in the word ' football team2Extended soft word set M of location, hence "ball2The position word set is added with a word ' Chinese football team ', and the word ' ball ' is positioned in M in the word ' Chinese football teamoExtended soft word set M of location, hence "balloThe position word set is added with a word "Chinese football team", and the above operation is executed to obtain an extended soft word set as shown in fig. 3.
Step 2.2.3: count the number of occurrences z(w) of each candidate word w in the corpus data (training set + test set) and at each position of the soft-word sets, and weight the candidate words of each position set to obtain the weighted word embedding v^w(W) of that position:

v^w(W) = (1/Z) Σ_{w ∈ W} z(w) e^w(w)

Z = Σ_{w ∈ B ∪ M1 ∪ M2 ∪ Mo ∪ E ∪ S} z(w)

where W denotes one of the soft-word sets "B, M1, M2, Mo, E, S", w denotes a candidate word in W, and e^w is the word embedded-vector lookup table.
Step 2.2.4: perform vector concatenation of the soft-word vectors of the character's different positions to obtain the soft-word-level vector z^w corresponding to the character. Specifically, z^w is calculated by the following formula:

z^w = v^w(B) ⊕ v^w(M1) ⊕ v^w(M2) ⊕ v^w(Mo) ⊕ v^w(E) ⊕ v^w(S)

where ⊕ denotes the concatenation operation of vectors and v^w(·) denotes the weighted word embedding vector of a soft-word position.
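A sketch of the weighting and concatenation in steps 2.2.3 and 2.2.4, following the formulas above: the corpus counts z and the word-embedding table word_vec are assumed to be available, and the embedding dimension is illustrative.

```python
import numpy as np

POSITIONS = ("B", "M1", "M2", "Mo", "E", "S")

def soft_word_vector(char_sets, z, word_vec, dim=50):
    """char_sets: the extended soft-word sets of one character;
    z: corpus counts z(w) over training + test sets;
    word_vec: pre-trained word embedding lookup table e^w."""
    # Z sums the counts of every matched word across all six position sets
    Z = sum(z[w] for key in POSITIONS for w in char_sets[key])
    parts = []
    for key in POSITIONS:
        v = np.zeros(dim)
        for w in char_sets[key]:
            v += z[w] * np.asarray(word_vec[w])   # z(w) * e^w(w)
        parts.append(v / Z if Z > 0 else v)
    return np.concatenate(parts)                  # z^w, 6 * dim dimensions
```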
Step 2.3: for the radical-level features corresponding to characters in the sequence: the method comprises the following steps of constructing a radical level feature lookup table for common Chinese characters, expressing features as pre-trained embedded vectors, and extracting the radical level feature embedded vectors by using a convolutional neural network, wherein the method specifically comprises the following steps:
step 2.3.1: build a radical-level feature lookup table e^r for common Chinese characters, whose radical-level features include: the simplified/traditional radical of the character, the structural composition of the character and the writing sequence of the character, in the form: character-radical-structural composition-writing-order sequence;
As shown in fig. 4, for the character "热" (hot), its radical is the fire radical "灬", its structural composition is "执" over "灬", and its writing sequence lists these components in writing order; these radical-level features reflect the meaning of the character.
Step 2.3.2: look up the pre-trained embedded-vector lookup table and represent each radical-level feature r_j, j ∈ [1, m], of the character c_i = {r_1, r_2, …, r_m} as an embedded vector of dimension d; the radical-level features corresponding to the character can then be represented as an embedded-vector matrix O:

O = [e^r(r_1); e^r(r_2); …; e^r(r_m)] ∈ R^(m×d)

where e^r(r_j) denotes the radical-level embedded-vector lookup table, obtained by training Word2vec on the Chinese Gigaword corpus.
Step 2.3.3: fix the dimension of the embedding matrix to k × d: for matrices longer than k, perform a truncation operation and keep the first k features; for matrices shorter than k, randomly initialize padding up to length k. After this step, the embedded-vector matrix becomes O′ ∈ R^(k×d).
step 2.3.4: as shown in fig. 5, x times of continuous one-dimensional convolution and maximum pooling operations are performed on the radical-level feature embedding matrix of fixed dimension, and a d-dimensional embedding vector representing the radical-level feature corresponding to the character is obtained.
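A PyTorch sketch of steps 2.3.2 to 2.3.4: the radical-level embedding matrix is padded or truncated to k rows and passed through x consecutive one-dimensional convolutions followed by max pooling, yielding one d-dimensional vector per character. The hyperparameter values d, k, x and the kernel size are placeholders; the patent does not fix them.

```python
import torch
import torch.nn as nn

class RadicalEncoder(nn.Module):
    def __init__(self, d=50, k=8, x=2):
        super().__init__()
        # x consecutive 1-D convolutions over the k radical-feature positions
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, d, kernel_size=3, padding=1) for _ in range(x)])

    def forward(self, feats):            # feats: (batch, k, d), padded/truncated
        h = feats.transpose(1, 2)        # (batch, d, k) as Conv1d expects
        for conv in self.convs:
            h = torch.relu(conv(h))
        return h.max(dim=2).values       # max pooling over positions -> (batch, d)
```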
Step 2.4: concatenate the character, soft-word and radical-level feature vectors in order, as in the following formula:

x = x^c ⊕ y^r ⊕ z^w

where x^c is the character embedded vector, y^r is the radical-level embedded vector, z^w is the soft-word-level embedded vector, and ⊕ denotes the concatenation operation of vectors; the concatenated result is a character embedding vector containing word-level information and radical-level information.
Step 2.5: perform a padding/truncation operation on each sentence in the data set so that it has a fixed length; specifically, for sentences longer than the prescribed length, discard the part beyond that length; for sentences shorter than the prescribed length, pad them up to the prescribed length;
Step 2.6: input the fixed-length sentences into the model in groups of size Batch_Size, each subsequence in a batch being one sentence;
step 2.7: perform forward LSTM encoding and backward LSTM encoding on the feature vectors in the batch at the hidden layer, and concatenate the forward and backward hidden vectors to obtain the bidirectional feature vectors of the data:

[i_t; o_t; f_t; c̃_t] = [σ; σ; σ; tanh](W [x_t; h_{t-1}] + b)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

h_t = o_t ⊙ tanh(c_t)

where σ is the element-wise sigmoid function, ⊙ denotes the element-wise product, and W and b are trainable parameters. The memory cell c can be regarded as long-term memory, and the hidden state h as short-term memory. The backward LSTM shares the same definition as the forward LSTM but models the sequence in reverse order. The forward and backward hidden states of step i are concatenated to form the context-dependent representation of c_i.
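Steps 2.5 to 2.7 can be sketched in PyTorch as follows; the sentence length, batch size, hidden size and fused dimension (768 for the character vector, 6 × 50 for the soft-word vector, 50 for the radical vector) are illustrative assumptions.

```python
import torch
import torch.nn as nn

max_len, fused_dim, hidden = 128, 768 + 6 * 50 + 50, 256

bilstm = nn.LSTM(input_size=fused_dim, hidden_size=hidden,
                 batch_first=True, bidirectional=True)

# A padded batch of Batch_Size = 32 fixed-length sentences of fused vectors.
batch = torch.zeros(32, max_len, fused_dim)
outputs, _ = bilstm(batch)   # (32, max_len, 2 * hidden): the forward and
                             # backward hidden vectors concatenated per character
```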
Step 3: inputting the data obtained in step 2 into the model for training, which specifically comprises the following steps:
step 3.1: performing iterative update calculation on the bidirectional feature vector in the hidden layer;
step 3.2: inputting the result into a CRF layer, iteratively updating the emission probability and the transition probability, and calculating a maximum score sequence;
s(X, y) = Σ_{i=1}^{n} P_{i, y_i} + Σ_{i=0}^{n} T_{y_i, y_{i+1}}

where P is the output of the BiLSTM, P_{i, y_i} representing the emission score of tag y_i at the i-th character, and the transition matrix T represents the transition probability from tag y_i to tag y_{i+1}.
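A minimal sketch of the score above: the emission matrix P is assumed to come from a linear layer over the BiLSTM output, T is the learned transition matrix, and the special start/end boundary transitions are omitted for brevity.

```python
import torch

def crf_score(P, T, tags):
    """P: (n, num_tags) emission scores; T: (num_tags, num_tags) transitions;
    tags: (n,) LongTensor of gold tag indices. Returns s(X, y)."""
    emit = P[torch.arange(len(tags)), tags].sum()   # sum of P_{i, y_i}
    trans = T[tags[:-1], tags[1:]].sum()            # sum of T_{y_i, y_{i+1}}
    return emit + trans
```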
Step 3.3: and updating and storing the parameters of the trained model.
Specifically, the model is trained using a negative log-likelihood loss function, and overfitting is mitigated using L2 regularization, as follows:
L = -Σ_{(X, y)} log p(y | X) + (λ/2) ‖θ‖²
where θ represents a parameter set and λ is a regularization parameter.
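Using the crf_score sketch above, the training objective can be sketched as follows; log_partition (the CRF forward-algorithm normalizer over all tag sequences) is assumed to be implemented elsewhere, and the L2 term follows the formula above.

```python
def nll_loss(P, T, tags, params, lam):
    """Negative log-likelihood of the gold sequence plus L2 regularization."""
    # -log p(y|X) = log Z(X) - s(X, y) for a linear-chain CRF
    nll = log_partition(P, T) - crf_score(P, T, tags)
    l2 = lam / 2 * sum((p ** 2).sum() for p in params)   # (λ/2) ||θ||²
    return nll + l2
```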
Step 4: process and calculate the data to be recognized with the Chinese named entity recognition model fusing multi-granularity information obtained in step 3 to obtain the named entity recognition result, which specifically comprises the following steps:
step 4.1: take the Chinese text sequence to be recognized, in units of characters, as the input of the model;
step 4.2: calculate and output the entity recognition result.
The invention provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the above Chinese named entity recognition method fusing multi-granularity information.
The invention provides a computer readable storage medium for storing computer instructions, which when executed by a processor implement the steps of the Chinese named entity recognition method fusing multi-granularity information.
The method, equipment and medium for Chinese named entity recognition fusing multi-granularity information provided by the invention are described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A Chinese named entity recognition method fusing multi-granularity information is characterized by comprising the following steps:
step 1: acquiring a field corpus data set, preprocessing the data set, and dividing the data set into a training set, a test set and a verification set;
step 2: extracting characters, soft words and radical pre-training vectors from the corpus data preprocessed in step 1 for vector fusion, and constructing a Chinese named entity recognition model fusing multi-granularity information;
step 3: inputting the data obtained in step 2 into the model for training;
step 4: processing and calculating the data to be recognized by using the Chinese named entity recognition model fusing multi-granularity information obtained in step 3 to obtain a named entity recognition result.
2. The method according to claim 1, characterized in that said step 1 comprises in particular the steps of:
step 1.1: identifying the named entities in sentence-level corpus data and marking them as predefined types, wherein the types comprise person names, place names and organization names;
step 1.2: dividing the labeled result into character-level corpus data using the BMESO tagging scheme, in the form: character + its entity-position tag + the predefined type to which it belongs;
step 1.3: and dividing the preprocessed data set into a training set, a testing set and a verification set according to a certain proportion.
3. The method according to claim 2, wherein step 2 comprises in particular the steps of:
step 2.1: for the characters in the sequence, performing character mapping on the character sequence one by using a pre-training language model, and encoding each character in the input sequence into a low-dimensional dense embedded vector;
step 2.2: for candidate words corresponding to characters in the sequence: establishing a vocabulary lookup tree based on an external dictionary, matching candidate words corresponding to characters in a sentence, establishing an expanded soft word set, and weighting the expanded soft word set corresponding to the characters by using a weight fusion strategy to obtain word-level vectors corresponding to the characters;
step 2.3: for the radical level features corresponding to characters in the sequence: constructing a radical level feature lookup table for common Chinese characters, expressing features as pre-trained embedded vectors, and extracting radical level feature embedded vectors by using a convolutional neural network;
step 2.4: sequentially splicing characters, soft words and radical level feature vectors;
step 2.5: performing a padding/truncation operation on each sentence in the data set so that it has a fixed length: for sentences longer than the prescribed length, discarding the part beyond that length; for sentences shorter than the prescribed length, padding them up to the prescribed length;
step 2.6: inputting the fixed-length sentences into the model in groups of size Batch_Size, each subsequence in a batch being one sentence;
step 2.7: performing forward LSTM encoding and backward LSTM encoding on the feature vectors in the batch at the hidden layer, and concatenating the forward and backward hidden vectors to obtain the bidirectional feature vectors of the data.
4. The method according to claim 3, characterized in that said step 2.2 comprises in particular the steps of:
step 2.2.1: traversing an external dictionary, and constructing a vocabulary prefix search tree;
step 2.2.2: using a vocabulary search tree to match candidate words in a sentence, and constructing a soft word set for the characters according to the positions of the characters in the candidate words;
step 2.2.3: counting the total times of the candidate words appearing in the corpus data and the times of the candidate words appearing in each position of the soft word set to obtain the weight of the candidate words in each position of the soft word set;
step 2.2.4: and weighting the candidate words at all positions corresponding to the characters, and splicing soft word level vectors.
5. The method according to claim 4, characterized in that said step 2.3 comprises in particular the steps of:
step 2.3.1: a radical level characteristic lookup table is constructed for common Chinese characters, and the radical level characteristics of the lookup table comprise: the simplified/traditional radicals of characters, the structural composition of characters and the writing sequence of characters are in the form: character-radical-construct composition-writing order sequence;
step 2.3.2: searching the pre-trained embedded vector lookup table, and representing each radical-level feature corresponding to the character as an embedded vector of dimension d, whereupon the radical-level features corresponding to the character are represented as an embedded vector matrix;
step 2.3.3: fixing the dimension of the embedding matrix to k × d: for matrices longer than k, performing a truncation operation and keeping the first k features; for matrices shorter than k, randomly initializing padding up to length k;
step 2.3.4: and carrying out x times of continuous one-dimensional convolution on the radical level feature embedding matrix with fixed dimensionality and carrying out maximum pooling operation to obtain a d-dimensional embedding vector to represent the radical level features corresponding to the characters.
6. The method according to claim 5, wherein the step 3 comprises the following steps:
step 3.1: performing iterative update calculation on the bidirectional feature vector in the hidden layer;
step 3.2: inputting the result into a CRF layer, iteratively updating the emission probability and the transition probability, and calculating a maximum score sequence;
step 3.3: and updating and storing the parameters of the trained model.
7. The method according to claim 6, wherein the step 4 comprises the steps of:
step 4.1: taking the Chinese text sequence to be recognized, in units of characters, as the input of the model;
step 4.2: calculating and outputting the entity recognition result.
8. The method of claim 4, wherein the corpus data over which the total number of occurrences of the candidate words is counted refers to the training set plus the test set.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1-8 when executing the computer program.
10. A computer-readable storage medium storing computer instructions, which when executed by a processor implement the steps of the method of any one of claims 1 to 8.
CN202210277553.6A 2022-03-21 2022-03-21 Chinese named entity recognition method, equipment and medium fusing multi-granularity information Pending CN114781380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210277553.6A CN114781380A (en) 2022-03-21 2022-03-21 Chinese named entity recognition method, equipment and medium fusing multi-granularity information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210277553.6A CN114781380A (en) 2022-03-21 2022-03-21 Chinese named entity recognition method, equipment and medium fusing multi-granularity information

Publications (1)

Publication Number Publication Date
CN114781380A (en)

Family

ID=82426218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210277553.6A Pending CN114781380A (en) 2022-03-21 2022-03-21 Chinese named entity recognition method, equipment and medium fusing multi-granularity information

Country Status (1)

Country Link
CN (1) CN114781380A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116579343A (en) * 2023-05-17 2023-08-11 成都信息工程大学 Named entity identification method for Chinese text travel class
CN116579343B (en) * 2023-05-17 2024-06-04 成都信息工程大学 Named entity identification method for Chinese text travel class
CN116629267A (en) * 2023-07-21 2023-08-22 云筑信息科技(成都)有限公司 Named entity identification method based on multiple granularities
CN116629267B (en) * 2023-07-21 2023-12-08 云筑信息科技(成都)有限公司 Named entity identification method based on multiple granularities
CN117744787A (en) * 2024-02-20 2024-03-22 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality
CN117744787B (en) * 2024-02-20 2024-05-07 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality

Similar Documents

Publication Publication Date Title
CN111783462B (en) Chinese named entity recognition model and method based on double neural network fusion
CN106980683B (en) Blog text abstract generating method based on deep learning
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN112711948B (en) Named entity recognition method and device for Chinese sentences
CN110263325B (en) Chinese word segmentation system
CN114781380A (en) Chinese named entity recognition method, equipment and medium fusing multi-granularity information
CN114003698B (en) Text retrieval method, system, equipment and storage medium
CN112541356B (en) Method and system for recognizing biomedical named entities
CN110866098B (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN114676234A (en) Model training method and related equipment
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN110472062A (en) The method and device of identification name entity
CN112464669A (en) Stock entity word disambiguation method, computer device and storage medium
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN112784603A (en) Patent efficacy phrase identification method
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114510946A (en) Chinese named entity recognition method and system based on deep neural network
CN117291265B (en) Knowledge graph construction method based on text big data
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN116720519B (en) Seedling medicine named entity identification method
CN114626378B (en) Named entity recognition method, named entity recognition device, electronic equipment and computer readable storage medium
CN117371447A (en) Named entity recognition model training method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination