CN115146644A - Multi-feature fusion named entity identification method for warning situation text - Google Patents
- Publication number
- CN115146644A (application CN202211063791.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- character
- features
- alarm
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Character Discrimination (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the technical field of named entity recognition in natural language processing, and in particular to a multi-feature fusion named entity recognition method oriented to warning situation text. First, a data set for warning situation named entity recognition is constructed, the entity types to be recognized are defined, and the data are divided into a training set, a validation set, and a test set. Second, character features of the text are obtained from pre-trained word vectors, pre-recognized label features are obtained by matching the text against rules and a dictionary, and pinyin features are obtained by converting the text to pinyin. Finally, the three features are fused and fed into a bidirectional long short-term memory network-conditional random field model for named entity recognition. The invention constructs a multi-feature fusion Chinese named entity recognition method that effectively represents the ambiguity of characters by fusing text character features, pre-recognized label features, and pinyin features, and improves the precision, recall, and comprehensive evaluation index F1 of warning situation text named entity recognition.
Description
Technical Field
The invention relates to the technical field of named entity recognition of natural language processing, in particular to a multi-feature fusion named entity recognition method for an alarm situation text.
Background
Named entity recognition is one of the subtasks of information extraction; its main purpose is to recognize entities with specific meanings in text, such as person names, place names, organization names, and proper nouns, and to classify them. It is an upstream task for natural language processing tasks such as text understanding, machine translation, and question answering systems. The core of warning situation named entity recognition is to effectively recognize the various elements of an alert, such as the place where a case occurred and the means by which it was committed, providing strong data support for downstream alert analysis tasks; it is key to tasks such as constructing alert knowledge graphs, series-parallel case analysis, anomaly early warning, and high-incidence early warning. How to accurately identify and extract effective named entities from warning situation text has become one of the fundamental and critical jobs in the field of alert analysis and mining.
Although there are many mature models and algorithms to solve the named entity recognition problem, for example, CN111460820B discloses a network space security domain named entity recognition method and apparatus based on a pre-trained model BERT. The method comprises the steps that an input text in the network space security field is subjected to word segmentation preprocessing by using a word segmentation device WordPiece of a BERT model, all tokens obtained through word segmentation preprocessing are loaded into the BERT model to be trained, output vector representation is obtained, then the output vector representation is sent to a Highway network and a classifier, dimensions represented by the vectors of the tokens are mapped to dimensions consistent with the number of labels, and final vector representation of the tokens is obtained; and finally, calculating loss by using a cross entropy loss function only by using the first token of each word, and reversely propagating the loss to update the model parameters to obtain the trained named entity recognition model for recognizing the named entity in the security field.
CN111460824B discloses a label-free named entity identification method based on anti-migration learning, to construct a label-free named entity identification model, first, inputting a text of a source field or a target field and mapping the text into a word embedding vector; inputting the word embedding vector into a bidirectional long-short term memory network to extract a characteristic vector; inputting the characteristic vector into a countermeasure discriminator, and mapping the data of the source field and the data of the target field to the same data distribution space; inputting the feature vector into a conditional random field, calculating the probability of all possible label sequences of the input text, and selecting a label with the maximum probability as a final predicted label; obtaining the optimal model parameters by jointly training the named entity recognition task and the confrontation training task; inputting data of a target field, and outputting a prediction label through a CRF (conditional random field) layer.
As mentioned above, although mature models and algorithms exist for named entity recognition, general-purpose models and algorithms perform poorly on named entity recognition in specific fields such as alert analysis, and warning situation named entity recognition faces several challenges compared with research in the general field. First, the warning situation field currently lacks high-quality labeled data, so warning situation named entities must be customized for the field and a professional data set must be labeled for their recognition. Second, the sentence patterns and grammar of warning situation texts often have certain fixed characteristics, but existing named entity recognition pursues end-to-end neural network models, ignores domain knowledge and characteristics, separates the entity recognition model from domain rules and dictionaries, and does not effectively fuse external knowledge with the model. Third, warning situation text is report information recorded by the alert-receiving officer according to the caller's description; it is usually non-standard and highly colloquial, and may contain errors such as wrongly written characters and homophone errors (e.g. in place names and person names) caused by pronunciation, while current named entity recognition models do not consider the pinyin characteristics of the text.
Disclosure of Invention
Aiming at the problems that high-quality labeling data and an effective named entity recognition method are lacked in the field of warning situations at present, the invention provides a warning situation text-oriented multi-feature fusion named entity recognition method, wherein text character features, label features and pinyin features which are pre-recognized through rules and dictionaries are fused to jointly act on a warning situation named entity recognition model, and the warning situation named entity recognition model is used on a self-built warning situation text data set, so that the accuracy rate, the recall rate and the comprehensive evaluation index F1 value of the warning situation named entity recognition are effectively improved.
The specific technical scheme of the invention is as follows:
a multi-feature fusion named entity recognition method for an alert text comprises the following steps:
step S1: extracting the alarm text information, classifying according to the alarm cases, and extracting corresponding types of alarm texts from an alarm database;
step S2: constructing a named entity identification data set of an alarm situation text, and dividing the named entity identification data set into a training set, a verification set and a test set;
step S3: character feature extraction, namely, for each piece of data in the entity recognition data set, obtaining the character vector corresponding to each character as the character feature;
step S4: label feature extraction, namely defining respective rules and dictionaries for the various entities, performing string matching with the rules and dictionaries to recognize the warning situation text, vectorizing the obtained recognition labels, and taking the vectorized representation of the labels as the label features;
step S5: pinyin feature extraction, namely obtaining the pinyin of each character in the warning situation text and vectorizing it, the vectorized representation of the pinyin being taken as the pinyin feature of the character;
step S6: multi-feature fusion, namely fusing three feature vectors of character features, label features and pinyin features; the multi-feature fusion adopts one of direct splicing fusion, additive fusion or splicing fusion after feature extraction;
step S7: model training, namely constructing a multi-feature fusion named entity recognition model, extracting character features, label features and pinyin features from training set data, inputting the extracted character features, label features and pinyin features into a bidirectional long-short term memory network, and capturing constraints and dependency relations among labels by using a conditional random field;
step S8: model testing, namely sending test set data to the multi-feature fusion named entity recognition model to obtain predicted labels, comparing the predicted labels with the actual labels, counting the numbers of correct and incorrect detections in the test samples, and computing the recognition precision P, recall R, and comprehensive evaluation index F1, where P = (correctly predicted entities) / (entities predicted by the model), R = (correctly predicted entities) / (entities in the data set), and the comprehensive evaluation index is F1 = 2 × P × R / (P + R);
step S9: named entity recognition is performed on the remaining unlabeled warning situation texts in the alert database using the multi-feature fusion named entity recognition model trained in step S7.
In step S2, the specific steps of constructing the alert naming entity identification data set are as follows:
step S201: data cleaning, namely cleaning the data of the alarm condition text, and removing abnormal symbols, messy codes and repeated data;
step S202: entity definition, namely self-defining 6 types of entities, including names of alarm persons, addresses of alarm cases, case-related articles, case properties, amount of case-related property and license plate numbers of case-related objects;
step S203: entity labeling, namely manually labeling the warning situation text according to the 6 self-defined entity types, adopting the BIOES tagging standard at character granularity;
step S204: data division, namely dividing the labeled data into a training set, a validation set, and a test set in a certain proportion.
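The BIOES tagging of step S203 can be sketched at character granularity as follows; the function and the example spans are illustrative assumptions, not taken from the patent:

```python
def bioes_tags(sentence, spans):
    """Assign BIOES tags at character granularity.
    spans: list of (start, end, entity_type) with end exclusive."""
    tags = ["O"] * len(sentence)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"          # single-character entity
        else:
            tags[start] = f"B-{etype}"          # beginning of entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # inside of entity
            tags[end - 1] = f"E-{etype}"        # end of entity
    return tags
```

For example, `bioes_tags("张三在北京被盗", [(0, 2, "PER"), (3, 5, "LOC")])` tags the person name and the address while every unannotated character stays "O".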
In step S3, the specific steps of the character feature extraction process are as follows:
step S301: training is performed on a Baidu encyclopedia corpus with the text vectorization tool Word2Vec to obtain pre-trained word vectors, wherein the word vectors comprise character vectors and word vectors;
step S302: for each piece of data in the named entity identification data set constructed in the step S2, sequence definition is carried out on characters of each piece of data to obtain a text sequence S of each piece of data;
step S303: for each character in the text sequence S, a character vector corresponding to each character is determined as a character feature according to the word vector pre-trained in step S301.
In step S4, generating a pre-identified tag feature, specifically including the following steps:
step S401: defining different rules or dictionaries according to different entities, and preliminarily identifying the content of the warning text;
step S402: alarmer entity recognition, namely labeling the text before the character "报" ("report") in the warning situation text as the alarmer entity;
step S403: address entity recognition, namely constructing an ending-word dictionary and, through string matching, labeling the words after the address trigger character and before the ending word in the warning situation text as the address entity;
step S404: identifying the entity of the involved case amount, and identifying the involved case amount in the warning situation text by using a regular expression;
step S405: identifying a vehicle license plate entity, namely identifying a license plate number entity in the warning text by using a regular expression;
step S406: case property entity recognition, namely dividing case types into 5 levels through analysis and statistics of the warning situation text, with 354 case types in total; different types of cases are described by different vocabularies, so a case property dictionary is constructed to match case property entities;
step S407: identifying the entity of the involved articles, constructing a dictionary of the involved articles, and identifying by adopting character string matching;
step S408: label representation, namely representing the first character of each identified entity by a "B-" label, the middle characters by "I-" labels, and the last character by an "E-" label; the remaining characters not matched by any rule are represented by the label "O";
step S409: after the above rule and dictionary recognition, each character of the text sequence S has a corresponding label, yielding a label sequence L;
step S410: vectorized representation, namely constructing a label-embedding lookup table and vectorizing the recognized labels; the vectorized representation of the label is taken as the label feature.
In step S5, the obtaining of the pinyin features of the characters includes the following steps:
step S501: obtaining pinyin representations of different characters, wherein the tone of each character is placed behind the pinyin of the character;
step S502: vectorization expression, namely constructing a pinyin query table, vectorizing the pinyin, and taking the vectorization expression of the pinyin as the pinyin characteristics of the characters.
For the multi-feature fusion described in step S6, the steps are as follows:
step S601: the character feature, label feature, and pinyin feature of each character are fused; three fusion modes are designed as follows:
1) Direct concatenation fusion: the feature vectors of the three forms are concatenated directly to form the final vector;
2) Additive fusion: the label feature and the pinyin feature are given the same dimension as the character embedding, and the values at the corresponding positions of each vector are added directly to form the final vector;
3) Concatenation fusion after feature extraction: the three features are extracted by separate bidirectional long short-term memory networks, and the extracted features are concatenated to form the final vector.
In step S7, the multi-feature fusion vector obtained in step S6 is fed into the BiLSTM-CRF model for training, which may be specifically expressed as:
step S701: updating a forgetting gate, an input gate and an output gate of the long-short term memory network, and updating a state unit according to the updated forgetting gate, the input gate and the output gate to generate a next implicit vector state;
step S702: a BiLSTM model is adopted to obtain the context vector of each character, processing the input sequence in both the forward and backward directions;
step S703: constructing a conditional random field, and capturing the dependency relationship among the labels to obtain a predicted label sequence;
step S704: the model is tuned to its current optimum on the validation data set, and the current best model is saved.
In step S8, model testing is carried out, test set data are sent to a multi-feature fusion named entity recognition model to obtain a prediction tag, the prediction tag is compared with an actual tag, the number of correct and false detections in a test sample is calculated, and the recognition accuracy P, the recall ratio R and the comprehensive evaluation index F1 value are obtained, and the specific steps are as follows:
step S801: sending the test set data to a trained multi-feature fusion model, and obtaining an identified label for each input sequence;
step S802: comparing the labels predicted by the model with the real labels, and counting the correct entity number predicted by the model, the total entity number predicted by the model and the total entity number in the data set;
step S803: and calculating the detection accuracy P, the recall rate R and the comprehensive evaluation index F1 value.
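As a sketch of step S803, the three counts of step S802 determine P, R, and F1 directly (the function name is hypothetical):

```python
def evaluate(pred_correct, pred_total, gold_total):
    """Compute precision P, recall R, and F1 from entity counts.
    pred_correct: entities the model predicted correctly
    pred_total:   entities the model predicted in total
    gold_total:   entities present in the data set"""
    p = pred_correct / pred_total if pred_total else 0.0
    r = pred_correct / gold_total if gold_total else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For instance, 80 correct predictions out of 100 predicted against 90 gold entities gives P = 0.8 and R ≈ 0.889.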
Compared with the prior art, the invention has the following beneficial effects:
aiming at the lack of high-quality labeled data in the warning situation field, the invention defines 6 types of warning situation named entities, constructs a corpus for warning situation named entity recognition through manual labeling, and designs a multi-feature fusion UMF-BiLSTM-CRF model for warning situation named entity recognition, effectively improving recognition accuracy; the comprehensive evaluation index F1 reaches up to 93.3%, 1.56 percentage points higher than existing models.
The invention uses character vectors pre-trained on the Baidu encyclopedia corpus as character features; compared with training directly on self-labeled domain data, this improves the warm-start capability of the network, accelerates model training, and further improves the overall recognition performance of the model.
The invention fully exploits the characteristics of the warning situation text: corresponding entities are pre-recognized by defining different rules and dictionaries, the recognized labels are used as features, and pinyin features are added to enrich the representation of each character. Three feature fusion modes are provided to fully fuse the character, label, and pinyin features. Experimental results verify the effectiveness of the method, which can be used to recognize named entities in unlabeled warning situation text.
Drawings
In order to clearly illustrate the embodiments of the present invention and the technical solutions in the prior art, the accompanying drawings are briefly described below; they are used for illustration only and should not be construed as limiting the present invention.
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a direct splicing fusion method.
FIG. 3 is a schematic diagram of an additive fusion approach.
FIG. 4 is a schematic diagram of concatenation fusion after feature extraction by bidirectional long short-term memory networks.
FIG. 5 is a diagram of a named entity recognition model with multi-feature fusion.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the present embodiment, unless otherwise specified, a sequence denotes an ordered collection of objects; for example, in a text sequence S = {c_1, c_2, …, c_m}, c_1 denotes the first object in the sequence, c_2 the second, and so on; throughout this document, c_i denotes the i-th object in a sequence. The same applies below.
The invention provides a multi-feature fusion named entity recognition method for warning situation text, which fuses label features pre-recognized by rules and a dictionary together with pinyin features; a flow diagram is shown in FIG. 1, and the method comprises the following steps:
step S1: warning situation text extraction, wherein the classification of alert cases is a predefined hierarchical structure with at most 5 levels from coarse to fine, finer case types being divided under each case type according to business requirements; warning situation texts covering all case types of the first 2 levels are randomly extracted from the alert database.
Step S2: and constructing a named entity recognition data set of the alarm situation text, and dividing the named entity recognition data set into a training set, a verification set and a test set.
In an optional embodiment of the present invention, the step S2 of constructing the alarm named entity identification data set specifically includes the following sub-steps:
step S201: data cleaning, namely simply cleaning the alarm text to remove abnormal symbols, messy codes and repeated data;
step S202: entity definition, namely self-defining 6 types of entities, including alarmer name (PER), case address (LOC), involved article (PRO), case property (CT), involved-money amount (MON), and involved-vehicle license plate number (CAR); the labeling specification of the warning situation named entities is shown in Table 1;
TABLE 1: Labeling specifications for alert named entities
step S203: entity labeling, namely manually labeling the warning situation text according to the 6 customized entity types, adopting the BIOES tagging standard at character granularity.
step S204: data division, namely dividing the labeled data into a training set (2,000 samples), a validation set (295 samples), and a test set (277 samples), approximately an 8:1:1 split.
step S3: character feature extraction, namely vectorizing the characters in the warning situation text to be used as character features.
In an optional embodiment of the present invention, the step S3 obtains the character features, and the basic process may be summarized as follows: firstly, aiming at each sentence in the warning situation text, extracting characters of the sentence and vectorizing the characters, wherein each vector is a multidimensional digital expression, and the vectorized characters are extracted as character features, and the method specifically comprises the following steps:
step S301: in this embodiment, the text vectorization tool Word2Vec is used to train on a Baidu encyclopedia corpus to obtain pre-trained word vectors, wherein the word vectors comprise character vectors and word vectors;
step S302: obtaining the dimensionality, the embedding matrix, the dictionary size, a dictionary-index table and an index-dictionary table of the word vector according to pre-training;
step S303: for each piece of data in the data set constructed in step S2, define a text sequence S = {c_1, c_2, …, c_m}, where c_i denotes the i-th character in the sequence and m is the length of the sentence;
step S304: obtain the character vector e_i^c corresponding to each character from the dictionary-index table and the embedding matrix, and use it as the character feature.
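Steps S303-S304 amount to an embedding lookup; below is a minimal dependency-free sketch in which a toy vocabulary and 2-dimensional vectors stand in for the pre-trained Word2Vec table (all names and values hypothetical):

```python
# Hypothetical miniature dictionary-index table and embedding matrix standing in
# for the Word2Vec vectors pre-trained on the Baidu encyclopedia corpus.
char2idx = {"<UNK>": 0, "我": 1, "报": 2, "警": 3}            # dictionary-index table
embedding = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one vector per entry

def char_features(text):
    """Map each character c_i of a text sequence S to its vector e_i^c."""
    return [embedding[char2idx.get(ch, char2idx["<UNK>"])] for ch in text]
```

Out-of-vocabulary characters fall back to the `<UNK>` vector, a common design choice when the pre-trained vocabulary does not cover the domain text.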
step S4: label feature extraction, namely defining different rules and dictionaries for different entities, performing string matching with the rules and dictionaries to preliminarily recognize the warning situation text, and taking the obtained recognition labels as the label features.
In an optional embodiment of the present invention, the step S4 of generating the pre-identified tag feature specifically includes the following sub-steps:
step S401: defining different rules or dictionaries according to different entities, and preliminarily identifying the content of the warning text;
step S402: alarmer entity recognition: the alarmer entity follows the principle that the text before the character "报" ("report") is the alarmer entity, so the text before "报" is labeled as the alarmer entity;
step S403: address entity recognition: an address entity begins after a fixed trigger character and usually ends with certain words describing an address, so an ending-word dictionary can be constructed and, through string matching, the words after the trigger character and before the ending word are labeled as the address entity;
step S404: involved-money entity recognition: the involved-money entity usually begins with digits and ends with the character "元" ("yuan"), so it can be preliminarily recognized with a regular expression;
step S405: involved-vehicle license plate entity recognition: a license plate number combines a province abbreviation with 6 alphanumeric characters, so the license plate number entity in the warning situation text can be recognized with a regular expression;
step S406: case property entity recognition: through analysis and statistics of the warning situation text, case types are divided into 5 levels, 354 case types in total; cases of different types are described by different vocabularies, such as "stolen", "robbed", and "dispute", so a case property dictionary can be constructed for matching;
step S407: involved-article entity recognition: a dictionary of common involved articles is constructed, and recognition is performed by string matching;
step S408: label representation: for each identified entity, the first character is represented by a "B-" label, the middle characters by "I-" labels, and the last character by an "E-" label; the remaining characters not matched by any rule are represented by the label "O";
step S409: after the above rule and dictionary recognition, each character of the text sequence S = {c_1, c_2, …, c_m} has a corresponding label, yielding a label sequence L = {label_1, label_2, …, label_m}, where label_i is the label of the i-th character;
step S410: vectorized representation: a label-embedding lookup table e_label is constructed and the recognized labels are vectorized; for the i-th character c_i, the corresponding label is vectorized as e_i^t = e_label(label_i), and the label vector e_i^t is taken as the label feature of the character.
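The regular-expression pre-recognition of steps S404-S405 and the B/I/E/O labeling of step S408 can be sketched as follows; the exact patterns are illustrative assumptions, since the patent does not publish its rules:

```python
import re

# Hedged example patterns: amounts are digits ending with 元, plates are a
# province-abbreviation character plus a letter plus five alphanumerics.
MONEY = re.compile(r"\d+(?:\.\d+)?元")
PLATE = re.compile(r"[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁琼][A-Z][A-Z0-9]{5}")

def pre_label(text, pattern, etype):
    """Produce B/I/E labels for every regex match, 'O' elsewhere."""
    tags = ["O"] * len(text)
    for m in pattern.finditer(text):
        s, e = m.span()
        tags[s] = f"B-{etype}"
        for i in range(s + 1, e - 1):
            tags[i] = f"I-{etype}"
        tags[e - 1] = f"E-{etype}"
    return tags
```

Running `pre_label("被盗5000元", MONEY, "MON")` labels the span "5000元" as a money entity while the surrounding characters stay "O".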
Step S5: pinyin feature extraction, namely taking the pinyin representation of each character as the pinyin feature of the character;
in an optional embodiment of the present invention, the obtaining of the pinyin characteristics of the character in the step S5 specifically includes the following sub-steps:
step S501: the pinyin representation of each character is obtained with the pypinyin package for Python, with the tone of each character placed after its pinyin; for example, the pinyin of the character "风" ("wind") is represented as "feng1";
step S502: vectorized representation: a pinyin lookup table e_pinyin is constructed and the pinyin is vectorized; for the character "风" with pinyin "feng1", the vectorized pinyin is e_i^p = e_pinyin("feng1"), and the pinyin vector e_i^p is taken as the pinyin feature of the character.
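A minimal sketch of steps S501-S502; the tiny tables below are hypothetical stand-ins for pypinyin's toned output and for the learned pinyin-embedding lookup table e_pinyin:

```python
# Stand-ins: in the patent, pypinyin produces the toned syllable (e.g. 风 -> "feng1",
# tone digit after the syllable) and a lookup table e_pinyin vectorizes it.
char2pinyin = {"风": "feng1", "报": "bao4", "警": "jing3"}    # pypinyin stand-in
pinyin_embed = {"feng1": [0.1, 0.9], "bao4": [0.2, 0.8],
                "jing3": [0.3, 0.7], "<UNK>": [0.0, 0.0]}    # e_pinyin stand-in

def pinyin_features(text):
    """Map each character to the vector of its toned pinyin."""
    return [pinyin_embed.get(char2pinyin.get(ch, "<UNK>"), pinyin_embed["<UNK>"])
            for ch in text]
```

Because homophones share a pinyin syllable, characters mis-typed by sound still map to the same pinyin vector, which is the motivation the patent gives for this feature.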
Step S6: multi-feature fusion, namely fusing three feature vectors of character features, label features and pinyin features;
in an optional embodiment of the present invention, the step S6 performs multi-feature fusion, and proposes three fusion manners, and schematic diagrams can be seen in fig. 2 to 4, and specifically includes the following sub-steps:
step S601: let e_i^c, e_i^t, and e_i^p denote the character feature, label feature, and pinyin feature of character c_i, respectively; the three features are fused, and three fusion modes are designed:
1) Concat direct concatenation fusion: the feature vectors of the three forms are concatenated directly, so for character c_i the final fused vector is x_i = [e_i^c ; e_i^t ; e_i^p], where [ ; ] denotes concatenation; if the character embedding dimension is c, the label embedding dimension is t, and the pinyin embedding dimension is p, the dimension of the final embedded vector is c+t+p;
2) Add additive fusion: direct addition with Add requires the dimensions to be consistent, so the label and pinyin features must have the same dimension as the character embedding; the values at the corresponding positions of each vector are then added directly, and the resulting vector is x_i = e_i^c + e_i^t + e_i^p;
3) Concatenation fusion after LSTM feature extraction: the three features are extracted by separate bidirectional long short-term memory networks, and the extracted features are concatenated to form the final vector x_i = [BiLSTM(e_i^c) ; BiLSTM(e_i^t) ; BiLSTM(e_i^p)].
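The first two fusion modes reduce to simple per-character vector operations; a sketch with plain Python lists (mode 3 additionally passes each feature through its own BiLSTM before concatenation and is omitted here):

```python
def concat_fuse(ec, et, ep):
    """Mode 1: direct concatenation; result dimension is c + t + p."""
    return ec + et + ep

def add_fuse(ec, et, ep):
    """Mode 2: element-wise addition; all three dimensions must match."""
    assert len(ec) == len(et) == len(ep), "Add fusion requires equal dimensions"
    return [a + b + c for a, b, c in zip(ec, et, ep)]
```

Concatenation preserves each feature's information at the cost of a larger input dimension, while addition keeps the dimension fixed but forces the three embeddings into a shared space.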
Step S7: model training, namely constructing a multi-feature fusion named entity recognition model, the UMF-BiLSTM-CRF model, where UMF stands for Unified Multi-Feature; the three features extracted from the training set data are input into a bidirectional long short-term memory network (BiLSTM), and a conditional random field (CRF) is used to capture the constraints and dependencies among labels;
in an optional embodiment of the present invention, the multi-feature fusion vector obtained in step S7 is fed into a BiLSTM-CRF model for training, and a model schematic diagram can be shown in fig. 5, which specifically includes the following sub-steps:
step S701: update the gate structures of the long short-term memory network (LSTM) according to the following concrete formulas:
(1) update the forget gate of the LSTM:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
wherein f_t is the forget gate of the LSTM, σ is the Sigmoid function, W_f is the weight matrix of the forget gate, x_t is the input vector, h_{t-1} is the hidden state of the LSTM at the time before the current time t, and b_f is the bias term of the forget gate;
(2) update the input gate of the LSTM:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
wherein i_t is the input gate of the LSTM, W_i is the weight matrix of the input gate, and b_i is the bias term of the input gate;
(3) update the output gate of the LSTM:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
wherein o_t is the output gate of the LSTM, W_o is the weight matrix of the output gate, and b_o is the bias term of the output gate;
(4) update the cell state:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(W_C · [h_{t-1}, x_t] + b_C)
wherein f_t is the forget gate of the LSTM, C_{t-1} is the cell state at the previous time, W_C is the weight matrix of the cell state, b_C is the bias term of the cell state, and ⊙ denotes element-wise multiplication;
(5) generate the next hidden state:
h_t = o_t ⊙ tanh(C_t)
wherein h_t is the hidden state of the LSTM at the current time t, C_t is the cell state, and tanh is the hyperbolic tangent function;
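Equations (1) to (5) can be sketched as a single LSTM time step. This is a NumPy illustration with randomly initialized toy weights, not the trained model; for brevity the four gate pre-activations are stacked into one matrix W.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following equations (1)-(5).

    W maps the concatenation [h_prev; x_t] to the four stacked gate
    pre-activations (forget, input, output, candidate); b is the stacked bias.
    """
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:d])               # (1) forget gate
    i_t = sigmoid(z[d:2 * d])           # (2) input gate
    o_t = sigmoid(z[2 * d:3 * d])       # (3) output gate
    c_tilde = np.tanh(z[3 * d:4 * d])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # (4) cell state update
    h_t = o_t * np.tanh(c_t)            # (5) new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
d_in, d_h = 6, 4                        # arbitrary toy dimensions
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h_t, c_t = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```

Because the output gate and tanh both bound their outputs, every component of h_t lies strictly inside (−1, 1).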
step S702: the present embodiment uses the BiLSTM model to obtain a context vector for each character. BiLSTM, a bidirectional LSTM model, processes the input sequence in both the forward and backward directions. For a given sentence, let h_t^f denote the forward output at time t and h_t^b the backward output; at time t the hidden state of the BiLSTM, taken as the output, is expressed as:
h_t = [h_t^f ; h_t^b]
step S703: construct a conditional random field and capture the dependencies among labels to obtain the predicted label sequence.
(1) Compute the emission matrix:
O = H·W + b
wherein O is the emission matrix, namely the predicted score of each label class for each character, and H is the state sequence output by the BiLSTM (W and b being the parameters of the linear layer that maps hidden states to label scores).
(2) Compute the score of a label sequence y for a sentence S and its conditional probability:
score(S, y) = Σ_{i=1..m} O_{i, y_i} + Σ_{i=2..m} T_{y_{i-1}, y_i}
P(y | S) = exp(score(S, y)) / Σ_{y' ∈ Y_S} exp(score(S, y'))
wherein O_{i, y_i} denotes the score of the i-th character being predicted as label y_i, T is the transition probability matrix representing the transition probabilities between label categories, T_{y_{i-1}, y_i} denotes the probability of transferring from label y_{i-1} to label y_i, and Y_S represents all possible label sequences.
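The sequence score can be illustrated on a toy emission matrix O and transition matrix T; all values and label indices below are made up for demonstration, not taken from the invention.

```python
import numpy as np

def sequence_score(O, T, y):
    """score(S, y): sum of emission scores O[i, y_i] plus transition
    scores T[y_{i-1}, y_i], as in formula (2)."""
    emit = sum(O[i, tag] for i, tag in enumerate(y))
    trans = sum(T[y[i - 1], y[i]] for i in range(1, len(y)))
    return emit + trans

# Toy setup: 3 characters, 2 label classes
O = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])    # emission matrix from the BiLSTM
T = np.array([[0.5, -0.5],
              [0.0, 1.0]])    # transition matrix between labels
print(sequence_score(O, T, [0, 1, 1]))   # → 4.0
```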
(3) The loss function L during training is:
L = −log P(y | S)
wherein S is the sentence sequence, y is the label sequence corresponding to the sentence, and P(y | S) denotes the probability that sentence S is labeled with the label sequence y. During training, the model is trained by minimizing this sentence-level negative log-likelihood.
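The sentence-level negative log-likelihood can be sketched with the forward algorithm, which computes the log-partition term over all possible label sequences without enumerating them. The matrices below are again illustrative toys, not the invention's parameters.

```python
import numpy as np

def log_partition(O, T):
    """log Σ_{y'} exp(score(S, y')) via the forward algorithm."""
    alpha = O[0].copy()                 # log-scores of paths ending at each label
    for i in range(1, len(O)):
        # alpha_j <- logsumexp_k(alpha_k + T[k, j]) + O[i, j]
        alpha = np.log(np.exp(alpha[:, None] + T).sum(axis=0)) + O[i]
    return float(np.log(np.exp(alpha).sum()))

def neg_log_likelihood(O, T, y):
    """L = -log P(y | S) = log Z - score(S, y)."""
    emit = sum(O[i, tag] for i, tag in enumerate(y))
    trans = sum(T[y[i - 1], y[i]] for i in range(1, len(y)))
    return log_partition(O, T) - (emit + trans)

O = np.array([[1.0, 0.0], [0.0, 2.0]])     # illustrative emissions, 2 characters
T = np.array([[0.5, -0.5], [0.0, 1.0]])    # illustrative transitions
loss = neg_log_likelihood(O, T, [0, 1])
```

The loss is always positive here because the true sequence never receives all of the probability mass.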
(4) In prediction, the Viterbi algorithm is used to find the label sequence with the highest score:
y* = argmax_{y' ∈ Y_S} score(S, y')
wherein Y_S represents all possible label sequences.
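A minimal Viterbi decoder consistent with the formula above (toy matrices; a real implementation would operate on the BiLSTM emission scores and the learned transition matrix):

```python
import numpy as np

def viterbi(O, T):
    """Return the label sequence with the highest score under emissions O
    and transitions T (dynamic programming with backpointers)."""
    n, k = O.shape
    score = O[0].copy()
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + T       # cand[k, j]: best path ending at k, then k -> j
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + O[i]
    path = [int(score.argmax())]        # best final label
    for i in range(n - 1, 0, -1):       # follow backpointers
        path.append(int(back[i, path[-1]]))
    return path[::-1]

O = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])   # illustrative values
T = np.array([[0.5, -0.5], [0.0, 1.0]])
print(viterbi(O, T))   # → [1, 1, 1]
```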
Step S704: adjust the model to the optimum according to the comprehensive evaluation index F1 value on the validation data set, and save the optimal model M1.
Step S8: model testing, namely sending the test set data to the model to obtain predicted labels, comparing the predicted labels with the actual labels to count the numbers of correctly and incorrectly detected test samples, and computing the detection precision, recall and comprehensive evaluation index F1 value;
in an optional embodiment of the present invention, step S8 performs model testing: the test set data are sent to the model to obtain predicted labels, which are compared with the actual labels to count correct and incorrect detections among the test samples and to compute the detection precision and recall, specifically including the following sub-steps:
step S801: send the test set data to the trained multi-feature fusion model; for each input sequence S = {c_1, c_2, …, c_m}, obtain the labels predicted by the model, y = {y_1, y_2, …, y_m};
step S802: compare the labels y = {y_1, y_2, …, y_m} predicted by the model with the true labels, counting the number of entities the model predicts correctly, the total number of entities the model predicts, and the total number of entities in the data set;
step S803: the detection precision P (Precision), recall R (Recall) and comprehensive evaluation index F1 value are calculated according to the following formulas:
Precision = number of entities correctly predicted by the model / total number of entities predicted by the model
Recall = number of entities correctly predicted by the model / total number of entities in the data set
F1 = 2 × Precision × Recall / (Precision + Recall)
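Steps S801 to S803 reduce to a few lines once the three entity counts are available. The counts below are hypothetical examples, not the results reported in Table 2.

```python
def evaluate(n_correct, n_predicted, n_gold):
    """Entity-level precision, recall and F1 from the three counts of step S802."""
    precision = n_correct / n_predicted              # correct / total predicted entities
    recall = n_correct / n_gold                      # correct / total entities in dataset
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration
p, r, f1 = evaluate(n_correct=900, n_predicted=950, n_gold=990)
```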
The test results of the present invention are shown in table 2:
TABLE 2 recognition effect of alarm named entities of different models
The invention compares the named entity recognition results of 5 methods on the alarm text, comparing the precision, recall and comprehensive evaluation index F1 value of the different models. As can be seen from Table 2, the proposed multi-feature fusion named entity recognition model achieves the best performance, with a precision of 95.91%, a recall of 90.92% and a comprehensive evaluation index F1 value of 93.30%. The experimental results demonstrate the correctness and effectiveness of the method.
Step S9: the trained UMF-BiLSTM-CRF model is used to perform named entity recognition on the remaining unlabeled alarm texts in the alarm database.
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (8)
1. A multi-feature fusion named entity identification method for an alarm situation text is characterized by comprising the following steps:
step S1: extracting the alarm text information, classifying according to the alarm cases, and extracting corresponding types of alarm texts from an alarm database;
step S2: constructing a named entity identification data set of an alarm situation text, and dividing the named entity identification data set into a training set, a verification set and a test set;
and step S3: extracting character features, namely acquiring a character vector corresponding to each character of the data in the entity identification data set as character features aiming at the data in the entity identification data set;
and step S4: extracting tag characteristics, namely defining respective rules and dictionaries for various entities, performing character string matching by using the rules and the dictionaries, identifying an alarm text, performing vectorization expression on the obtained identification tags, and taking the vectorization expression of the tags as tag characteristics;
step S5: pinyin feature extraction, namely acquiring pinyin of each character in the warning situation text, performing vectorization expression, and taking the vectorization expression of the pinyin as the pinyin feature;
step S6: multi-feature fusion, namely fusing three feature vectors of character features, label features and pinyin features; the multi-feature fusion adopts one of direct splicing fusion, additive fusion or splicing fusion after feature extraction;
step S7: model training, namely constructing a multi-feature fusion named entity recognition model, extracting character features, label features and pinyin features from training set data, inputting the extracted character features, label features and pinyin features into a bidirectional long-short term memory network, and capturing constraints and dependency relations among labels by using a conditional random field;
step S8: model testing, namely sending test set data to the multi-feature fusion named entity recognition model to obtain predicted labels, comparing the predicted labels with the actual labels, counting the numbers of correct and incorrect detections in the test samples, and computing the recognition accuracy P, recall R and comprehensive evaluation index F1 value, wherein the comprehensive evaluation index F1 value is calculated as F1 = 2 × P × R / (P + R);
step S9: performing named entity recognition on the remaining unlabeled alarm texts in the alarm database by using the multi-feature fusion named entity recognition model trained in step S7.
2. The method for recognizing the alarm text-oriented multi-feature fusion named entity as claimed in claim 1, wherein in step S2, the specific steps of constructing the alarm named entity recognition data set are as follows:
step S201: data cleaning, namely cleaning the data of the alarm condition text, and removing abnormal symbols, messy codes and repeated data;
step S202: entity definition, namely self-defining 6 types of entities, including names of alarm persons, addresses of alarm cases, case-related articles, case properties, amount of case-related property and license plate numbers of case-related objects;
step S203: entity marking, namely manually marking the alarm text according to the self-defined 6 types of entities by adopting the character-granularity BIOES labeling standard;
step S204: and (3) data division, namely dividing the marked data into a training set, a verification set and a test set according to a certain proportion.
3. The method for recognizing the alarm text-oriented multi-feature fusion named entity as claimed in claim 2, wherein the step S3 of extracting the character features comprises the following specific steps:
step S301: training on a Baidu Encyclopedia corpus by adopting the text vectorization tool Word2Vec to obtain pre-trained word vectors, wherein the word vectors comprise character vectors and word vectors;
step S302: for each piece of data in the named entity identification data set constructed in the step S2, sequence definition is carried out on characters of each piece of data to obtain a text sequence S of each piece of data;
step S303: for each character in the text sequence S, a character vector corresponding to each character is determined as a character feature according to the word vector pre-trained in step S301.
4. The method for recognizing the alarm text-oriented multi-feature fusion named entity according to claim 3, wherein the step S4 of generating the pre-recognized tag feature specifically comprises the following steps:
step S401: defining different rules or dictionaries according to different entities, and preliminarily identifying the content of the warning situation text;
step S402: identifying the alarm-person entity, namely marking the text before the alarm word in the alarm text as the alarm-person entity;
step S403: identifying address entities, namely constructing an ending-word dictionary and, through character string matching, marking the words after the leading words and before the ending words in the alarm text as address entities;
step S404: identifying the entity of the involved case amount, and identifying the involved case amount in the warning situation text by using a regular expression;
step S405: identifying a vehicle license plate entity, namely identifying a license plate number entity in the warning text by using a regular expression;
step S406: identifying case-nature entities, namely, through analysis and statistics of the alarm text, dividing case types into 5 grades with 354 case types in total; different types of cases use different vocabularies to describe the nature of the case, and a case-nature dictionary is constructed to match case-nature entities;
step S407: identifying the entity of the involved articles, constructing a dictionary of the involved articles, and identifying by adopting character string matching;
step S408: label representation, namely, for an identified entity, representing the first character by "B-label", the middle characters by "I-label" and the last character by "E-label", and representing the remaining characters not matched by any rule with the label "O";
step S409: after the above rule and dictionary recognition, each character of the text sequence S has a corresponding label, yielding the label sequence L;
step S410: vectorization representation, namely constructing a tag embedded lookup table and carrying out vectorization representation on the identified tags; a vectorized representation of the tag is taken as a tag feature.
5. The method for recognizing the alarm text-oriented multi-feature fusion named entity as claimed in claim 4, wherein in the step S5, the step of obtaining the pinyin features of the characters comprises the steps of:
step S501: obtaining pinyin representations of different characters, wherein the tone of each character is placed behind the pinyin of the character;
step S502: vectorization expression, namely constructing a pinyin query table, vectorizing the pinyin, and taking the vectorization expression of the pinyin as the pinyin characteristics of the characters.
6. A method for recognizing a multi-feature fusion named entity oriented to an alert text as claimed in claim 5, wherein the multi-feature fusion in step S6 comprises the following steps:
step S601: the character features, the label features and the pinyin features of each character are fused, and three fusion modes are designed as follows:
1) The direct splicing and fusion mode is as follows: directly splicing the feature vectors in the three forms to form a final vector;
2) The adding and fusing mode is that the label characteristic, the pinyin characteristic and the character embedding dimension are the same, and then the values of the corresponding positions of each vector are directly added to form a final vector;
3) And after the features are extracted, splicing and merging, extracting and splicing the three features by using different bidirectional long-term and short-term memory networks, and forming a final vector by using the extracted features.
7. The method for recognizing the alarm text-oriented multi-feature fusion named entity of claim 6, wherein in step S7, the multi-feature fusion vector obtained in step S6 is fed into a BiLSTM-CRF model for training, specifically as follows:
step S701: updating a forgetting gate, an input gate and an output gate of the long-short term memory network, and updating a state unit according to the updated forgetting gate, the input gate and the output gate to generate a next implicit vector state;
step S702: obtaining a context vector for each character by adopting a BiLSTM model, which processes the input sequence in both the forward and backward directions;
step S703: constructing a conditional random field, and capturing the dependency relationship among the labels to obtain a predicted label sequence;
and step S704, adjusting the model to the current optimal value by using the verification data set, and storing the current optimal model.
8. The method for recognizing the multi-feature fusion named entity oriented to the alert text according to claim 7, wherein in the step S8, model testing is performed, test set data is sent to the multi-feature fusion named entity recognition model to obtain a prediction tag, the prediction tag is compared with an actual tag, the number of correct and false detections in a test sample is calculated, and a recognition accuracy P, a recall rate R and a comprehensive evaluation index F1 value are obtained, and the specific steps are as follows:
step S801: sending the test set data to a trained multi-feature fusion model, and obtaining an identified label for each input sequence;
step S802: comparing the labels predicted by the model with the real labels, and counting the correct entity number predicted by the model, the total entity number predicted by the model and the total entity number in the data set;
step S803: and calculating the detection accuracy P, the recall rate R and the comprehensive evaluation index F1 value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211063791.3A CN115146644B (en) | 2022-09-01 | 2022-09-01 | Alarm situation text-oriented multi-feature fusion named entity identification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211063791.3A CN115146644B (en) | 2022-09-01 | 2022-09-01 | Alarm situation text-oriented multi-feature fusion named entity identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115146644A true CN115146644A (en) | 2022-10-04 |
CN115146644B CN115146644B (en) | 2022-11-22 |
Family
ID=83416670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211063791.3A Active CN115146644B (en) | 2022-09-01 | 2022-09-01 | Alarm situation text-oriented multi-feature fusion named entity identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115146644B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116050418A (en) * | 2023-03-02 | 2023-05-02 | 浙江工业大学 | Named entity identification method, device and medium based on fusion of multi-layer semantic features |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165384A (en) * | 2018-08-23 | 2019-01-08 | 成都四方伟业软件股份有限公司 | A kind of name entity recognition method and device |
CN112434520A (en) * | 2020-11-11 | 2021-03-02 | 北京工业大学 | Named entity recognition method and device and readable storage medium |
CN113190656A (en) * | 2021-05-11 | 2021-07-30 | 南京大学 | Chinese named entity extraction method based on multi-label framework and fusion features |
CN113536799A (en) * | 2021-08-10 | 2021-10-22 | 西南交通大学 | Medical named entity recognition modeling method based on fusion attention |
CN114580416A (en) * | 2022-03-01 | 2022-06-03 | 海南大学 | Chinese named entity recognition method and device based on multi-view semantic feature fusion |
US20220188520A1 (en) * | 2019-03-26 | 2022-06-16 | Benevolentai Technology Limited | Name entity recognition with deep learning |
CN114927177A (en) * | 2022-05-27 | 2022-08-19 | 浙江工业大学 | Medical entity identification method and system fusing Chinese medical field characteristics |
-
2022
- 2022-09-01 CN CN202211063791.3A patent/CN115146644B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165384A (en) * | 2018-08-23 | 2019-01-08 | 成都四方伟业软件股份有限公司 | A kind of name entity recognition method and device |
US20220188520A1 (en) * | 2019-03-26 | 2022-06-16 | Benevolentai Technology Limited | Name entity recognition with deep learning |
CN112434520A (en) * | 2020-11-11 | 2021-03-02 | 北京工业大学 | Named entity recognition method and device and readable storage medium |
CN113190656A (en) * | 2021-05-11 | 2021-07-30 | 南京大学 | Chinese named entity extraction method based on multi-label framework and fusion features |
CN113536799A (en) * | 2021-08-10 | 2021-10-22 | 西南交通大学 | Medical named entity recognition modeling method based on fusion attention |
CN114580416A (en) * | 2022-03-01 | 2022-06-03 | 海南大学 | Chinese named entity recognition method and device based on multi-view semantic feature fusion |
CN114927177A (en) * | 2022-05-27 | 2022-08-19 | 浙江工业大学 | Medical entity identification method and system fusing Chinese medical field characteristics |
Non-Patent Citations (1)
Title |
---|
王月 等: "基于BERT的警情文本命名实体识别", 《计算机应用》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116050418A (en) * | 2023-03-02 | 2023-05-02 | 浙江工业大学 | Named entity identification method, device and medium based on fusion of multi-layer semantic features |
CN116050418B (en) * | 2023-03-02 | 2023-10-31 | 浙江工业大学 | Named entity identification method, device and medium based on fusion of multi-layer semantic features |
Also Published As
Publication number | Publication date |
---|---|
CN115146644B (en) | 2022-11-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109800310B (en) | Electric power operation and maintenance text analysis method based on structured expression | |
CN110532542B (en) | Invoice false invoice identification method and system based on positive case and unmarked learning | |
JP4568774B2 (en) | How to generate templates used in handwriting recognition | |
Nguyen et al. | Distinguishing antonyms and synonyms in a pattern-based neural network | |
CN107729313B (en) | Deep neural network-based polyphone pronunciation distinguishing method and device | |
CN107797987B (en) | Bi-LSTM-CNN-based mixed corpus named entity identification method | |
US11055327B2 (en) | Unstructured data parsing for structured information | |
CN108763510A (en) | Intension recognizing method, device, equipment and storage medium | |
CN110826335A (en) | Named entity identification method and device | |
CN113742733B (en) | Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type | |
CN112395421B (en) | Course label generation method and device, computer equipment and medium | |
CN115146644B (en) | Alarm situation text-oriented multi-feature fusion named entity identification method | |
CN114756675A (en) | Text classification method, related equipment and readable storage medium | |
CN116502628A (en) | Multi-stage fusion text error correction method for government affair field based on knowledge graph | |
CN116127953A (en) | Chinese spelling error correction method, device and medium based on contrast learning | |
CN115269834A (en) | High-precision text classification method and device based on BERT | |
CN107992468A (en) | A kind of mixing language material name entity recognition method based on LSTM | |
CN114925702A (en) | Text similarity recognition method and device, electronic equipment and storage medium | |
CN115098673A (en) | Business document information extraction method based on variant attention and hierarchical structure | |
CN114764566A (en) | Knowledge element extraction method for aviation field | |
CN113220964B (en) | Viewpoint mining method based on short text in network message field | |
CN111078874B (en) | Foreign Chinese difficulty assessment method based on decision tree classification of random subspace | |
CN112347783A (en) | Method for identifying types of alert condition record data events without trigger words | |
CN111523311A (en) | Search intention identification method and device | |
CN115309899B (en) | Method and system for identifying and storing specific content in text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||