CN115146644A - Multi-feature fusion named entity identification method for warning situation text - Google Patents


Info

Publication number
CN115146644A
CN115146644A
Authority
CN
China
Prior art keywords: text, character, features, alarm, label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211063791.3A
Other languages
Chinese (zh)
Other versions
CN115146644B (en)
Inventor
徐同阁
王昊旻
杨立群
刘连忠
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202211063791.3A priority Critical patent/CN115146644B/en
Publication of CN115146644A publication Critical patent/CN115146644A/en
Application granted granted Critical
Publication of CN115146644B publication Critical patent/CN115146644B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of named entity recognition in natural language processing, and in particular to a multi-feature fusion named entity recognition method for alarm texts. First, a data set for alarm named entity recognition is constructed, the entity types to be recognized are defined, and the data are divided into a training set, a validation set and a test set. Second, character features of the text are obtained from pre-trained word vectors, pre-identified label features are obtained by matching the text against rules and a dictionary, and pinyin features are obtained by converting the text to pinyin. Finally, the three features are fused and fed into a bidirectional long short-term memory network with a conditional random field for named entity recognition. By fusing the text's character features, pre-identified label features and pinyin features, the method effectively represents the ambiguity of characters and improves the precision, recall and comprehensive evaluation index (F1 score) of alarm-text named entity recognition.

Description

Alarm situation text-oriented multi-feature fusion named entity identification method
Technical Field
The invention relates to the technical field of named entity recognition in natural language processing, and in particular to a multi-feature fusion named entity recognition method for alarm texts.
Background
Named entity recognition, a subtask of information extraction, aims to identify and classify entities with specific meanings in text, such as names of people, places, organizations and proper nouns; it is an upstream task for text understanding, machine translation, question-answering systems and other natural language processing applications. The core of alarm named entity recognition is to effectively recognize alarm elements such as the place where a case occurred and the means by which it was committed, providing strong data support for downstream alarm-analysis tasks; it is key to building alarm knowledge graphs and to series-parallel case analysis, anomaly warning, high-incidence warning and similar tasks. Accurately identifying and extracting effective named entities from alarm texts has therefore become one of the fundamental and critical tasks in alarm analysis and mining.
Many mature models and algorithms exist for named entity recognition. For example, CN111460820B discloses a method and apparatus for named entity recognition in the cyberspace-security domain based on the pre-trained model BERT. The input text is segmented with the BERT tokenizer WordPiece; all resulting tokens are fed into the BERT model to obtain output vector representations, which are then passed through a Highway network and a classifier that maps each token vector to the dimension of the label set, giving the final token representations. The loss is computed with a cross-entropy loss function using only the first token of each word and back-propagated to update the model parameters, producing a trained model for recognizing security-domain named entities.
CN111460824B discloses an unlabeled named entity recognition method based on adversarial transfer learning. To build the model, text from the source or target domain is first mapped to word-embedding vectors; the embeddings are fed into a bidirectional long short-term memory network to extract feature vectors; the feature vectors are passed to an adversarial discriminator that maps source-domain and target-domain data to the same distribution space; the feature vectors are also fed into a conditional random field, which computes the probability of every possible label sequence of the input text and selects the most probable sequence as the final prediction; the optimal model parameters are obtained by jointly training the named entity recognition task and the adversarial training task; finally, target-domain data are input and prediction labels are output through the CRF (conditional random field) layer.
As noted above, although mature models and algorithms exist for named entity recognition, general-purpose models perform poorly in specific domains such as alarm analysis, and alarm named entity recognition faces challenges beyond those of general-domain research. First, the alarm domain currently lacks high-quality labeled data, so alarm named entities must be defined for the domain and a dedicated data set labeled for their recognition. Second, although the sentence patterns and grammar of alarm texts often have fairly fixed characteristics, existing work pursues end-to-end neural models that ignore domain knowledge, separating the entity recognition model from domain rules and dictionaries instead of effectively fusing external knowledge with the model. Third, an alarm text is report information recorded by the duty officer from the caller's description; it is usually non-standard and highly colloquial, and may contain typos and homophone errors (e.g. in place or person names) caused by pronunciation, yet current named entity recognition models do not consider the pinyin characteristics of the text.
Disclosure of Invention
Aiming at the current lack of high-quality labeled data and of an effective named entity recognition method in the alarm domain, the invention provides a multi-feature fusion named entity recognition method for alarm texts, in which text character features, label features pre-identified through rules and dictionaries, and pinyin features are fused to jointly drive an alarm named entity recognition model. Applied to a self-built alarm-text data set, the method effectively improves the precision, recall and comprehensive evaluation index (F1 score) of alarm named entity recognition.
The specific technical scheme of the invention is as follows:
A multi-feature fusion named entity recognition method for alarm texts comprises the following steps:
step S1: extracting the alarm text information, classifying according to the alarm cases, and extracting corresponding types of alarm texts from an alarm database;
step S2: constructing a named entity identification data set of an alarm situation text, and dividing the named entity identification data set into a training set, a verification set and a test set;
step S3: character feature extraction, i.e. for each piece of data in the entity recognition data set, acquiring the character vector corresponding to each character as its character feature;
step S4: label feature extraction, i.e. defining rules and dictionaries for each entity type, performing string matching with them to pre-identify the alarm text, vectorizing the resulting recognition labels, and using the vectorized labels as label features;
step S5: the method comprises the steps of pinyin feature extraction, wherein the pinyin of each character in an alarm text is obtained and vectorized, and the vectorized representation of the pinyin is used as the pinyin feature of the character;
step S6: multi-feature fusion, namely fusing three feature vectors of character features, label features and pinyin features; the multi-feature fusion adopts one of direct splicing fusion, additive fusion or splicing fusion after feature extraction;
step S7: model training, namely constructing a multi-feature fusion named entity recognition model, extracting character features, label features and pinyin features from training set data, inputting the extracted character features, label features and pinyin features into a bidirectional long-short term memory network, and capturing constraints and dependency relations among labels by using a conditional random field;
step S8: model testing, namely sending test set data to a multi-feature fusion named entity recognition model to obtain a prediction label, comparing the prediction label with an actual label, calculating the number of correct and wrong detections in a test sample, and solving a recognition accuracy P, a recall rate R and a comprehensive evaluation index F1 value, wherein the calculation mode of the comprehensive evaluation index F1 value is as follows:
F1 = 2 × P × R / (P + R)
step S9: carry out named entity recognition on the remaining unlabeled alarm texts in the alarm database using the multi-feature fusion named entity recognition model trained in step S7.
In step S2, the specific steps of constructing the alert naming entity identification data set are as follows:
step S201: data cleaning, i.e. cleaning the alarm text data to remove abnormal symbols, garbled characters and duplicate data;
step S202: entity definition, i.e. custom-defining six entity types: alarm-person name, case address, case-related article, case nature, amount of case-related property, and license plate number of a case-related vehicle;
step S203: entity labeling, i.e. manually labeling the alarm text with the six custom entity types using the character-granularity BIOES labeling scheme;
step S204: and (3) data division, namely dividing the marked data into a training set, a verification set and a test set according to a certain proportion.
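The character-granularity BIOES labeling of steps S202-S203 can be sketched as follows; the helper function, example sentence and entity spans are illustrative stand-ins, not taken from the patent's data set:

```python
# Toy sketch of character-level BIOES labeling; entity spans are hypothetical.
def bioes_tags(sentence, entities):
    """entities: list of (start, end_exclusive, type) character spans."""
    tags = ["O"] * len(sentence)
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S-{etype}"          # single-character entity
        else:
            tags[start] = f"B-{etype}"          # beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # inside
            tags[end - 1] = f"E-{etype}"        # end
    return tags

sent = "张三在人民路丢失手机"   # "Zhang San lost a phone on Renmin Road"
spans = [(0, 2, "PER"), (3, 6, "LOC"), (8, 10, "PRO")]
print(list(zip(sent, bioes_tags(sent, spans))))
```

Each character thus receives exactly one tag, which is what the BiLSTM-CRF model later predicts.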
In step S3, the specific steps of the character feature extraction process are as follows:
step S301: train the text-vectorization tool Word2Vec on a Baidu encyclopedia corpus to obtain pre-trained word vectors, which include character vectors and word vectors;
step S302: for each piece of data in the named entity identification data set constructed in the step S2, sequence definition is carried out on characters of each piece of data to obtain a text sequence S of each piece of data;
step S303: for each character in the text sequence S, a character vector corresponding to each character is determined as a character feature according to the word vector pre-trained in step S301.
In step S4, generating a pre-identified tag feature, specifically including the following steps:
step S401: defining different rules or dictionaries according to different entities, and preliminarily identifying the content of the warning text;
step S402: alarm-person entity identification: the text before the character '报' ('report') in the alarm text is labeled as the alarm-person entity;
step S403: address entity identification: an ending-word dictionary is constructed, and through string matching the characters after the trigger word and before an ending word in the alarm text are labeled as the address entity;
step S404: identifying the entity of the involved case amount, and identifying the involved case amount in the warning situation text by using a regular expression;
step S405: identifying a vehicle license plate entity, namely identifying a license plate number entity in the warning text by using a regular expression;
step S406: case-nature entity identification: through analysis and statistics of the alarm texts, case types are divided into 5 levels, 354 case types in total; different case types use different vocabularies to describe the nature of the case, so a case-nature dictionary is constructed to match case-nature entities;
step S407: identifying the entity of the involved articles, constructing a dictionary of the involved articles, and identifying by adopting character string matching;
step S408: label representation: for each identified entity, the first character is represented by a "B-" label, middle characters by "I-" labels and the last character by an "E-" label; all remaining characters not matched by any rule are given the label "O";
step S409: after the above rule and dictionary recognition, each character of the text sequence S has a corresponding label, yielding a label sequence L;
step S410: vectorization representation, namely constructing a tag embedded lookup table and carrying out vectorization representation on the identified tags; a vectorized representation of the tag is taken as a tag feature.
In step S5, the obtaining of the pinyin features of the characters includes the following steps:
step S501: obtaining pinyin representations of different characters, wherein the tone of each character is placed behind the pinyin of the character;
step S502: vectorization expression, namely constructing a pinyin query table, vectorizing the pinyin, and taking the vectorization expression of the pinyin as the pinyin characteristics of the characters.
For the multi-feature fusion described in step S6, the steps are as follows:
step S601: the character characteristics, the label characteristics and the pinyin characteristics of each character are fused, and three fusion modes are designed as follows:
1) Direct concatenation fusion: the three feature vectors are directly concatenated to form the final vector;
2) Additive fusion: the label feature and the pinyin feature are given the same dimension as the character embedding, and the values at corresponding positions of the vectors are added to form the final vector;
3) Concatenation after feature extraction: the three features are passed through separate bidirectional long short-term memory networks, and the extracted features are concatenated to form the final vector.
In step S7, the multi-feature fusion vector obtained in step S6 is fed into a BiLSTM-CRF model for training, which may be expressed as:
step S701: updating a forgetting gate, an input gate and an output gate of the long-short term memory network, and updating a state unit according to the updated forgetting gate, the input gate and the output gate to generate a next implicit vector state;
step S702: obtain the context vector of each character with the BiLSTM model, which processes the input sequence in both the forward and backward directions;
step S703: constructing a conditional random field, and capturing the dependency relationship among the labels to obtain a predicted label sequence;
step S704: tune the model to the current optimum using the validation data set and save the current best model.
In step S8 the model is tested: test-set data are fed into the multi-feature fusion named entity recognition model to obtain predicted labels, the predictions are compared with the actual labels, the numbers of correct and incorrect detections in the test samples are counted, and the precision P, recall R and comprehensive evaluation index F1 are computed. The specific steps are as follows:
step S801: sending the test set data to a trained multi-feature fusion model, and obtaining an identified label for each input sequence;
step S802: comparing the labels predicted by the model with the real labels, and counting the correct entity number predicted by the model, the total entity number predicted by the model and the total entity number in the data set;
step S803: and calculating the detection accuracy P, the recall rate R and the comprehensive evaluation index F1 value.
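The metric computation of steps S801-S803 can be sketched as follows; the entity counts are illustrative, not results from the patent:

```python
# Entity-level precision, recall and F1 from the three counts of step S802.
def prf1(n_correct, n_predicted, n_gold):
    p = n_correct / n_predicted if n_predicted else 0.0   # precision
    r = n_correct / n_gold if n_gold else 0.0             # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0            # harmonic mean
    return p, r, f1

# Hypothetical counts: 90 correct out of 100 predicted, 96 gold entities.
p, r, f1 = prf1(n_correct=90, n_predicted=100, n_gold=96)
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
```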
Compared with the prior art, the invention has the following beneficial effects:
Aiming at the lack of high-quality labeled data in the alarm domain, the invention defines six types of alarm named entities and builds a corpus for alarm named entity recognition through manual labeling. It designs a multi-feature fusion UMF-BiLSTM-CRF model for alarm named entity recognition that effectively improves recognition accuracy, with an F1 score reaching up to 93.3%, 1.56 percentage points higher than existing models.
The invention uses character vectors pre-trained on a Baidu encyclopedia corpus as character features; compared with training directly on self-labeled domain data, this improves the warm-start capability of the network, accelerates model training and further improves the overall recognition performance of the model.
The invention fully exploits the characteristics of alarm texts: corresponding entities are pre-identified by defined rules and dictionaries, the identified labels are used as features, and pinyin features are added to enrich the representation of each character. Three feature-fusion modes are provided that fully fuse the character, label and pinyin features. Experimental results verify the effectiveness of the method, which can be used to recognize named entities in unlabeled alarm texts.
Drawings
To clearly illustrate the embodiments of the present invention and the technical solutions in the prior art, the following description refers to the accompanying drawings, which are for illustration only and should not be construed as limiting the present invention.
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a direct splicing fusion method.
FIG. 3 is a schematic diagram of an additive fusion approach.
FIG. 4 is a schematic diagram of a splicing and merging method after extracting features of a bidirectional long-term and short-term memory network.
FIG. 5 is a diagram of a named entity recognition model with multi-feature fusion.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the present embodiment, unless otherwise specified, sequence notation denotes an ordered collection of objects: in a text sequence S = {c_1, c_2, …, c_m}, c_1 denotes the first object in the sequence, c_2 the second, and so on, so that c_i denotes the i-th object; the same convention applies throughout this document.
The invention provides a multi-feature fusion named entity recognition method for alarm texts that fuses character features with label features pre-identified by rules and dictionaries and with pinyin features. Its flow is shown in FIG. 1, and it comprises the following steps:
step S1: alarm text extraction: the classification of alarm cases is a predefined hierarchical structure of at most 5 levels from coarse to fine, with finer case types divided under each case type according to service requirements; alarm texts covering all case types of the first two levels are randomly extracted from the alarm database.
Step S2: and constructing a named entity recognition data set of the alarm situation text, and dividing the named entity recognition data set into a training set, a verification set and a test set.
In an optional embodiment of the present invention, the step S2 of constructing the alarm named entity identification data set specifically includes the following sub-steps:
step S201: data cleaning: the alarm texts are cleaned to remove abnormal symbols, garbled characters and duplicate data;
step S202: entity definition: six entity types are custom-defined: alarm-person name (PER), case address (LOC), case-related article (PRO), case nature (CT), amount of case-related property (MON) and license plate number of a case-related vehicle (CAR); the labeling specification of alarm named entities is shown in Table 1;
Table 1. Labeling specification for alarm named entities (provided as an image in the original publication)
step S203: entity labeling: the alarm texts are manually labeled with the six custom entity types, using the character-granularity BIOES labeling scheme.
step S204: data division: the labeled data are divided into 2000 training, 295 validation and 277 test samples, approximately an 8:1:1 split.
And step S3: and extracting character features, and vectorizing characters in the warning situation text to be used as character features.
In an optional embodiment of the present invention, step S3 obtains the character features. The basic process: for each sentence in the alarm text, its characters are extracted and vectorized, each vector being a multidimensional numeric representation; the vectorized characters are used as character features. The specific steps are:
step S301: in this embodiment, Word2Vec is used and trained on a Baidu encyclopedia corpus to obtain pre-trained word vectors, which include character vectors and word vectors;
step S302: obtaining the dimensionality, the embedding matrix, the dictionary size, a dictionary-index table and an index-dictionary table of the word vector according to pre-training;
step S303: for each piece of data in the data set constructed in step S2, define a text sequence S = {c_1, c_2, …, c_m}, where c_i denotes the i-th character in the sequence and m the length of the sentence;
step S304: obtain the character vector e_i^c corresponding to each character from the dictionary-index table and the embedding matrix, and use it as the character feature.
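Steps S302-S304 amount to an embedding lookup. A minimal sketch, with a toy vocabulary and a random matrix standing in for the pre-trained Baidu-encyclopedia Word2Vec embeddings:

```python
import numpy as np

# Map each character to its vector via a vocabulary (dictionary-index table)
# and an embedding matrix; both are illustrative stand-ins.
rng = np.random.default_rng(0)
vocab = {"<unk>": 0, "报": 1, "警": 2, "人": 3}     # dictionary-index table
embed = rng.normal(size=(len(vocab), 8))            # embedding matrix, dim c = 8

def char_features(text):
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    return embed[ids]                               # one row e_i^c per character

feats = char_features("报警人")
print(feats.shape)   # (3, 8): one 8-dim character feature per character
```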
And step S4: and extracting tag features, namely defining different rules and dictionaries for different entities, matching character strings by using the rules and the dictionaries, carrying out primary recognition on the warning situation text, and taking the obtained recognition tag as the tag features.
In an optional embodiment of the present invention, the step S4 of generating the pre-identified tag feature specifically includes the following sub-steps:
step S401: defining different rules or dictionaries according to different entities, and preliminarily identifying the content of the warning text;
step S402: alarm-person entity identification: the alarm-person entity follows the principle that the text before the character '报' ('report') names the person reporting, so that text is labeled as the alarm-person entity;
step S403: address entity identification: the address entity follows the principle that it begins after a trigger word and usually ends with certain words that describe an address, so an ending-word dictionary can be constructed; through string matching, the characters after the trigger word and before an ending word are labeled as the address entity;
step S404: case-amount entity identification: the amount entity usually begins with digits and ends with the character '元' (yuan), so it can be pre-identified with a regular expression;
step S405: license-plate entity identification: a license plate number combines a province abbreviation with six letters and digits, so the plate-number entity in the alarm text can be identified with a regular expression;
step S406: case-nature entity identification: through analysis and statistics of the alarm texts, case types are divided into 5 levels, 354 case types in total; different case types use different vocabularies to describe their nature, such as 'stolen', 'robbed' and 'dispute', so a case-nature dictionary can be constructed for matching;
step S407: identifying the entity of the involved articles, constructing a common involved article dictionary, and identifying by adopting character string matching;
step S408: label representation: for each identified entity, the first character is represented by a "B-" label, middle characters by "I-" labels and the last character by an "E-" label; the remaining characters not matched by any rule are given the label "O";
step S409: after the above rule and dictionary recognition, every character of the text sequence S = {c_1, c_2, …, c_m} has a corresponding label, yielding a label sequence L = {label_1, label_2, …, label_m}, where label_i is the label of the i-th character;
step S410: vectorized representation: a label-embedding lookup table e_label is built and the identified labels are vectorized; for the i-th character c_i, the vectorized label is e_i^t = e_label(label_i), and this vector is used as the label feature of the character.
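The regular-expression rules of steps S404-S405 can be sketched as follows; the patterns are illustrative simplifications (amounts as digits followed by '元', plates as a province character plus six letters/digits), not the patent's actual rules:

```python
import re

# Hypothetical pre-identification patterns for amounts and license plates.
MONEY = re.compile(r"\d+(?:\.\d+)?(?:万|千)?元")
PLATE = re.compile(
    r"[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁琼]"
    r"[A-Z][A-Z0-9]{5}"
)

text = "被盗现金3000元，嫌疑车辆车牌号京A12345"
print(MONEY.findall(text))   # ['3000元']
print(PLATE.findall(text))   # ['京A12345']
```

The matched spans would then be converted to B/I/E labels as in step S408.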
Step S5: pinyin feature extraction, namely taking the pinyin representation of each character as the pinyin feature of the character;
in an optional embodiment of the present invention, the obtaining of the pinyin characteristics of the character in the step S5 specifically includes the following sub-steps:
step S501: obtain the pinyin representation of each character using the pypinyin package for Python, with the tone placed after the pinyin of the character; for example, the pinyin of the character '风' ('wind') is represented as 'feng1';
step S502: vectorized representation: a pinyin lookup table e_pinyin is built and the pinyin is vectorized; for the character '风' with pinyin 'feng1', the vectorized pinyin e_i^p = e_pinyin('feng1') is used as the pinyin feature of the character.
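Steps S501-S502 can be sketched as a two-stage lookup. Here a tiny hand-written table stands in for pypinyin and for the learned pinyin embedding; both the entries and the vectors are illustrative assumptions:

```python
# Toy pinyin table (pypinyin stand-in): tone digit follows the pinyin string.
PINYIN = {"风": "feng1", "报": "bao4"}
# Toy pinyin-embedding lookup table e_pinyin.
e_pinyin = {"feng1": [0.1, 0.2], "bao4": [0.3, 0.4]}

def pinyin_feature(char):
    return e_pinyin[PINYIN[char]]   # e_i^p for character c_i

print(pinyin_feature("风"))   # [0.1, 0.2]
```

Because homophones share a pinyin string, they share a pinyin feature, which is what lets the model tolerate homophone errors in the alarm text.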
Step S6: multi-feature fusion, namely fusing three feature vectors of character features, label features and pinyin features;
in an optional embodiment of the present invention, the step S6 performs multi-feature fusion, and proposes three fusion manners, and schematic diagrams can be seen in fig. 2 to 4, and specifically includes the following sub-steps:
step S601: let e_c_i, e_t_i, and e_p_i denote, respectively, the character feature, tag feature, and pinyin feature of character c_i. Three modes are designed to fuse the three features:
1) Concat direct-splicing fusion: the three feature vectors are directly concatenated, so for character c_i the final fused vector is e_i = [e_c_i ; e_t_i ; e_p_i], where [ ; ] denotes concatenation. If the character embedding dimension is c, the tag embedding dimension is t, and the pinyin embedding dimension is p, the dimension of the final embedded vector is c+t+p.
2) Add direct-addition fusion: direct addition requires that the dimensions agree, so the tag and pinyin features must have the same dimension as the character embedding; the values at the corresponding positions of the vectors are then added directly. The resulting vector is e_i = e_c_i + e_t_i + e_p_i.
3) Splicing fusion after LSTM feature extraction: the three features are extracted by separate bidirectional long short-term memory networks, and the extracted features are spliced to form the final vector e_i = [BiLSTM_c(e_c_i) ; BiLSTM_t(e_t_i) ; BiLSTM_p(e_p_i)]. For each character c_i, the fused feature vector produced by the chosen fusion mode is denoted x_i.
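The three fusion modes of step S601 can be illustrated with plain Python lists; the dimensions, the zero-padding "projection", and the identity placeholder standing in for the per-feature BiLSTM are all assumptions made for this sketch:

```python
# Toy per-character features (char dim c=4, tag dim t=2, pinyin dim p=2).
e_char = [0.1, 0.2, 0.3, 0.4]
e_tag = [0.5, 0.6]
e_pin = [0.7, 0.8]

# 1) Concat fusion: dimensions add up to c + t + p.
concat = e_char + e_tag + e_pin
assert len(concat) == 4 + 2 + 2

# 2) Add fusion: all three features must first share one dimension.
#    Zero-padding stands in for a learned projection to dim 4 here.
e_tag4 = e_tag + [0.0, 0.0]
e_pin4 = e_pin + [0.0, 0.0]
added = [a + b + c for a, b, c in zip(e_char, e_tag4, e_pin4)]

# 3) Concat after per-feature BiLSTM extraction: each feature type would be
#    passed through its own BiLSTM before splicing; an identity function
#    stands in for BiLSTM_k(v) in this sketch.
def extract(v):
    return v

fused = extract(e_char) + extract(e_tag4) + extract(e_pin4)
print(len(concat), added[0], len(fused))
```

Mode 1 grows the input dimension, mode 2 keeps it fixed at the character-embedding size, and mode 3 trades extra parameters (three BiLSTMs) for per-feature context before splicing.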
Step S7: model training: construct the multi-feature fusion named entity recognition model UMF-BiLSTM-CRF, where UMF stands for Unified Multi-Feature; the three features extracted from the training set data are input into a bidirectional long short-term memory network (BiLSTM), and a conditional random field (CRF) captures the constraints and dependencies among the labels;
in an optional embodiment of the present invention, the multi-feature fusion vector obtained in step S7 is fed into a BiLSTM-CRF model for training, and a model schematic diagram can be shown in fig. 5, which specifically includes the following sub-steps:
step S701: update the gate structures of the long short-term memory network (LSTM); the specific formulas are as follows:
(1) Update the forget gate of the LSTM:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where f_t is the forget gate of the LSTM, σ is the Sigmoid function, W_f is the weight matrix of the forget gate, x_t is the input vector, h_{t-1} is the hidden state of the LSTM at the time step preceding the current time t, and b_f is the bias term of the forget gate;
(2) Update the input gate of the LSTM:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

where i_t is the input gate of the LSTM, W_i is the weight matrix of the input gate, and b_i is the bias term of the input gate;
(3) Update the output gate of the LSTM:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

where o_t is the output gate of the LSTM, W_o is the weight matrix of the output gate, and b_o is the bias term of the output gate;
(4) Update the cell state:

c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

where f_t is the forget gate of the LSTM, c_{t-1} is the cell state at the previous time step, W_c is the weight matrix of the cell state, and b_c is the bias term of the cell state;
(5) Generate the next hidden state:

h_t = o_t ⊙ tanh(c_t)

where h_t is the hidden state of the LSTM at the current time t, c_t is the cell state, and tanh is the hyperbolic tangent function;
step S702: the present embodiment uses the BiLSTM model to obtain a context vector for each character. BiLSTM, a bidirectional LSTM model, processes the input sequence in both the forward and backward directions. For a given sentence, let h_t^fwd denote the forward output at time t and h_t^bwd the backward output; the hidden state of the BiLSTM at time t, taken as the output, is expressed as:

h_t = [h_t^fwd ; h_t^bwd]
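A scalar (one-dimensional) sketch of steps S701-S702: the five gate equations are applied per time step, and a forward plus a backward pass are concatenated as in the BiLSTM. The shared weight value 0.5 is arbitrary, chosen only so the sketch runs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step following the forget/input/output/cell equations."""
    f = sigmoid(w["wf_h"] * h_prev + w["wf_x"] * x + w["bf"])        # forget gate
    i = sigmoid(w["wi_h"] * h_prev + w["wi_x"] * x + w["bi"])        # input gate
    o = sigmoid(w["wo_h"] * h_prev + w["wo_x"] * x + w["bo"])        # output gate
    c_tilde = math.tanh(w["wc_h"] * h_prev + w["wc_x"] * x + w["bc"])
    c = f * c_prev + i * c_tilde                                      # cell state
    h = o * math.tanh(c)                                              # hidden state
    return h, c

# All weights set to the same arbitrary value for illustration.
w = {k: 0.5 for k in ["wf_h", "wf_x", "bf", "wi_h", "wi_x", "bi",
                      "wo_h", "wo_x", "bo", "wc_h", "wc_x", "bc"]}

def run(xs):
    """Run the cell over a sequence; return the hidden state at each step."""
    h, c, out = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, w)
        out.append(h)
    return out

xs = [1.0, -0.5, 0.3]
fwd = run(xs)                          # forward pass
bwd = list(reversed(run(xs[::-1])))    # backward pass, re-aligned to positions
# BiLSTM output at time t: concatenation [h_t^fwd, h_t^bwd]
bi = [(f, b) for f, b in zip(fwd, bwd)]
print(len(bi))
```

Real implementations operate on vectors with matrix weights, but the gate arithmetic per dimension is exactly what this scalar version computes.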
step S703: and (3) constructing a conditional random field, and capturing the dependency relationship among the labels to obtain a predicted label sequence.
(1) Compute the emission matrix O, where O is the emission matrix and H = {h_1, h_2, …, h_m} is the state sequence output by the BiLSTM; each BiLSTM state is mapped to a score for every label class.
(2) For a tag sequence y = {y_1, y_2, …, y_m}, its score is defined as:

score(S, y) = Σ_{i=1..m} O_{i, y_i} + Σ_{i=1..m-1} T_{y_i, y_{i+1}}    (7)

where O is the emission matrix, i.e., the score of each label class prediction; O_{i, y_i} denotes the score of the i-th character being predicted as label y_i; T is the transition probability matrix, representing the transition probabilities between label classes; T_{y_i, y_{i+1}} denotes the probability of transferring from label y_i to label y_{i+1}; and Y_S denotes the set of all possible tag sequences.
(3) The loss function L during training is:

L = -log p(y | S) = -(score(S, y) - log Σ_{ỹ ∈ Y_S} exp(score(S, ỹ)))    (8)

where S is the sentence sequence, y is the tag sequence corresponding to the sentence, and p(y | S) denotes the probability that sentence S is labeled with tag sequence y.
In the training process, the model is trained by minimizing the negative log-likelihood probability at the sentence level.
(4) In prediction, the Viterbi algorithm is used to find the tag sequence y* with the highest score:

y* = argmax_{ỹ ∈ Y_S} score(S, ỹ)    (9)
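Equations (7)-(9) can be checked on a toy example: the brute-force enumeration below computes the sequence score, the partition term of the loss, and verifies that Viterbi decoding recovers the argmax. The emission and transition values are made up for illustration:

```python
import math
from itertools import product

# Toy sizes: m = 3 characters, K = 2 label classes.
O = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]   # emission scores O[i][y_i]
T = [[0.5, -0.4], [0.1, 0.3]]              # transition scores T[y][y']
K, m = 2, 3

def score(y):
    """Equation (7): emission scores plus label-transition scores."""
    return (sum(O[i][y[i]] for i in range(m))
            + sum(T[y[i]][y[i + 1]] for i in range(m - 1)))

# Equation (8): negative log-likelihood over all K**m candidate sequences.
all_seqs = list(product(range(K), repeat=m))
log_Z = math.log(sum(math.exp(score(y)) for y in all_seqs))
gold = (0, 1, 0)
loss = -(score(gold) - log_Z)

# Equation (9): Viterbi decoding of the best-scoring sequence.
def viterbi():
    dp = [O[0][:]]     # dp[i][y]: best score of a prefix ending at i with label y
    back = []
    for i in range(1, m):
        row, ptr = [], []
        for y in range(K):
            best_prev = max(range(K), key=lambda p: dp[-1][p] + T[p][y])
            row.append(dp[-1][best_prev] + T[best_prev][y] + O[i][y])
            ptr.append(best_prev)
        dp.append(row)
        back.append(ptr)
    y = max(range(K), key=lambda p: dp[-1][p])
    path = [y]
    for ptr in reversed(back):   # follow backpointers to recover the path
        y = ptr[y]
        path.append(y)
    return tuple(reversed(path))

best = viterbi()
assert best == max(all_seqs, key=score)   # Viterbi matches brute force
print(best, round(loss, 3))
```

Brute-force enumeration is exponential in m and only viable for toy sizes; a real CRF layer replaces the partition sum with the forward algorithm, while Viterbi already runs in O(m·K²).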
Step S704: using the validation data set, tune the model to the optimum according to the comprehensive evaluation index F1 value, and save the optimal model M1.
Step S8: model testing: send the test set data to the model to obtain predicted labels, compare them with the actual labels to count the correct/incorrect detections on the test samples, and compute the precision, recall, and comprehensive evaluation index F1 value;
in an optional embodiment of the present invention, step S8 performs model testing: the test set data is sent to the model to obtain predicted labels, which are compared with the actual labels to count the correct/incorrect detections on the test samples and to compute the precision and recall; this specifically includes the following sub-steps:
step S801: sending the test set data to a trained multi-feature fusion model, and inputting each sequenceS = {c 1 ,c 2 ,…,c m }Label y = { y) that will result in model prediction 1 ,y 2 ,…,y m };
Step S802: compare the model-predicted labels y = {y_1, y_2, …, y_m} with the true labels, and count the number of entities the model predicts correctly, the total number of entities predicted by the model, and the total number of entities in the dataset;
step S803: calculate the precision P, recall R, and comprehensive evaluation index F1 value according to the following formulas:

Precision P = number of entities predicted correctly / total number of entities predicted by the model
Recall R = number of entities predicted correctly / total number of entities in the dataset
F1 = 2 × P × R / (P + R)
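Step S803 reduces to three formulas over the entity counts from step S802; the counts below are hypothetical:

```python
def prf1(n_correct, n_pred, n_gold):
    """Precision, recall, and F1 from entity counts (step S803)."""
    p = n_correct / n_pred if n_pred else 0.0   # correct / predicted
    r = n_correct / n_gold if n_gold else 0.0   # correct / gold entities
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1

# Hypothetical counts: 50 predicted entities, 45 correct, 48 in the gold data.
p, r, f1 = prf1(45, 50, 48)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9 0.9375 0.9184
```

Note that F1 is the harmonic mean, so it is pulled toward the weaker of P and R, which is why it serves as the single model-selection criterion in step S704.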
The test results of the present invention are shown in table 2:
TABLE 2 recognition effect of alarm named entities of different models
(Table 2 is rendered as an image in the original publication.)
The invention compares the named entity recognition results of five methods on the alert text, contrasting the precision, recall, and comprehensive evaluation index F1 value of the different models. As Table 2 shows, the proposed multi-feature fusion named entity recognition model performs best, reaching a precision of 95.91%, a recall of 90.92%, and an F1 value of 93.30%. The experimental results demonstrate the correctness and effectiveness of the method.
Step S9: perform named entity recognition on the remaining unlabeled alert texts in the alert database using the trained UMF-BiLSTM-CRF.
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A multi-feature fusion named entity identification method for an alarm situation text is characterized by comprising the following steps:
step S1: extracting the alarm text information, classifying according to the alarm cases, and extracting corresponding types of alarm texts from an alarm database;
step S2: constructing a named entity identification data set of an alarm situation text, and dividing the named entity identification data set into a training set, a verification set and a test set;
and step S3: extracting character features, namely acquiring a character vector corresponding to each character of the data in the entity identification data set as character features aiming at the data in the entity identification data set;
and step S4: extracting tag characteristics, namely defining respective rules and dictionaries for various entities, performing character string matching by using the rules and the dictionaries, identifying an alarm text, performing vectorization expression on the obtained identification tags, and taking the vectorization expression of the tags as tag characteristics;
step S5: pinyin feature extraction, namely acquiring pinyin of each character in the warning situation text, performing vectorization expression, and taking the vectorization expression of the pinyin as the pinyin feature;
step S6: multi-feature fusion, namely fusing three feature vectors of character features, label features and pinyin features; the multi-feature fusion adopts one of direct splicing fusion, additive fusion or splicing fusion after feature extraction;
step S7: model training, namely constructing a multi-feature fusion named entity recognition model, extracting character features, label features and pinyin features from training set data, inputting the extracted character features, label features and pinyin features into a bidirectional long-short term memory network, and capturing constraints and dependency relations among labels by using a conditional random field;
step S8: model testing: sending test set data to the multi-feature fusion named entity recognition model to obtain predicted labels, comparing them with the actual labels, counting the correct and incorrect detections among the test samples, and computing the precision P, recall R, and comprehensive evaluation index F1 value, wherein the comprehensive evaluation index F1 value is calculated as:

F1 = 2 × P × R / (P + R);
step S9: and (4) carrying out named entity recognition on the rest unmarked warning situation texts in the warning situation database by using the multi-feature fusion named entity recognition model trained in the step (S7).
2. The method for recognizing the alarm text-oriented multi-feature fusion named entity as claimed in claim 1, wherein in step S2, the specific steps of constructing the alarm named entity recognition data set are as follows:
step S201: data cleaning, namely cleaning the data of the alarm condition text, and removing abnormal symbols, messy codes and repeated data;
step S202: entity definition, namely self-defining 6 types of entities, including names of alarm persons, addresses of alarm cases, case-related articles, case properties, amount of case-related property and license plate numbers of case-related objects;
step S203: entity marking, namely manually marking the warning situation text according to a self-defined 6-type entity by adopting a BIOES marking standard of word granularity;
step S204: and (3) data division, namely dividing the marked data into a training set, a verification set and a test set according to a certain proportion.
3. The method for recognizing the alarm text-oriented multi-feature fusion named entity as claimed in claim 2, wherein the step S3 of extracting the character features comprises the following specific steps:
step S301: training the text vectorization tool Word2Vec on a Baidu Encyclopedia corpus to obtain pre-trained word vectors, wherein the word vectors comprise character vectors and word vectors;
step S302: for each piece of data in the named entity identification data set constructed in the step S2, sequence definition is carried out on characters of each piece of data to obtain a text sequence S of each piece of data;
step S303: for each character in the text sequence S, a character vector corresponding to each character is determined as a character feature according to the word vector pre-trained in step S301.
4. The method for recognizing the alarm text-oriented multi-feature fusion named entity according to claim 3, wherein the step S4 of generating the pre-recognized tag feature specifically comprises the following steps:
step S401: defining different rules or dictionaries according to different entities, and preliminarily identifying the content of the warning situation text;
step S402: identifying the alarm-person entity, wherein the text before the alarm word in the alert text is marked as the alarm-person entity;
step S403: identifying address entities, constructing a suffix-word dictionary, and, through character string matching, marking the text that follows the indicator words and precedes the suffix words in the alert text as address entities;
step S404: identifying the entity of the involved case amount, and identifying the involved case amount in the warning situation text by using a regular expression;
step S405: identifying a vehicle license plate entity, namely identifying a license plate number entity in the warning text by using a regular expression;
step S406: identifying case-nature entities, wherein through analysis and statistics of the alert texts the case types are divided into 5 levels, with 354 case types in total; different types of cases use different vocabularies to describe the nature of the case, and a case-nature dictionary is constructed to match case-nature entities;
step S407: identifying the entity of the involved articles, constructing a dictionary of the involved articles, and identifying by adopting character string matching;
step S408: label representation, for the identified entity, the first word is represented by "B-label", the middle character is represented by "I-label", the last word is represented by "E-label", and the rest of the characters not matched by the rule are represented by label "O";
step S409: after the above rule and dictionary recognition, each character of the text sequence S has a corresponding label, yielding the label sequence L;
step S410: vectorization representation, namely constructing a tag embedded lookup table and carrying out vectorization representation on the identified tags; a vectorized representation of the tag is taken as a tag feature.
5. The method for recognizing the alarm text-oriented multi-feature fusion named entity as claimed in claim 4, wherein in the step S5, the step of obtaining the pinyin features of the characters comprises the steps of:
step S501: obtaining pinyin representations of different characters, wherein the tone of each character is placed behind the pinyin of the character;
step S502: vectorization expression, namely constructing a pinyin query table, vectorizing the pinyin, and taking the vectorization expression of the pinyin as the pinyin characteristics of the characters.
6. A method for recognizing a multi-feature fusion named entity oriented to an alert text as claimed in claim 5, wherein the multi-feature fusion in step S6 comprises the following steps:
step S601: the character features, the label features and the pinyin features of each character are fused, and three fusion modes are designed as follows:
1) The direct splicing and fusion mode is as follows: directly splicing the feature vectors in the three forms to form a final vector;
2) The adding and fusing mode is that the label characteristic, the pinyin characteristic and the character embedding dimension are the same, and then the values of the corresponding positions of each vector are directly added to form a final vector;
3) Splicing fusion after feature extraction, wherein the three features are extracted by separate bidirectional long short-term memory networks and the extracted features are spliced to form the final vector.
7. The method for recognizing the alert-text-oriented multi-feature fusion named entity of claim 6, wherein in step S7, the multi-feature fusion vector obtained in step S6 is fed into a BiLSTM-CRF model for training, specifically as follows:
step S701: updating a forgetting gate, an input gate and an output gate of the long-short term memory network, and updating a state unit according to the updated forgetting gate, the input gate and the output gate to generate a next implicit vector state;
step S702: obtaining the context vector of each character using a BiLSTM model, which processes the input sequence in both the forward and backward directions;
step S703: constructing a conditional random field, and capturing the dependency relationship among the labels to obtain a predicted label sequence;
and step S704, adjusting the model to the current optimal value by using the verification data set, and storing the current optimal model.
8. The method for recognizing the multi-feature fusion named entity oriented to the alert text according to claim 7, wherein in the step S8, model testing is performed, test set data is sent to the multi-feature fusion named entity recognition model to obtain a prediction tag, the prediction tag is compared with an actual tag, the number of correct and false detections in a test sample is calculated, and a recognition accuracy P, a recall rate R and a comprehensive evaluation index F1 value are obtained, and the specific steps are as follows:
step S801: sending the test set data to a trained multi-feature fusion model, and obtaining an identified label for each input sequence;
step S802: comparing the labels predicted by the model with the real labels, and counting the correct entity number predicted by the model, the total entity number predicted by the model and the total entity number in the data set;
step S803: and calculating the detection accuracy P, the recall rate R and the comprehensive evaluation index F1 value.
CN202211063791.3A 2022-09-01 2022-09-01 Alarm situation text-oriented multi-feature fusion named entity identification method Active CN115146644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211063791.3A CN115146644B (en) 2022-09-01 2022-09-01 Alarm situation text-oriented multi-feature fusion named entity identification method


Publications (2)

Publication Number Publication Date
CN115146644A true CN115146644A (en) 2022-10-04
CN115146644B CN115146644B (en) 2022-11-22

Family

ID=83416670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211063791.3A Active CN115146644B (en) 2022-09-01 2022-09-01 Alarm situation text-oriented multi-feature fusion named entity identification method

Country Status (1)

Country Link
CN (1) CN115146644B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165384A (en) * 2018-08-23 2019-01-08 成都四方伟业软件股份有限公司 A kind of name entity recognition method and device
CN112434520A (en) * 2020-11-11 2021-03-02 北京工业大学 Named entity recognition method and device and readable storage medium
CN113190656A (en) * 2021-05-11 2021-07-30 南京大学 Chinese named entity extraction method based on multi-label framework and fusion features
CN113536799A (en) * 2021-08-10 2021-10-22 西南交通大学 Medical named entity recognition modeling method based on fusion attention
CN114580416A (en) * 2022-03-01 2022-06-03 海南大学 Chinese named entity recognition method and device based on multi-view semantic feature fusion
US20220188520A1 (en) * 2019-03-26 2022-06-16 Benevolentai Technology Limited Name entity recognition with deep learning
CN114927177A (en) * 2022-05-27 2022-08-19 浙江工业大学 Medical entity identification method and system fusing Chinese medical field characteristics


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Yue et al., "Named Entity Recognition of Police Alert Texts Based on BERT", Journal of Computer Applications (《计算机应用》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050418A (en) * 2023-03-02 2023-05-02 浙江工业大学 Named entity identification method, device and medium based on fusion of multi-layer semantic features
CN116050418B (en) * 2023-03-02 2023-10-31 浙江工业大学 Named entity identification method, device and medium based on fusion of multi-layer semantic features

Also Published As

Publication number Publication date
CN115146644B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN110532542B (en) Invoice false invoice identification method and system based on positive case and unmarked learning
JP4568774B2 (en) How to generate templates used in handwriting recognition
Nguyen et al. Distinguishing antonyms and synonyms in a pattern-based neural network
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
US11055327B2 (en) Unstructured data parsing for structured information
CN108763510A (en) Intension recognizing method, device, equipment and storage medium
CN110826335A (en) Named entity identification method and device
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN115146644B (en) Alarm situation text-oriented multi-feature fusion named entity identification method
CN114756675A (en) Text classification method, related equipment and readable storage medium
CN116502628A (en) Multi-stage fusion text error correction method for government affair field based on knowledge graph
CN116127953A (en) Chinese spelling error correction method, device and medium based on contrast learning
CN115269834A (en) High-precision text classification method and device based on BERT
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN114925702A (en) Text similarity recognition method and device, electronic equipment and storage medium
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure
CN114764566A (en) Knowledge element extraction method for aviation field
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN111078874B (en) Foreign Chinese difficulty assessment method based on decision tree classification of random subspace
CN112347783A (en) Method for identifying types of alert condition record data events without trigger words
CN111523311A (en) Search intention identification method and device
CN115309899B (en) Method and system for identifying and storing specific content in text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant