CN112084783B - Entity identification method and system based on civil aviation non-civilized passengers - Google Patents

Entity identification method and system based on civil aviation non-civilized passengers Download PDF

Info

Publication number
CN112084783B
CN112084783B (application CN202011016160.7A)
Authority
CN
China
Prior art keywords
embedding
lstm
input
label
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011016160.7A
Other languages
Chinese (zh)
Other versions
CN112084783A (en)
Inventor
曹卫东
徐秀丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China filed Critical Civil Aviation University of China
Priority to CN202011016160.7A priority Critical patent/CN112084783B/en
Publication of CN112084783A publication Critical patent/CN112084783A/en
Application granted granted Critical
Publication of CN112084783B publication Critical patent/CN112084783B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses an entity identification method and system based on civil aviation non-civilized passengers, belonging to the technical field of civil aviation information processing, and characterized by comprising the following steps: first, preprocessing the acquired entity data; second, concatenating word embedding, position embedding and label embedding to enrich the input information; third, inputting into a Bi-LSTM neural network, in which a forward LSTM and a backward LSTM are combined to form the Bi-LSTM, so that the network can use past and future input information, automatically extract context features, and output a prediction score for each label; fourth, inputting the output of the neural network into a CRF layer (a Softmax layer alone could label the Bi-LSTM output, but the CRF layer models label transitions); and fifth, conducting named entity recognition experiments on the sampled data sets and evaluating them by precision, recall and F1 value. The application can more efficiently identify eight entity types in civil aviation non-civilized passenger records.

Description

Entity identification method and system based on civil aviation non-civilized passengers
Technical Field
The invention belongs to the technical field of civil aviation information processing, and particularly relates to an entity identification method and system based on civil aviation non-civilized passengers.
Background
Civil aviation, as a safe, fast and comfortable mode of transport, has always projected a high-end image. With the rapid development of the national economy, more and more of the public choose to travel by civil aviation. This is, however, accompanied by various uncivilized passenger behaviors, which seriously disturb civil aviation order and safe travel. In the domestic civil aviation passenger management field, the safety protection rules for public air passenger transport flights bring the behavior of non-civilized passengers into the legal system and provide a legal basis for airport public security departments to deal with such behavior. The work mainly comprises: analyzing the non-civilized passenger handling policies of different airlines and different flight types (international/domestic); analyzing and classifying the different dimensions and attributes of non-civilized passenger handling; and constructing a non-civilized passenger knowledge graph, which mainly contains descriptions of uncivilized behavior, uncivilized-behavior grades, the punishments adopted and the corresponding supervision rules. Typical cases that have occurred and produced processing results are continuously added, and a selectable, configurable and combinable flexible management and maintenance mode is realized for attributes such as the management scope and handling mode.
Named Entity Recognition (NER) [1] is a basic task of natural language processing whose goal is to find entities in text and mark their positions and categories. Bikel et al. first proposed an English named entity recognition method based on hidden Markov models; Liao et al. proposed a conditional random field-based model and adopted a semi-supervised learning algorithm for named entity recognition; Ratinov et al. [4] showed that training a word class model on unlabeled text can effectively improve the recognition efficiency of an NER system. Chinese named entity recognition has also received widespread attention. Zong et al. proposed a template-based mixture model for Chinese name disambiguation; Tian et al. proposed attribute-feature-based adaptive clustering for Chinese name disambiguation. In recent years, a large number of neural network-based models have been developed and have achieved good results; Zhang et al. recently proposed a model for Chinese named entity recognition using a lattice long short-term memory network (LSTM).
Disclosure of Invention
Technical problem
The method addresses the problem that a single civil aviation passenger text record contains multiple entities to be recognized. BIOES label embedding is concatenated into the input, context information is learned through forward and backward networks, and a conditional random field model is used as the decoding layer of the model. Dropout is applied at the input layer and the Bi-LSTM layer to effectively prevent overfitting. According to the analysis of the experimental results, the method can more efficiently identify eight entity types in civil aviation non-civilized passenger records.
Technical scheme
The invention aims to provide an entity identification method based on civil aviation non-civilized passengers, which comprises the following steps:
Step one: carry out data preprocessing on the acquired entity data, namely data extraction, data archiving and data cleaning. The entities are items with specific significance in the text; eight kinds of named entities are defined according to the characteristics of the civil aviation data, and entity labeling is carried out on them.
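As an illustration of the BIOES labeling used in step one, the sketch below tags a toy record at the character level. It is only a sketch: the eight entity categories defined by the patent are given in Table 1 (available only as an image), so the record, the spans and the category names "PER" and "ACT" used here are hypothetical placeholders.

def bioes_tag(tokens, spans):
    """Assign BIOES tags to a token list given (start, end, category) entity spans (end inclusive)."""
    tags = ["O"] * len(tokens)
    for start, end, cat in spans:
        if start == end:
            tags[start] = f"S-{cat}"
        else:
            tags[start] = f"B-{cat}"
            for i in range(start + 1, end):
                tags[i] = f"I-{cat}"
            tags[end] = f"E-{cat}"
    return tags

# Hypothetical record and spans; category names are placeholders, not the actual Table 1 categories.
tokens = list("旅客张某在航班上吸烟")
print(bioes_tag(tokens, [(2, 3, "PER"), (8, 9, "ACT")]))
# ['O', 'O', 'B-PER', 'E-PER', 'O', 'O', 'O', 'O', 'B-ACT', 'E-ACT']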
Step two: word embedding, position embedding and label embedding are connected in series, and input information is enriched.
Word embedding: using word embedding with random initialization for each word in the input sentence, the dimension of word embedding is dw
Position embedding: position embedding is used to encode the relative distance between each word and two target entities in the sentence. We believe that more useful information about relationships is hidden in words that are closer to the target entity, with the dimension d of the position embeddingp
Embedding a label: dimension of BIOES tag embedding is dt
After the three embeddings are concatenated together, a sentence is converted into a matrix X ═ w1,w2,...,wn]As an input representation, wherein the column vectors
Figure BDA0002699135370000021
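A minimal sketch of this step-two input representation, assuming PyTorch and example dimensions d_w = 100, d_p = 20, d_t = 20 (the patent does not fix these values here): three randomly initialized embedding tables are looked up and concatenated, so each column vector w_i has length d_w + d_p + d_t.

import torch
import torch.nn as nn

# Assumed example dimensions and vocabulary sizes; the patent only names d_w, d_p, d_t.
d_w, d_p, d_t = 100, 20, 20
vocab_size, max_dist, num_tags = 3000, 200, 33

word_emb = nn.Embedding(vocab_size, d_w)   # randomly initialised word embedding
pos_emb = nn.Embedding(max_dist, d_p)      # relative-distance (position) embedding
tag_emb = nn.Embedding(num_tags, d_t)      # BIOES label embedding

def build_input(word_ids, pos_ids, tag_ids):
    """Concatenate the three embeddings; each w_i has length d_w + d_p + d_t."""
    return torch.cat([word_emb(word_ids), pos_emb(pos_ids), tag_emb(tag_ids)], dim=-1)

n = 10  # sentence length
X = build_input(torch.randint(0, vocab_size, (n,)),
                torch.randint(0, max_dist, (n,)),
                torch.randint(0, num_tags, (n,)))
print(X.shape)  # torch.Size([10, 140])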
Step three: input into a Bi-LSTM neural network. The forward LSTM combined with the backward LSTM forms a Bi-LSTM that can efficiently use past and future input information and automatically extract context features, outputting a predicted score for each label. Wherein the LSTM model is formed by the input word x at time ttCell state ctTemporary cell state
Figure BDA0002699135370000022
Hidden layer state htForgetting door ftMemory door itOutput gate otAnd (4) forming.
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
The Bi-LSTM hidden state at time t concatenates the hidden states of the forward and backward LSTMs, h_t = [h_t(forward); h_t(backward)].
Here σ and tanh denote two different activation functions; x_t is the input vector at time t, and h_t is a state vector that stores the valid information at time t. U denotes the weight matrix applied to the input vector, W denotes the weight matrix applied to the hidden state, and b denotes the bias of each gate.
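A sketch of this step-three encoder under assumed hyper-parameters (hidden size 128, dropout 0.5, 33 BIOES labels for eight categories, none of which are fixed by the text above): a bidirectional LSTM reads the concatenated embeddings and a linear layer turns each hidden state into one prediction score per label; Dropout is applied to the Bi-LSTM output as mentioned in the technical problem section.

import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bi-LSTM over the concatenated embeddings, emitting one score per BIOES label."""
    def __init__(self, input_dim=140, hidden_dim=128, num_labels=33, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)              # forward + backward LSTM
        self.dropout = nn.Dropout(dropout)                   # dropout on the Bi-LSTM output
        self.score = nn.Linear(2 * hidden_dim, num_labels)   # per-label prediction scores

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        h, _ = self.lstm(x)                    # h: (batch, seq_len, 2 * hidden_dim)
        return self.score(self.dropout(h))     # (batch, seq_len, num_labels)

emissions = BiLSTMEncoder()(torch.randn(1, 10, 140))
print(emissions.shape)                         # torch.Size([1, 10, 33])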
Step four: the output of the neural network is input into a CRF layer, a Bi-LSTM can be marked by using a Softmax layer, and since the Softmax layer marks each position respectively, an inappropriate label sequence can be obtained, so that the CRF layer needs to be added, sentence-level label information can be used, and excessive behaviors of each two different labels can be modeled.
For an input text sentence X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the corresponding predicted output label sequence, where y_i denotes the index of the predicted NER label. The score of a predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix. The partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
where Y_X denotes the set of all possible label sequences. The probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
The training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
The output sequence with the maximum score, Y*, is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
During decoding, the Viterbi algorithm is used to evaluate all possible label sequences and find the highest-scoring one.
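The following NumPy sketch mirrors the step-four CRF formulas in simplified form (start/stop transitions are omitted, and matrix sizes are illustrative): sequence_score computes S(X, Y) from the emission matrix P and the transition matrix T, and viterbi performs the dynamic-programming search for the highest-scoring label sequence Y*.

import numpy as np

def sequence_score(P, T, y):
    """S(X, Y): sum of emission scores P[i, y_i] and transition scores T[y_{i-1}, y_i]."""
    emit = sum(P[i, y[i]] for i in range(len(y)))
    trans = sum(T[y[i - 1], y[i]] for i in range(1, len(y)))
    return emit + trans

def viterbi(P, T):
    """Return the label sequence Y* maximising S(X, Y) by dynamic programming."""
    n, k = P.shape
    score = P[0].copy()                        # best score of any path ending in each label
    back = np.zeros((n, k), dtype=int)         # back-pointers
    for i in range(1, n):
        cand = score[:, None] + T + P[i]       # rows: previous label, cols: current label
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]

P = np.random.randn(6, 5)                      # emission scores: 6 tokens, 5 labels
T = np.random.randn(5, 5)                      # transition scores between labels
y_star = viterbi(P, T)
print(y_star, sequence_score(P, T, y_star))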
Step five: named entity identification experiments were performed using sampled data sets and the efficiency of the experiments was assessed by accuracy, recall, and F1 values.
A second object of the present invention is to provide an entity identification system based on civil aviation non-civilized passengers, comprising:
a data preprocessing module: and carrying out data preprocessing on the acquired entity data. And extracting data, filing the data and cleaning the data. The entities are entities with specific significance in the text, eight named entities are defined according to the characteristics of civil aviation data, and entity labeling is carried out on the named entities.
An embedding module: word embedding, position embedding and label embedding are concatenated to enrich the input information.
Word embedding: a randomly initialized word embedding is used for each word in the input sentence; the dimension of the word embedding is d_w.
Position embedding: position embedding encodes the relative distance between each word and the two target entities in the sentence. We assume that words closer to a target entity carry more useful information about the relation; the dimension of the position embedding is d_p.
Label embedding: the dimension of the BIOES label embedding is d_t.
After the three embeddings are concatenated, a sentence is converted into a matrix X = [w_1, w_2, ..., w_n] used as the input representation, where each column vector w_i ∈ ℝ^(d_w + d_p + d_t).
A neural network module: inputs into a Bi-LSTM neural network. The forward LSTM and the backward LSTM are combined to form a Bi-LSTM, which can efficiently use past and future input information, automatically extract context features, and output a prediction score for each label. The LSTM model consists of the input vector x_t at time t, the cell state c_t, the temporary cell state c̃_t, the hidden state h_t, the forget gate f_t, the memory (input) gate i_t and the output gate o_t:
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
The Bi-LSTM hidden state at time t concatenates the hidden states of the forward and backward LSTMs, h_t = [h_t(forward); h_t(backward)].
Here σ and tanh denote two different activation functions; x_t is the input vector at time t, and h_t is a state vector that stores the valid information at time t. U denotes the weight matrix applied to the input vector, W denotes the weight matrix applied to the hidden state, and b denotes the bias of each gate.
An information processing module: the output of the neural network is input into a CRF layer. The Bi-LSTM output could be labeled directly with a Softmax layer, but because the Softmax layer labels each position independently, it may produce an inappropriate label sequence. The CRF layer is therefore added: it can use sentence-level label information and model the transition behavior between every two different labels.
For an input text sentence X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the corresponding predicted output label sequence, where y_i denotes the index of the predicted NER label. The score of a predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix. The partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
where Y_X denotes the set of all possible label sequences. The probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
The training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
The output sequence with the maximum score, Y*, is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
During decoding, the Viterbi algorithm is used to evaluate all possible label sequences and find the highest-scoring one.
An evaluation module: carries out named entity recognition experiments on the sampled data sets and evaluates the performance of the experiments by precision, recall and F1 value.
A third object of the present invention is to provide a computer program for implementing the above entity identification method based on civil aviation non-civilized passengers.
A fourth object of the present invention is to provide an information data processing terminal for implementing the above entity identification method based on civil aviation non-civilized passengers.
A fifth object of the present patent is to provide a computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the above-mentioned civil aviation non-civilized passenger-based entity identification method.
The invention has the advantages and positive effects that:
by adopting the technical scheme, the invention has the following technical effects:
the civil aviation non-civilized traveler entity identification method adopts a civil aviation traveler random record text crawled from credit China and aviation coordination network, defines eight kinds of named entities, carries out BIOES marking on the named entities, then embeds connecting words, embeds positions and embeds BIOES as input, not only considers the characteristic that one text record of the civil aviation non-civilized traveler has a plurality of attributes, but also fully utilizes various information of the text record, and establishes the civil aviation non-civilized traveler entity identification method. And finally, introducing a neural network and a CRF for training and identifying. The output of the model is the entities in the text sequence and their corresponding tags. Compared with the traditional method, the model accuracy and the recall rate are greatly improved, and the method has important significance for constructing the civil aviation civilized traveler knowledge map.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the present invention;
fig. 2 is a block diagram of a preferred embodiment of the present invention.
Detailed Description
In order to further understand the contents, features and effects of the present invention, the following embodiments are illustrated and described in detail with reference to the accompanying drawings.
The method divides the identified entities into eight types, labels the entities using the BIOES scheme, then concatenates character embedding, position embedding and label embedding, and identifies the entity types by using a bidirectional long short-term memory network and a conditional random field to obtain the entity labels.
Referring to fig. 1 and fig. 2, the specific scheme is:
an entity identification method based on civil aviation non-civilized passengers aims at various non-compliance behavior problems of calling and answering on-board of civil aviation passengers, disturbing other passengers and the like, and identifies entities in civil aviation passenger data. Firstly, defining eight named entities according to the characteristics of civil aviation passenger information; secondly, labeling the entity by using BIOES (B-begin, I-side, O-outside, E-end, S-single) label; then embedding a connecting word, embedding a position and embedding BIOES, and inputting a bidirectional long-short term memory network (Bi-LSTM); and finally, embedding a Conditional Random Field (CRF) into the neural network, and processing the output of the neural network to obtain an entity tag. The system for realizing the method comprises data preprocessing, an input embedding module, a Bi-LSTM module and a CRF module. According to the invention, the characters BIOES label embedding and the characters embedding in the named entity recognition task are connected in series in position embedding, so that the input expression is enriched, and the entity recognition efficiency is improved. The method can be applied to civil aviation non-civilized passenger data, and can accurately identify the entity information of the civil aviation non-civilized passengers.
The entity identification process is shown in fig. 1 and comprises the following steps:
Step one: carry out data preprocessing on the acquired entity data, namely data extraction, data archiving and data cleaning. According to the experimental target, the entity information of non-civilized passengers is divided into eight categories, and entity labeling is carried out on them, as shown in Tables 1 and 2.
TABLE 1 Boson entity categories of civil aviation passenger information
(the table content is provided as an image in the original publication and is not reproduced here)
TABLE 2 Entity labeling of civil aviation non-civilized passengers
(the table content is provided as an image in the original publication and is not reproduced here)
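For illustration, with eight entity categories and the BIOES scheme the model has to distinguish 8 × 4 + 1 = 33 labels (the extra one being O). The category names below are hypothetical placeholders, since Tables 1 and 2 are only available as images in the original publication.

CATEGORIES = ["PER", "FLT", "TIME", "LOC", "ACT", "LEVEL", "PUNISH", "RULE"]  # placeholder names
PREFIXES = ["B", "I", "E", "S"]

labels = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in PREFIXES]
label2id = {lab: i for i, lab in enumerate(labels)}
print(len(labels), label2id["B-PER"])   # 33 1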
Step two: word embedding, position embedding and label embedding are connected in series, and input information is enriched.
Word embedding: using word embedding with random initialization for each word in the input sentence, the dimension of word embedding is dw
Position embedding: position embedding is used to encode the relative distance between each word and two target entities in the sentence. We believe that more useful information about relationships is hidden in words that are closer to the target entity, with the dimension d of the position embeddingp
Embedding a label: dimension of BIOES tag embedding is dt
After the three embeddings are concatenated together, a sentence is converted into a matrix X ═ w1,w2,...,wn]As an input representation, wherein the column vectors
Figure BDA0002699135370000081
Step three: input into a Bi-LSTM neural network. The forward LSTM and backward LSTM combine to form a Bi-LSTM, which network is efficientUsing past and future input information and automatically extracting context features, a predicted score for each label is output. Wherein the LSTM model is formed by the input word x at time ttCell state ctTemporary cell state
Figure BDA0002699135370000085
Hidden layer state htForgetting door ftMemory door itOutput gate otAnd (4) forming.
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
The Bi-LSTM hidden state at time t concatenates the hidden states of the forward and backward LSTMs, h_t = [h_t(forward); h_t(backward)].
Here σ and tanh denote two different activation functions; x_t is the input vector at time t, and h_t is a state vector that stores the valid information at time t. U denotes the weight matrix applied to the input vector, W denotes the weight matrix applied to the hidden state, and b denotes the bias of each gate.
Step four: the output of the neural network is input into a CRF layer, a Bi-LSTM can be marked by using a Softmax layer, and since the Softmax layer marks each position respectively, an inappropriate label sequence can be obtained, so that the CRF layer needs to be added, sentence-level label information can be used, and excessive behaviors of each two different labels can be modeled.
For an input text sentence X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the corresponding predicted output label sequence, where y_i denotes the index of the predicted NER label. The score of a predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix. The partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
where Y_X denotes the set of all possible label sequences. The probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
The training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
The output sequence with the maximum score, Y*, is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
During decoding, the Viterbi algorithm is used to evaluate all possible label sequences and find the highest-scoring one.
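Complementing the decoding sketch given in the disclosure, the training objective log(P(Y*|X)) = S(X, Y*) − log(Z) needs the partition function Z, which is computed with the forward algorithm rather than by enumerating all label sequences. A NumPy/SciPy sketch under the same simplifications (no start/stop transitions; matrix sizes illustrative):

import numpy as np
from scipy.special import logsumexp

def log_partition(P, T):
    """log Z via the forward algorithm over emission scores P (n x k) and transitions T (k x k)."""
    n, k = P.shape
    alpha = P[0].copy()                                    # log-scores of length-1 prefixes
    for i in range(1, n):
        # alpha'[j] = logsumexp over prev of (alpha[prev] + T[prev, j]) + P[i, j]
        alpha = logsumexp(alpha[:, None] + T, axis=0) + P[i]
    return logsumexp(alpha)

def neg_log_likelihood(P, T, y):
    """Training loss -log P(Y*|X) = log Z - S(X, Y*) for the gold sequence y."""
    s = sum(P[i, y[i]] for i in range(len(y)))
    s += sum(T[y[i - 1], y[i]] for i in range(1, len(y)))
    return log_partition(P, T) - s

P = np.random.randn(6, 5)          # 6 tokens, 5 labels (illustrative)
T = np.random.randn(5, 5)
print(neg_log_likelihood(P, T, [0, 1, 2, 1, 0, 3]))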
Step five: named entity identification experiments were performed using sampled data sets and the efficiency of the experiments was assessed by accuracy, recall, and F1 values.
An entity identification system based on civil aviation non-civilized passengers, comprising:
a data preprocessing module: and carrying out data preprocessing on the acquired entity data. And extracting data, filing the data and cleaning the data. The entities are entities with specific significance in the text, eight named entities are defined according to the characteristics of civil aviation data, and entity labeling is carried out on the named entities.
An embedding module: word embedding, position embedding and label embedding are concatenated to enrich the input information.
Word embedding: a randomly initialized word embedding is used for each word in the input sentence; the dimension of the word embedding is d_w.
Position embedding: position embedding encodes the relative distance between each word and the two target entities in the sentence. We assume that words closer to a target entity carry more useful information about the relation; the dimension of the position embedding is d_p.
Label embedding: the dimension of the BIOES label embedding is d_t.
After the three embeddings are concatenated, a sentence is converted into a matrix X = [w_1, w_2, ..., w_n] used as the input representation, where each column vector w_i ∈ ℝ^(d_w + d_p + d_t).
A neural network module: inputs into a Bi-LSTM neural network. The forward LSTM and the backward LSTM are combined to form a Bi-LSTM, which can efficiently use past and future input information, automatically extract context features, and output a prediction score for each label. The LSTM model consists of the input vector x_t at time t, the cell state c_t, the temporary cell state c̃_t, the hidden state h_t, the forget gate f_t, the memory (input) gate i_t and the output gate o_t:
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
The Bi-LSTM hidden state at time t concatenates the hidden states of the forward and backward LSTMs, h_t = [h_t(forward); h_t(backward)].
Here σ and tanh denote two different activation functions; x_t is the input vector at time t, and h_t is a state vector that stores the valid information at time t. U denotes the weight matrix applied to the input vector, W denotes the weight matrix applied to the hidden state, and b denotes the bias of each gate.
An information processing module: the output of the neural network is input into a CRF layer. The Bi-LSTM output could be labeled directly with a Softmax layer, but because the Softmax layer labels each position independently, it may produce an inappropriate label sequence. The CRF layer is therefore added: it can use sentence-level label information and model the transition behavior between every two different labels.
For an input text sentence X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the corresponding predicted output label sequence, where y_i denotes the index of the predicted NER label. The score of a predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix. The partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
where Y_X denotes the set of all possible label sequences. The probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
The training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
The output sequence with the maximum score, Y*, is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
During decoding, the Viterbi algorithm is used to evaluate all possible label sequences and find the highest-scoring one.
An evaluation module: carries out named entity recognition experiments on the sampled data sets and evaluates the performance of the experiments by precision, recall and F1 value.
A computer program for implementing the above entity identification method based on civil aviation non-civilized passengers.
An information data processing terminal for implementing the above entity identification method based on civil aviation non-civilized passengers.
A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the above-described civil aviation non-civilized passenger-based entity identification method.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, it can be realized as a computer program product that includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced, in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the scope of the technical solution of the present invention.

Claims (4)

1. An entity identification method based on civil aviation non-civilized passengers is characterized by comprising the following steps:
firstly, carrying out data preprocessing on the acquired entity data, namely extracting, archiving and cleaning the data; according to the experimental target, dividing the entity information of the non-civilized passengers into 8 categories, and carrying out entity labeling on the categories;
secondly, character embedding, position embedding and label embedding are connected in series, so that input information is enriched; the method specifically comprises the following steps:
word embedding: using a randomly initialized word embedding for each word in the input sentence, the dimension of the word embedding being d_w;
position embedding: encoding the relative distance between each word and the two target entities in the sentence using position embedding, the dimension of which is d_p;
label embedding: the dimension of the BIOES label embedding being d_t;
concatenating the three embeddings and converting a sentence into a matrix X = [w_1, w_2, ..., w_n] as the input representation, wherein each column vector w_i ∈ ℝ^(d_w + d_p + d_t);
inputting into a Bi-LSTM neural network, wherein a forward LSTM and a backward LSTM are combined to form the Bi-LSTM, and the Bi-LSTM network uses past and future input information, automatically extracts context features, and outputs a prediction score for each label, the LSTM model consisting of the input vector x_t at time t, the cell state c_t, the temporary cell state c̃_t, the hidden layer state h_t, the forget gate f_t, the memory gate i_t and the output gate o_t;
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
h_t = [h_t(forward); h_t(backward)], the Bi-LSTM hidden state concatenating the forward and backward LSTM hidden states;
σ and tanh represent two different activation functions, x_t is the input vector at time t, h_t is the hidden layer state, U represents the weight matrix of the input vector, W represents the weight matrix of the hidden state, and b represents the bias of each gate;
inputting the output of the neural network into a CRF layer; the Bi-LSTM output could be labeled with a Softmax layer, but because the Softmax layer labels each position independently it may produce an inappropriate label sequence, so the CRF layer uses sentence-level label information and models the transition behavior between every two different labels;
an input text sentence is X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the predicted output label sequence corresponding thereto, and y_i denotes the index of the predicted NER label; the score of the predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix; the partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
wherein Y_X represents all possible label sequences, and the probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
the training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
the output sequence Y* with the maximum score is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
Performing partition function evaluation on all possible labels by using a Viterbi algorithm in a decoding process;
and step five, carrying out named entity recognition experiments by using the sampled data sets, and evaluating the performance of the experiments through precision, recall and F1 values.
2. An entity identification system based on civil aviation non-civilized passengers, characterized by comprising:
a data preprocessing module: carrying out data preprocessing on the acquired entity data; extracting, archiving and cleaning the data; according to the experimental target, dividing the entity information of the non-civilized passengers into 8 categories, and carrying out entity labeling on the categories;
embedding a module: word embedding, position embedding and label embedding are connected in series, so that input information is enriched; the method specifically comprises the following steps:
word embedding: using a randomly initialized word embedding for each word in the input sentence, the dimension of the word embedding being d_w;
position embedding: encoding the relative distance between each word and the two target entities in the sentence using position embedding, the dimension of which is d_p;
label embedding: the dimension of the BIOES label embedding being d_t;
concatenating the three embeddings and converting a sentence into a matrix X = [w_1, w_2, ..., w_n] as the input representation, wherein each column vector w_i ∈ ℝ^(d_w + d_p + d_t);
a neural network module: inputting into a Bi-LSTM neural network, wherein the forward LSTM and the backward LSTM are combined to form the Bi-LSTM, and the Bi-LSTM network can use past and future input information, automatically extract context features and output a prediction score for each label, the LSTM model consisting of the input vector x_t at time t, the cell state c_t, the temporary cell state c̃_t, the hidden layer state h_t, the forget gate f_t, the memory gate i_t and the output gate o_t;
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
h_t = [h_t(forward); h_t(backward)], the Bi-LSTM hidden state concatenating the forward and backward LSTM hidden states;
σ and tanh represent two different activation functions, x_t is the input vector at time t, h_t is the hidden layer state, U represents the weight matrix of the input vector, W represents the weight matrix of the hidden state, and b represents the bias of each gate;
an information processing module: inputting the output of the neural network into a CRF layer; the Bi-LSTM output could be labeled with a Softmax layer, but because the Softmax layer labels each position independently it may produce an improper label sequence, so the CRF layer is added, which uses sentence-level label information and models the transition behavior between every two different labels;
an input text sentence is X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the predicted output label sequence corresponding thereto, and y_i denotes the index of the predicted NER label; the score of the predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix; the partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
wherein Y_X represents all possible label sequences, and the probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
the training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
the output sequence Y* with the maximum score is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
Performing partition function evaluation on all possible labels by using a Viterbi algorithm in a decoding process;
an evaluation module: carrying out named entity recognition experiments by using the sampled data set, and evaluating the performance of the experiments through precision, recall and F1 values.
3. An information data processing terminal for implementing the civil aviation non-civilized passenger-based entity identification method of claim 1.
4. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the civil aviation non-civilized passenger-based entity identification method of claim 1.
CN202011016160.7A 2020-09-24 2020-09-24 Entity identification method and system based on civil aviation non-civilized passengers Active CN112084783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011016160.7A CN112084783B (en) 2020-09-24 2020-09-24 Entity identification method and system based on civil aviation non-civilized passengers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011016160.7A CN112084783B (en) 2020-09-24 2020-09-24 Entity identification method and system based on civil aviation non-civilized passengers

Publications (2)

Publication Number Publication Date
CN112084783A CN112084783A (en) 2020-12-15
CN112084783B true CN112084783B (en) 2022-04-12

Family

ID=73739018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011016160.7A Active CN112084783B (en) 2020-09-24 2020-09-24 Entity identification method and system based on civil aviation non-civilized passengers

Country Status (1)

Country Link
CN (1) CN112084783B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536182A (en) * 2021-07-12 2021-10-22 广州万孚生物技术股份有限公司 Method and device for generating long text webpage, electronic equipment and storage medium
CN113724882A (en) * 2021-08-30 2021-11-30 康键信息技术(深圳)有限公司 Method, apparatus, device and medium for constructing user portrait based on inquiry session

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN109145286A (en) * 2018-07-02 2019-01-04 昆明理工大学 Based on BiLSTM-CRF neural network model and merge the Noun Phrase Recognition Methods of Vietnamese language feature
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
CN111090987A (en) * 2019-12-27 2020-05-01 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN111178074A (en) * 2019-12-12 2020-05-19 天津大学 Deep learning-based Chinese named entity recognition method
CN111191452A (en) * 2019-12-24 2020-05-22 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway text named entity recognition method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11574122B2 (en) * 2018-08-23 2023-02-07 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN109871535B (en) * 2019-01-16 2020-01-10 四川大学 French named entity recognition method based on deep neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN109145286A (en) * 2018-07-02 2019-01-04 昆明理工大学 Based on BiLSTM-CRF neural network model and merge the Noun Phrase Recognition Methods of Vietnamese language feature
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
CN111178074A (en) * 2019-12-12 2020-05-19 天津大学 Deep learning-based Chinese named entity recognition method
CN111191452A (en) * 2019-12-24 2020-05-22 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway text named entity recognition method and device
CN111090987A (en) * 2019-12-27 2020-05-01 北京百度网讯科技有限公司 Method and apparatus for outputting information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A Method of Chinese Tourism Named Entity Recognition Based on BBLC Model";Leyi Xue etc.;《2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation》;20200409;全文 *

Also Published As

Publication number Publication date
CN112084783A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN111709241B (en) Named entity identification method oriented to network security field
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN110532557B (en) Unsupervised text similarity calculation method
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN112560478B (en) Chinese address Roberta-BiLSTM-CRF coupling analysis method using semantic annotation
CN110532381B (en) Text vector acquisition method and device, computer equipment and storage medium
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111291195A (en) Data processing method, device, terminal and readable storage medium
CN112084783B (en) Entity identification method and system based on civil aviation non-civilized passengers
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN112084779B (en) Entity acquisition method, device, equipment and storage medium for semantic recognition
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
CN114997288A (en) Design resource association method
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN113158659A (en) Case-related property calculation method based on judicial text
Zhu et al. Design of knowledge graph retrieval system for legal and regulatory framework of multilevel latent semantic indexing
CN113420119B (en) Intelligent question-answering method, device, equipment and storage medium based on knowledge card
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant