CN112084783B - Entity identification method and system based on civil aviation non-civilized passengers - Google Patents

Entity identification method and system based on civil aviation non-civilized passengers Download PDF

Info

Publication number
CN112084783B
CN112084783B (application CN202011016160.7A)
Authority
CN
China
Prior art keywords
embedding
lstm
input
label
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011016160.7A
Other languages
Chinese (zh)
Other versions
CN112084783A (en)
Inventor
曹卫东
徐秀丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China filed Critical Civil Aviation University of China
Priority to CN202011016160.7A priority Critical patent/CN112084783B/en
Publication of CN112084783A publication Critical patent/CN112084783A/en
Application granted granted Critical
Publication of CN112084783B publication Critical patent/CN112084783B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses an entity identification method and system based on civil aviation non-civilized passengers, belonging to the technical field of civil aviation information processing, and characterized by comprising the following steps: first, preprocessing the acquired entity data; second, concatenating word embedding, position embedding and label embedding to enrich the input information; third, inputting into a Bi-LSTM neural network, in which a forward LSTM and a backward LSTM are combined to form the Bi-LSTM, so that the network can use past and future input information, automatically extract context features, and output a prediction score for each label; fourth, inputting the output of the neural network into a CRF layer (a Softmax layer alone could label the Bi-LSTM output, but the CRF layer models label transitions); and fifth, conducting named entity recognition experiments on the sampled data sets and evaluating them by precision, recall and F1 value. The application can more efficiently identify eight entity types in civil aviation non-civilized passenger records.

Description

Entity identification method and system based on civil aviation non-civilized passengers
Technical Field
The invention belongs to the technical field of civil aviation information processing, and particularly relates to an entity identification method and system based on civil aviation non-civilized passengers.
Background
Civil aviation, as a safe, fast and comfortable mode of transport, has always projected a high-end image. With the rapid development of the national economy, more and more of the public choose to travel by civil aviation. This is, however, accompanied by various uncivilized passenger behaviors, which seriously disturb civil aviation order and safe travel. In the domestic civil aviation passenger management field, the safety protection rules for public air passenger transport flights bring the behavior of non-civilized passengers into the legal system and provide a legal basis for airport public security departments to deal with such behavior. The work mainly comprises: analyzing the non-civilized passenger handling policies of different airlines and different flight types (international/domestic); analyzing and classifying the different dimensions and attributes of non-civilized passenger handling; and constructing a non-civilized passenger knowledge graph, which mainly contains descriptions of uncivilized behavior, uncivilized-behavior grades, the punishments adopted and the corresponding supervision rules. Typical cases that have occurred and produced processing results are continuously added, and a selectable, configurable and combinable flexible management and maintenance mode is realized for attributes such as the management scope and handling mode.
Named Entity Recognition (NER) [1] is a basic task of natural language processing whose goal is to find entities in text and mark their positions and categories. Bikel et al. first proposed an English named entity recognition method based on hidden Markov models; Liao et al. proposed a conditional random field-based model and adopted a semi-supervised learning algorithm for named entity recognition; Ratinov et al. [4] showed that training a word class model on unlabeled text can effectively improve the recognition efficiency of an NER system. Chinese named entity recognition has also received widespread attention. Zong et al. proposed a template-based mixture model for Chinese name disambiguation; Tian et al. proposed attribute-feature-based adaptive clustering for Chinese name disambiguation. In recent years, a large number of neural network-based models have been developed and have achieved good results; Zhang et al. recently proposed a model for Chinese named entity recognition using a lattice long short-term memory network (LSTM).
Disclosure of Invention
Technical problem
The method addresses the problem that a single civil aviation passenger text record contains multiple entities to be recognized. BIOES label embedding is concatenated into the input, context information is learned through forward and backward networks, and a conditional random field model is used as the decoding layer of the model. Dropout is applied at the input layer and the Bi-LSTM layer to effectively prevent overfitting. According to the analysis of the experimental results, the method can more efficiently identify eight entity types in civil aviation non-civilized passenger records.
Technical scheme
The invention aims to provide an entity identification method based on civil aviation non-civilized passengers, which comprises the following steps:
Step one: carry out data preprocessing on the acquired entity data, namely data extraction, data archiving and data cleaning. The entities are items with specific significance in the text; eight kinds of named entities are defined according to the characteristics of the civil aviation data, and entity labeling is carried out on them.
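As an illustration of the BIOES labeling used in step one, the sketch below tags a toy record at the character level. It is only a sketch: the eight entity categories defined by the patent are given in Table 1 (available only as an image), so the record, the spans and the category names "PER" and "ACT" used here are hypothetical placeholders.

def bioes_tag(tokens, spans):
    """Assign BIOES tags to a token list given (start, end, category) entity spans (end inclusive)."""
    tags = ["O"] * len(tokens)
    for start, end, cat in spans:
        if start == end:
            tags[start] = f"S-{cat}"
        else:
            tags[start] = f"B-{cat}"
            for i in range(start + 1, end):
                tags[i] = f"I-{cat}"
            tags[end] = f"E-{cat}"
    return tags

# Hypothetical record and spans; category names are placeholders, not the actual Table 1 categories.
tokens = list("旅客张某在航班上吸烟")
print(bioes_tag(tokens, [(2, 3, "PER"), (8, 9, "ACT")]))
# ['O', 'O', 'B-PER', 'E-PER', 'O', 'O', 'O', 'O', 'B-ACT', 'E-ACT']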
Step two: word embedding, position embedding and label embedding are connected in series, and input information is enriched.
Word embedding: using word embedding with random initialization for each word in the input sentence, the dimension of word embedding is dw
Position embedding: position embedding is used to encode the relative distance between each word and two target entities in the sentence. We believe that more useful information about relationships is hidden in words that are closer to the target entity, with the dimension d of the position embeddingp
Embedding a label: dimension of BIOES tag embedding is dt
After the three embeddings are concatenated together, a sentence is converted into a matrix X ═ w1,w2,...,wn]As an input representation, wherein the column vectors
Figure BDA0002699135370000021
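A minimal sketch of this step-two input representation, assuming PyTorch and example dimensions d_w = 100, d_p = 20, d_t = 20 (the patent does not fix these values here): three randomly initialized embedding tables are looked up and concatenated, so each column vector w_i has length d_w + d_p + d_t.

import torch
import torch.nn as nn

# Assumed example dimensions and vocabulary sizes; the patent only names d_w, d_p, d_t.
d_w, d_p, d_t = 100, 20, 20
vocab_size, max_dist, num_tags = 3000, 200, 33

word_emb = nn.Embedding(vocab_size, d_w)   # randomly initialised word embedding
pos_emb = nn.Embedding(max_dist, d_p)      # relative-distance (position) embedding
tag_emb = nn.Embedding(num_tags, d_t)      # BIOES label embedding

def build_input(word_ids, pos_ids, tag_ids):
    """Concatenate the three embeddings; each w_i has length d_w + d_p + d_t."""
    return torch.cat([word_emb(word_ids), pos_emb(pos_ids), tag_emb(tag_ids)], dim=-1)

n = 10  # sentence length
X = build_input(torch.randint(0, vocab_size, (n,)),
                torch.randint(0, max_dist, (n,)),
                torch.randint(0, num_tags, (n,)))
print(X.shape)  # torch.Size([10, 140])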
Step three: input into a Bi-LSTM neural network. The forward LSTM combined with the backward LSTM forms a Bi-LSTM that can efficiently use past and future input information and automatically extract context features, outputting a predicted score for each label. Wherein the LSTM model is formed by the input word x at time ttCell state ctTemporary cell state
Figure BDA0002699135370000022
Hidden layer state htForgetting door ftMemory door itOutput gate otAnd (4) forming.
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
The Bi-LSTM hidden state at time t concatenates the hidden states of the forward and backward LSTMs, h_t = [h_t(forward); h_t(backward)].
Here σ and tanh denote two different activation functions; x_t is the input vector at time t, and h_t is a state vector that stores the valid information at time t. U denotes the weight matrix applied to the input vector, W denotes the weight matrix applied to the hidden state, and b denotes the bias of each gate.
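A sketch of this step-three encoder under assumed hyper-parameters (hidden size 128, dropout 0.5, 33 BIOES labels for eight categories, none of which are fixed by the text above): a bidirectional LSTM reads the concatenated embeddings and a linear layer turns each hidden state into one prediction score per label; Dropout is applied to the Bi-LSTM output as mentioned in the technical problem section.

import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bi-LSTM over the concatenated embeddings, emitting one score per BIOES label."""
    def __init__(self, input_dim=140, hidden_dim=128, num_labels=33, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)              # forward + backward LSTM
        self.dropout = nn.Dropout(dropout)                   # dropout on the Bi-LSTM output
        self.score = nn.Linear(2 * hidden_dim, num_labels)   # per-label prediction scores

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        h, _ = self.lstm(x)                    # h: (batch, seq_len, 2 * hidden_dim)
        return self.score(self.dropout(h))     # (batch, seq_len, num_labels)

emissions = BiLSTMEncoder()(torch.randn(1, 10, 140))
print(emissions.shape)                         # torch.Size([1, 10, 33])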
Step four: the output of the neural network is input into a CRF layer, a Bi-LSTM can be marked by using a Softmax layer, and since the Softmax layer marks each position respectively, an inappropriate label sequence can be obtained, so that the CRF layer needs to be added, sentence-level label information can be used, and excessive behaviors of each two different labels can be modeled.
For an input text sentence X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the corresponding predicted output label sequence, where y_i denotes the index of the predicted NER label. The score of a predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix. The partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
where Y_X denotes the set of all possible label sequences. The probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
The training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
The output sequence with the maximum score, Y*, is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
During decoding, the Viterbi algorithm is used to evaluate all possible label sequences and find the highest-scoring one.
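The following NumPy sketch mirrors the step-four CRF formulas in simplified form (start/stop transitions are omitted, and matrix sizes are illustrative): sequence_score computes S(X, Y) from the emission matrix P and the transition matrix T, and viterbi performs the dynamic-programming search for the highest-scoring label sequence Y*.

import numpy as np

def sequence_score(P, T, y):
    """S(X, Y): sum of emission scores P[i, y_i] and transition scores T[y_{i-1}, y_i]."""
    emit = sum(P[i, y[i]] for i in range(len(y)))
    trans = sum(T[y[i - 1], y[i]] for i in range(1, len(y)))
    return emit + trans

def viterbi(P, T):
    """Return the label sequence Y* maximising S(X, Y) by dynamic programming."""
    n, k = P.shape
    score = P[0].copy()                        # best score of any path ending in each label
    back = np.zeros((n, k), dtype=int)         # back-pointers
    for i in range(1, n):
        cand = score[:, None] + T + P[i]       # rows: previous label, cols: current label
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]

P = np.random.randn(6, 5)                      # emission scores: 6 tokens, 5 labels
T = np.random.randn(5, 5)                      # transition scores between labels
y_star = viterbi(P, T)
print(y_star, sequence_score(P, T, y_star))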
Step five: named entity identification experiments were performed using sampled data sets and the efficiency of the experiments was assessed by accuracy, recall, and F1 values.
A second object of the present invention is to provide an entity identification system based on civil aviation non-civilized passengers, comprising:
a data preprocessing module: and carrying out data preprocessing on the acquired entity data. And extracting data, filing the data and cleaning the data. The entities are entities with specific significance in the text, eight named entities are defined according to the characteristics of civil aviation data, and entity labeling is carried out on the named entities.
An embedding module: word embedding, position embedding and label embedding are concatenated to enrich the input information.
Word embedding: a randomly initialized word embedding is used for each word in the input sentence; the dimension of the word embedding is d_w.
Position embedding: position embedding encodes the relative distance between each word and the two target entities in the sentence. We assume that words closer to a target entity carry more useful information about the relation; the dimension of the position embedding is d_p.
Label embedding: the dimension of the BIOES label embedding is d_t.
After the three embeddings are concatenated, a sentence is converted into a matrix X = [w_1, w_2, ..., w_n] used as the input representation, where each column vector w_i ∈ ℝ^(d_w + d_p + d_t).
A neural network module: inputs into a Bi-LSTM neural network. The forward LSTM and the backward LSTM are combined to form a Bi-LSTM, which can efficiently use past and future input information, automatically extract context features, and output a prediction score for each label. The LSTM model consists of the input vector x_t at time t, the cell state c_t, the temporary cell state c̃_t, the hidden state h_t, the forget gate f_t, the memory (input) gate i_t and the output gate o_t:
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
The Bi-LSTM hidden state at time t concatenates the hidden states of the forward and backward LSTMs, h_t = [h_t(forward); h_t(backward)].
Here σ and tanh denote two different activation functions; x_t is the input vector at time t, and h_t is a state vector that stores the valid information at time t. U denotes the weight matrix applied to the input vector, W denotes the weight matrix applied to the hidden state, and b denotes the bias of each gate.
An information processing module: the output of the neural network is input into a CRF layer. The Bi-LSTM output could be labeled directly with a Softmax layer, but because the Softmax layer labels each position independently, it may produce an inappropriate label sequence. The CRF layer is therefore added: it can use sentence-level label information and model the transition behavior between every two different labels.
For an input text sentence X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the corresponding predicted output label sequence, where y_i denotes the index of the predicted NER label. The score of a predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix. The partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
where Y_X denotes the set of all possible label sequences. The probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
The training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
The output sequence with the maximum score, Y*, is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
During decoding, the Viterbi algorithm is used to evaluate all possible label sequences and find the highest-scoring one.
An evaluation module: carries out named entity recognition experiments on the sampled data sets and evaluates the performance of the experiments by precision, recall and F1 value.
A third object of the present invention is to provide a computer program for implementing the above entity identification method based on civil aviation non-civilized passengers.
A fourth object of the present invention is to provide an information data processing terminal for implementing the above entity identification method based on civil aviation non-civilized passengers.
A fifth object of the present patent is to provide a computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the above-mentioned civil aviation non-civilized passenger-based entity identification method.
The invention has the advantages and positive effects that:
by adopting the technical scheme, the invention has the following technical effects:
the civil aviation non-civilized traveler entity identification method adopts a civil aviation traveler random record text crawled from credit China and aviation coordination network, defines eight kinds of named entities, carries out BIOES marking on the named entities, then embeds connecting words, embeds positions and embeds BIOES as input, not only considers the characteristic that one text record of the civil aviation non-civilized traveler has a plurality of attributes, but also fully utilizes various information of the text record, and establishes the civil aviation non-civilized traveler entity identification method. And finally, introducing a neural network and a CRF for training and identifying. The output of the model is the entities in the text sequence and their corresponding tags. Compared with the traditional method, the model accuracy and the recall rate are greatly improved, and the method has important significance for constructing the civil aviation civilized traveler knowledge map.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the present invention;
fig. 2 is a block diagram of a preferred embodiment of the present invention.
Detailed Description
In order to further understand the contents, features and effects of the present invention, the following embodiments are illustrated and described in detail with reference to the accompanying drawings.
The method divides the identified entities into eight types, labels the entities using the BIOES scheme, then concatenates character embedding, position embedding and label embedding, and identifies the entity types by using a bidirectional long short-term memory network and a conditional random field to obtain the entity labels.
Referring to fig. 1 and fig. 2, the specific scheme is:
an entity identification method based on civil aviation non-civilized passengers aims at various non-compliance behavior problems of calling and answering on-board of civil aviation passengers, disturbing other passengers and the like, and identifies entities in civil aviation passenger data. Firstly, defining eight named entities according to the characteristics of civil aviation passenger information; secondly, labeling the entity by using BIOES (B-begin, I-side, O-outside, E-end, S-single) label; then embedding a connecting word, embedding a position and embedding BIOES, and inputting a bidirectional long-short term memory network (Bi-LSTM); and finally, embedding a Conditional Random Field (CRF) into the neural network, and processing the output of the neural network to obtain an entity tag. The system for realizing the method comprises data preprocessing, an input embedding module, a Bi-LSTM module and a CRF module. According to the invention, the characters BIOES label embedding and the characters embedding in the named entity recognition task are connected in series in position embedding, so that the input expression is enriched, and the entity recognition efficiency is improved. The method can be applied to civil aviation non-civilized passenger data, and can accurately identify the entity information of the civil aviation non-civilized passengers.
The entity identification process is shown in fig. 1 and comprises the following steps:
Step one: carry out data preprocessing on the acquired entity data, namely data extraction, data archiving and data cleaning. According to the experimental target, the entity information of non-civilized passengers is divided into eight categories, and entity labeling is carried out on them, as shown in Tables 1 and 2.
TABLE 1 Boson entity categories of civil aviation passenger information
(the table content is provided as an image in the original publication and is not reproduced here)
TABLE 2 Entity labeling of civil aviation non-civilized passengers
(the table content is provided as an image in the original publication and is not reproduced here)
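For illustration, with eight entity categories and the BIOES scheme the model has to distinguish 8 × 4 + 1 = 33 labels (the extra one being O). The category names below are hypothetical placeholders, since Tables 1 and 2 are only available as images in the original publication.

CATEGORIES = ["PER", "FLT", "TIME", "LOC", "ACT", "LEVEL", "PUNISH", "RULE"]  # placeholder names
PREFIXES = ["B", "I", "E", "S"]

labels = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in PREFIXES]
label2id = {lab: i for i, lab in enumerate(labels)}
print(len(labels), label2id["B-PER"])   # 33 1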
Step two: word embedding, position embedding and label embedding are connected in series, and input information is enriched.
Word embedding: using word embedding with random initialization for each word in the input sentence, the dimension of word embedding is dw
Position embedding: position embedding is used to encode the relative distance between each word and two target entities in the sentence. We believe that more useful information about relationships is hidden in words that are closer to the target entity, with the dimension d of the position embeddingp
Embedding a label: dimension of BIOES tag embedding is dt
After the three embeddings are concatenated together, a sentence is converted into a matrix X ═ w1,w2,...,wn]As an input representation, wherein the column vectors
Figure BDA0002699135370000081
Step three: input into a Bi-LSTM neural network. The forward LSTM and backward LSTM combine to form a Bi-LSTM, which network is efficientUsing past and future input information and automatically extracting context features, a predicted score for each label is output. Wherein the LSTM model is formed by the input word x at time ttCell state ctTemporary cell state
Figure BDA0002699135370000085
Hidden layer state htForgetting door ftMemory door itOutput gate otAnd (4) forming.
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
The Bi-LSTM hidden state at time t concatenates the hidden states of the forward and backward LSTMs, h_t = [h_t(forward); h_t(backward)].
Here σ and tanh denote two different activation functions; x_t is the input vector at time t, and h_t is a state vector that stores the valid information at time t. U denotes the weight matrix applied to the input vector, W denotes the weight matrix applied to the hidden state, and b denotes the bias of each gate.
Step four: the output of the neural network is input into a CRF layer, a Bi-LSTM can be marked by using a Softmax layer, and since the Softmax layer marks each position respectively, an inappropriate label sequence can be obtained, so that the CRF layer needs to be added, sentence-level label information can be used, and excessive behaviors of each two different labels can be modeled.
For an input text sentence X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the corresponding predicted output label sequence, where y_i denotes the index of the predicted NER label. The score of a predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix. The partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
where Y_X denotes the set of all possible label sequences. The probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
The training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
The output sequence with the maximum score, Y*, is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
During decoding, the Viterbi algorithm is used to evaluate all possible label sequences and find the highest-scoring one.
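Complementing the decoding sketch given in the disclosure, the training objective log(P(Y*|X)) = S(X, Y*) − log(Z) needs the partition function Z, which is computed with the forward algorithm rather than by enumerating all label sequences. A NumPy/SciPy sketch under the same simplifications (no start/stop transitions; matrix sizes illustrative):

import numpy as np
from scipy.special import logsumexp

def log_partition(P, T):
    """log Z via the forward algorithm over emission scores P (n x k) and transitions T (k x k)."""
    n, k = P.shape
    alpha = P[0].copy()                                    # log-scores of length-1 prefixes
    for i in range(1, n):
        # alpha'[j] = logsumexp over prev of (alpha[prev] + T[prev, j]) + P[i, j]
        alpha = logsumexp(alpha[:, None] + T, axis=0) + P[i]
    return logsumexp(alpha)

def neg_log_likelihood(P, T, y):
    """Training loss -log P(Y*|X) = log Z - S(X, Y*) for the gold sequence y."""
    s = sum(P[i, y[i]] for i in range(len(y)))
    s += sum(T[y[i - 1], y[i]] for i in range(1, len(y)))
    return log_partition(P, T) - s

P = np.random.randn(6, 5)          # 6 tokens, 5 labels (illustrative)
T = np.random.randn(5, 5)
print(neg_log_likelihood(P, T, [0, 1, 2, 1, 0, 3]))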
Step five: named entity identification experiments were performed using sampled data sets and the efficiency of the experiments was assessed by accuracy, recall, and F1 values.
An entity identification system based on civil aviation non-civilized passengers, comprising:
a data preprocessing module: and carrying out data preprocessing on the acquired entity data. And extracting data, filing the data and cleaning the data. The entities are entities with specific significance in the text, eight named entities are defined according to the characteristics of civil aviation data, and entity labeling is carried out on the named entities.
An embedding module: word embedding, position embedding and label embedding are concatenated to enrich the input information.
Word embedding: a randomly initialized word embedding is used for each word in the input sentence; the dimension of the word embedding is d_w.
Position embedding: position embedding encodes the relative distance between each word and the two target entities in the sentence. We assume that words closer to a target entity carry more useful information about the relation; the dimension of the position embedding is d_p.
Label embedding: the dimension of the BIOES label embedding is d_t.
After the three embeddings are concatenated, a sentence is converted into a matrix X = [w_1, w_2, ..., w_n] used as the input representation, where each column vector w_i ∈ ℝ^(d_w + d_p + d_t).
A neural network module: inputs into a Bi-LSTM neural network. The forward LSTM and the backward LSTM are combined to form a Bi-LSTM, which can efficiently use past and future input information, automatically extract context features, and output a prediction score for each label. The LSTM model consists of the input vector x_t at time t, the cell state c_t, the temporary cell state c̃_t, the hidden state h_t, the forget gate f_t, the memory (input) gate i_t and the output gate o_t:
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
The Bi-LSTM hidden state at time t concatenates the hidden states of the forward and backward LSTMs, h_t = [h_t(forward); h_t(backward)].
Here σ and tanh denote two different activation functions; x_t is the input vector at time t, and h_t is a state vector that stores the valid information at time t. U denotes the weight matrix applied to the input vector, W denotes the weight matrix applied to the hidden state, and b denotes the bias of each gate.
An information processing module: the output of the neural network is input into a CRF layer. The Bi-LSTM output could be labeled directly with a Softmax layer, but because the Softmax layer labels each position independently, it may produce an inappropriate label sequence. The CRF layer is therefore added: it can use sentence-level label information and model the transition behavior between every two different labels.
For an input text sentence X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the corresponding predicted output label sequence, where y_i denotes the index of the predicted NER label. The score of a predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix. The partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
where Y_X denotes the set of all possible label sequences. The probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
The training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
The output sequence with the maximum score, Y*, is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
During decoding, the Viterbi algorithm is used to evaluate all possible label sequences and find the highest-scoring one.
An evaluation module: carries out named entity recognition experiments on the sampled data sets and evaluates the performance of the experiments by precision, recall and F1 value.
A computer program for implementing the above entity identification method based on civil aviation non-civilized passengers.
An information data processing terminal for implementing the above entity identification method based on civil aviation non-civilized passengers.
A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the above-described civil aviation non-civilized passenger-based entity identification method.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, it can be realized as a computer program product that includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced, in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the scope of the technical solution of the present invention.

Claims (4)

1. An entity identification method based on civil aviation non-civilized passengers is characterized by comprising the following steps:
firstly, carrying out data preprocessing on the acquired entity data, namely extracting, archiving and cleaning the data; according to the experimental target, dividing the entity information of the non-civilized passengers into 8 categories, and carrying out entity labeling on the categories;
secondly, character embedding, position embedding and label embedding are connected in series, so that input information is enriched; the method specifically comprises the following steps:
word embedding: using a randomly initialized word embedding for each word in the input sentence, the dimension of the word embedding being d_w;
position embedding: encoding the relative distance between each word and the two target entities in the sentence using position embedding, the dimension of which is d_p;
label embedding: the dimension of the BIOES label embedding being d_t;
concatenating the three embeddings and converting a sentence into a matrix X = [w_1, w_2, ..., w_n] as the input representation, wherein each column vector w_i ∈ ℝ^(d_w + d_p + d_t);
inputting into a Bi-LSTM neural network, wherein a forward LSTM and a backward LSTM are combined to form the Bi-LSTM, and the Bi-LSTM network uses past and future input information, automatically extracts context features, and outputs a prediction score for each label, the LSTM model consisting of the input vector x_t at time t, the cell state c_t, the temporary cell state c̃_t, the hidden layer state h_t, the forget gate f_t, the memory gate i_t and the output gate o_t;
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
h_t = [h_t(forward); h_t(backward)], the Bi-LSTM hidden state concatenating the forward and backward LSTM hidden states;
σ and tanh represent two different activation functions, x_t is the input vector at time t, h_t is the hidden layer state, U represents the weight matrix of the input vector, W represents the weight matrix of the hidden state, and b represents the bias of each gate;
inputting the output of the neural network into a CRF layer; the Bi-LSTM output could be labeled with a Softmax layer, but because the Softmax layer labels each position independently it may produce an inappropriate label sequence, so the CRF layer uses sentence-level label information and models the transition behavior between every two different labels;
an input text sentence is X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the predicted output label sequence corresponding thereto, and y_i denotes the index of the predicted NER label; the score of the predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix; the partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
wherein Y_X represents all possible label sequences, and the probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
the training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
the output sequence Y* with the maximum score is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
Performing partition function evaluation on all possible labels by using a Viterbi algorithm in a decoding process;
and step five, carrying out named entity recognition experiments by using the sampled data sets, and evaluating the performance of the experiments through precision, recall and F1 values.
2. An entity identification system based on civil aviation non-civilized passengers, characterized by comprising:
a data preprocessing module: carrying out data preprocessing on the acquired entity data; extracting, archiving and cleaning the data; according to the experimental target, dividing the entity information of the non-civilized passengers into 8 categories, and carrying out entity labeling on the categories;
embedding a module: word embedding, position embedding and label embedding are connected in series, so that input information is enriched; the method specifically comprises the following steps:
word embedding: using a randomly initialized word embedding for each word in the input sentence, the dimension of the word embedding being d_w;
position embedding: encoding the relative distance between each word and the two target entities in the sentence using position embedding, the dimension of which is d_p;
label embedding: the dimension of the BIOES label embedding being d_t;
concatenating the three embeddings and converting a sentence into a matrix X = [w_1, w_2, ..., w_n] as the input representation, wherein each column vector w_i ∈ ℝ^(d_w + d_p + d_t);
a neural network module: inputting into a Bi-LSTM neural network, wherein the forward LSTM and the backward LSTM are combined to form the Bi-LSTM, and the Bi-LSTM network can use past and future input information, automatically extract context features and output a prediction score for each label, the LSTM model consisting of the input vector x_t at time t, the cell state c_t, the temporary cell state c̃_t, the hidden layer state h_t, the forget gate f_t, the memory gate i_t and the output gate o_t;
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)
h_t = [h_t(forward); h_t(backward)], the Bi-LSTM hidden state concatenating the forward and backward LSTM hidden states;
σ and tanh represent two different activation functions, x_t is the input vector at time t, h_t is the hidden layer state, U represents the weight matrix of the input vector, W represents the weight matrix of the hidden state, and b represents the bias of each gate;
an information processing module: inputting the output of the neural network into a CRF layer; the Bi-LSTM output could be labeled with a Softmax layer, but because the Softmax layer labels each position independently it may produce an improper label sequence, so the CRF layer is added, which uses sentence-level label information and models the transition behavior between every two different labels;
an input text sentence is X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} is the predicted output label sequence corresponding thereto, and y_i denotes the index of the predicted NER label; the score of the predicted output label sequence is calculated as:
S(X, Y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where P denotes the emission score matrix of shape (k, n) output by the Bi-LSTM, with one score per label and word, and T is the label transition score matrix; the partition function is:
Z = Σ_{Ỹ ∈ Y_X} exp(S(X, Ỹ))
wherein Y_X represents all possible label sequences, and the probability of a label sequence Y ∈ Y_X is calculated as:
P(Y | X) = exp(S(X, Y)) / Z
the training goal is to maximize the conditional log-probability of the correct label sequence Y*:
log(P(Y* | X)) = S(X, Y*) − log(Z)
the output sequence Y* with the maximum score is computed as:
Y* = argmax_{Ỹ ∈ Y_X} S(X, Ỹ)
Performing partition function evaluation on all possible labels by using a Viterbi algorithm in a decoding process;
an evaluation module: carrying out named entity recognition experiments by using the sampled data set, and evaluating the performance of the experiments through precision, recall and F1 values.
3. An information data processing terminal for implementing the civil aviation non-civilized passenger-based entity identification method of claim 1.
4. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the civil aviation non-civilized passenger-based entity identification method of claim 1.
CN202011016160.7A 2020-09-24 2020-09-24 Entity identification method and system based on civil aviation non-civilized passengers Active CN112084783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011016160.7A CN112084783B (en) 2020-09-24 2020-09-24 Entity identification method and system based on civil aviation non-civilized passengers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011016160.7A CN112084783B (en) 2020-09-24 2020-09-24 Entity identification method and system based on civil aviation non-civilized passengers

Publications (2)

Publication Number Publication Date
CN112084783A CN112084783A (en) 2020-12-15
CN112084783B true CN112084783B (en) 2022-04-12

Family

ID=73739018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011016160.7A Active CN112084783B (en) 2020-09-24 2020-09-24 Entity identification method and system based on civil aviation non-civilized passengers

Country Status (1)

Country Link
CN (1) CN112084783B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536182A (en) * 2021-07-12 2021-10-22 广州万孚生物技术股份有限公司 Method and device for generating long text webpage, electronic equipment and storage medium
CN113724882A (en) * 2021-08-30 2021-11-30 康键信息技术(深圳)有限公司 Method, apparatus, device and medium for constructing user portrait based on inquiry session

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN109145286A (en) * 2018-07-02 2019-01-04 昆明理工大学 Based on BiLSTM-CRF neural network model and merge the Noun Phrase Recognition Methods of Vietnamese language feature
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
CN111090987A (en) * 2019-12-27 2020-05-01 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN111178074A (en) * 2019-12-12 2020-05-19 天津大学 Deep learning-based Chinese named entity recognition method
CN111191452A (en) * 2019-12-24 2020-05-22 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway text named entity recognition method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11574122B2 (en) * 2018-08-23 2023-02-07 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN109871535B (en) * 2019-01-16 2020-01-10 四川大学 French named entity recognition method based on deep neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN109145286A (en) * 2018-07-02 2019-01-04 昆明理工大学 Based on BiLSTM-CRF neural network model and merge the Noun Phrase Recognition Methods of Vietnamese language feature
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
CN111178074A (en) * 2019-12-12 2020-05-19 天津大学 Deep learning-based Chinese named entity recognition method
CN111191452A (en) * 2019-12-24 2020-05-22 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway text named entity recognition method and device
CN111090987A (en) * 2019-12-27 2020-05-01 北京百度网讯科技有限公司 Method and apparatus for outputting information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A Method of Chinese Tourism Named Entity Recognition Based on BBLC Model";Leyi Xue etc.;《2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation》;20200409;全文 *

Also Published As

Publication number Publication date
CN112084783A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN111709241B (en) Named entity identification method oriented to network security field
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN110532557B (en) Unsupervised text similarity calculation method
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN112560478B (en) Chinese address Roberta-BiLSTM-CRF coupling analysis method using semantic annotation
CN110532381B (en) Text vector acquisition method and device, computer equipment and storage medium
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111291195A (en) Data processing method, device, terminal and readable storage medium
CN112084783B (en) Entity identification method and system based on civil aviation non-civilized passengers
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN112084779B (en) Entity acquisition method, device, equipment and storage medium for semantic recognition
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
CN114997288A (en) Design resource association method
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN113158659A (en) Case-related property calculation method based on judicial text
Zhu et al. Design of knowledge graph retrieval system for legal and regulatory framework of multilevel latent semantic indexing
CN113420119B (en) Intelligent question-answering method, device, equipment and storage medium based on knowledge card
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant