CN108595430B

CN108595430B - Flight change information extraction method and system

Info

Publication number: CN108595430B
Application number: CN201810385920.8A
Authority: CN
Inventors: 汪政; 张勇; 金丽丽; 苏达鼐; 曹媛媛; 汪庆; 陈之彦
Original assignee: Ctrip Travel Network Technology Shanghai Co Ltd
Current assignee: Ctrip Travel Network Technology Shanghai Co Ltd
Priority date: 2018-04-26
Filing date: 2018-04-26
Publication date: 2022-02-22
Anticipated expiration: 2038-04-26
Also published as: CN108595430A

Abstract

The invention discloses a method and a system for extracting flight change information. The extraction method comprises the following steps: S1, constructing a dictionary from a plurality of pieces of flight change sample information; S2, pre-segmenting the flight change sample information according to the dictionary to obtain a word segmentation training corpus; S3, training a first CRF model with the word segmentation training corpus to obtain a Chinese word segmentation model; S4, inputting the flight change sample information into the Chinese word segmentation model to obtain an entity training corpus; S5, training a second CRF model with the entity training corpus to obtain an entity analysis model; and S6, inputting flight change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the preset flight entity information. The method improves the efficiency of recognizing flight entity information. Tests on a plurality of pieces of flight change test sample information show that the Chinese word segmentation model and the entity analysis model trained on the flight change sample information extract flight entity information considerably more accurately than a template matching method based on regular expressions.

Description

Flight change information extraction method and system

Technical Field

The invention relates to the field of text information processing, and in particular to a method and a system for extracting flight change information.

Background

After a traveler books an air ticket through an OTA (online travel agency), the original flight may need to be adjusted because of weather, air traffic control, aircraft maintenance, flight scheduling and the like. Such adjustments include a change of aircraft type or route, cancellation, advancement, interruption, delay or postponement of the flight. In these cases the OTA server usually sends the related flight change information to the traveler's terminal device as a reminder.

The flight change information usually includes entities and entity relationships, and the entities and entity relationships include passenger names (conventional Chinese names, transliterated foreign names, minority names, and the like), flight numbers, airports, dates, times, and other information, and relationships between related information before and after flight changes.

In the prior art, flight change information is analyzed by merging similar messages into classes and manually configuring a regular expression matching template for each class. This template matching method has the following problems:

the analysis of the flight change information is inaccurate, so that part of the valid information is lost;

the templates of flight change information change frequently and are difficult to maintain;

messages that do not match any template have to be maintained and analyzed manually, which involves a heavy workload.

The conventional technical solution to these problems is to apply Chinese word segmentation and named entity recognition from natural language processing. However, many open-source natural language processing frameworks are trained on open-source corpora and cannot be applied well to the analysis of flight change information in the OTA field. The existing open-source schemes mainly have the following difficulties:

the granularity of word segmentation cannot adapt to the application scenario; and named entity recognition cannot distinguish entity relationships, such as the related information content before and after a flight change and the relationships between these contents.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, namely the low efficiency and low accuracy of recognizing flight entity information in flight change information, and provides a flight change information extraction method.

The invention solves the technical problems through the following technical scheme:

A method for extracting flight change information comprises the following steps:

S1, constructing a dictionary from a plurality of pieces of flight change sample information;

S2, pre-segmenting the flight change sample information according to the dictionary to obtain a word segmentation training corpus; setting a label sequence for the word segmentation training corpus, and constructing feature vectors for the word segmentation training corpus;

S3, training a first CRF (conditional random field) model with the feature vectors and the label sequence of the word segmentation training corpus to obtain a Chinese word segmentation model, wherein the Chinese word segmentation model is used for segmenting flight change information according to the dictionary to obtain word segmentation information;

S4, inputting the flight change sample information into the Chinese word segmentation model to obtain an entity training corpus, labeling the entity training corpus according to preset flight entity information to obtain a labeling sequence, and constructing feature vectors for the entity training corpus, wherein the preset flight entity information represents the change information of flights;

S5, training a second CRF model with the feature vectors and the labeling sequence of the entity training corpus to obtain an entity analysis model, wherein the entity analysis model is used for analyzing the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information;

and S6, inputting flight change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the preset flight entity information.

Preferably, the step S2 includes:

S21, pre-segmenting the flight change sample information to obtain a pre-segmentation corpus;

S22, merging the pre-segmentation corpus according to the dictionary to obtain the word segmentation training corpus, wherein the word segmentation training corpus comprises a plurality of first segmented words;

S23, setting labels for the first segmented words respectively to obtain the label sequence;

S24, each first segmented word comprises a plurality of characters; constructing features for each character, constructing the feature vector of each first segmented word from the features of its characters and the dictionary, and constructing the feature vectors of the word segmentation training corpus from the feature vectors of the first segmented words.

Preferably, the entity training corpus includes a plurality of second segmented words, and the step of constructing feature vectors for the entity training corpus in step S4 includes:

constructing features for the plurality of second segmented words respectively, and constructing the feature vectors of the entity training corpus from the features of the second segmented words and the dictionary.

Preferably, in the step S1, before the dictionary is constructed, the traditional Chinese encoding of the plurality of pieces of flight change sample information is converted into the simplified Chinese encoding.

Preferably, the step S4 further includes:

labeling the entity training corpus according to a role table of name components to obtain a role labeling sequence, and training an HMM (hidden Markov model) with the role labeling sequence to obtain a name recognition model;

the step S6 includes:

inputting flight change information that contains name information into the HMM model, and solving with the Viterbi algorithm (a dynamic programming algorithm) to obtain the name information in the flight change information.

An aeroderivative information extraction system, the aeroderivative information extraction system comprising:

the dictionary construction module is used for constructing a dictionary according to the plurality of aviation variation sample information;

the word segmentation characteristic construction module is used for carrying out word segmentation on the aviation change sample information according to the dictionary to obtain word segmentation training corpora; the system is also used for setting a label sequence for the participle training corpus and constructing a feature vector for the participle training corpus;

the Chinese word segmentation model training module is used for training a first CRF (conditional random field) model with the feature vectors and the label sequence of the word segmentation training corpus to obtain a Chinese word segmentation model, and the Chinese word segmentation model is used for segmenting flight change information according to the dictionary to obtain word segmentation information;

the entity feature construction module is used for inputting aviation variation sample information to the Chinese word segmentation model to obtain an entity training corpus, marking the entity training corpus according to preset flight entity information to obtain a marking sequence, and constructing a feature vector for the entity training corpus, wherein the preset flight entity information is used for representing variation information of flights;

the entity analysis model training module is used for training a second CRF model with the feature vectors and the labeling sequence of the entity training corpus to obtain an entity analysis model, and the entity analysis model is used for analyzing the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information;

and the entity analysis module is used for sequentially inputting the flight variation information into the Chinese word segmentation model and the entity analysis model to obtain the content of the preset flight entity information.

Preferably, the word segmentation feature construction module is further configured to perform word pre-segmentation on the aviation variation sample information to obtain a pre-segmented word corpus; the word segmentation training corpus is also used for merging the pre-word segmentation corpus according to the dictionary to obtain a word segmentation training corpus, and the word segmentation training corpus comprises a plurality of first words; setting labels for a plurality of first participles respectively to obtain a label sequence, wherein each first participle comprises a plurality of characters;

the word segmentation feature construction module is further used for constructing features for each character, constructing a feature vector of the first segmentation by using the features of the characters and the dictionary, and constructing a feature vector of the word segmentation training corpus by using the feature vector of the first segmentation.

Preferably, the entity training corpus includes a plurality of second participles, and the entity feature construction module is further configured to respectively construct features for the plurality of second participles, and construct feature vectors of the entity training corpus by using the features of the second participles and the dictionary.

Preferably, the dictionary construction module is further configured to convert the traditional chinese coding format of the plurality of aviation change sample information into the simplified chinese coding format before constructing the dictionary.

Preferably, the entity feature construction module is further configured to label the entity training corpus according to a role table formed by names to obtain a role labeling sequence, and train an HMM model by using the role labeling sequence to obtain a name recognition model;

the entity analysis module is further used for inputting flight change information that contains name information into the HMM model, and the name information in the flight change information is obtained by solving with the Viterbi algorithm.

The positive effects of the invention are as follows:

The flight change information extraction method constructs a dictionary from a plurality of pieces of flight change sample information; trains a first CRF model with the flight change sample information according to the dictionary to obtain a Chinese word segmentation model; segments the flight change sample information with the Chinese word segmentation model to obtain an entity training corpus, and trains a second CRF model with the entity training corpus to obtain an entity analysis model; and inputs flight change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the flight entity information. The method recognizes and extracts flight entity information automatically through the two models, which improves the efficiency of recognizing flight entity information. Tests on a plurality of pieces of flight change test sample information show that the models trained on the flight change sample information extract flight entity information considerably more accurately than a template matching method based on regular expressions.

Drawings

Fig. 1 is a flowchart of the flight change information extraction method according to embodiment 1 of the present invention.

Fig. 2 is a flowchart of step 102 of the flight change information extraction method according to embodiment 1 of the present invention.

Fig. 3 is a schematic diagram of the structure of the SEG-CRF features in the flight change information extraction method according to embodiment 1 of the present invention.

Fig. 4 is a schematic diagram of the structure of the NER-CRF features in the flight change information extraction method according to embodiment 1 of the present invention.

Fig. 5 is a schematic block diagram of the flight change information extraction system according to embodiment 2 of the present invention.

Detailed Description

The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.

A flight change is an adjustment of the original flight caused by weather, air traffic control, aircraft maintenance, flight scheduling and the like, including a change of aircraft type or route, cancellation, advancement, interruption, delay or postponement of the flight. The information that has to be sent to the traveler when a flight changes is referred to as flight change information.

Example 1

The present embodiment provides a method for extracting flight change information. As shown in Fig. 1, the method includes:

Step 101, constructing a dictionary from a plurality of pieces of flight change sample information.

The text of the flight change sample information in this embodiment is preferably Chinese, which includes traditional Chinese and simplified Chinese. Before the dictionary is constructed, traditional Chinese in the flight change sample information can be converted into simplified Chinese. First, it is judged whether the Chinese text in the original flight change sample information is simplified; if it is not, it is converted into simplified Chinese, otherwise no conversion is performed. Full-width characters are then converted to half-width characters.
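The full-width to half-width step can be illustrated with a minimal Python sketch. The function name `to_halfwidth` is hypothetical, and the sketch assumes that only the Unicode full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space (U+3000) need to be mapped; traditional-to-simplified conversion is not covered here.

```python
def to_halfwidth(text):
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:   # full-width '!' .. '~'
            code -= 0xFEE0               # fixed offset to the ASCII counterpart
        out.append(chr(code))
    return "".join(out)


# Hypothetical sample text: a flight number and a time typed in full-width form.
sample = to_halfwidth("ＭＨ３５０　１３：４０")
```

Normalizing the width before the dictionary is built means that a flight number or time typed in either form maps to the same dictionary entry.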

The word segmentation granularity of the prior art cannot adapt to the application scenario. In analyzing flight change information, entities such as airports and dates should be segmented into words that are as long as possible. For example, "Shanghai Pudong Airport" should not be split into the three separate words "Shanghai", "Pudong" and "Airport", so "Shanghai Pudong Airport" is entered in the dictionary as one word when the dictionary is constructed.

Step 102, pre-segmenting the flight change sample information according to the dictionary to obtain a word segmentation training corpus; and setting a label sequence for the word segmentation training corpus and constructing feature vectors for the word segmentation training corpus.

Preferably, as shown in fig. 2, step 102 specifically includes:

Step 1021, pre-segmenting the flight change sample information to obtain a pre-segmentation corpus.

In this embodiment, the open-source HanLP (a natural language processing toolkit) word segmentation tool is used to pre-segment the flight change sample information. For example, if the sample information contains "Shanghai Hongqiao Airport", pre-segmentation yields the three words "Shanghai", "Hongqiao" and "Airport".

Step 1022, merging the pre-segmentation corpus according to the dictionary to obtain the word segmentation training corpus, which comprises a plurality of first segmented words.

Words such as airports, flight numbers, times and dates are merged and cleaned according to the entries of the dictionary constructed in step 101. For example, the words "Shanghai", "Hongqiao" and "Airport" are merged into the single word "Shanghai Hongqiao Airport", and the six words "2017", "year", "3", "month", "25" and "day" are merged into one word representing a specific date.

In other words, wherever the three words "Shanghai", "Hongqiao" and "Airport" appear consecutively in the pre-segmentation corpus, they are merged according to the dictionary, so that only the single word "Shanghai Hongqiao Airport" appears and the three separate words do not. The corpus obtained after merging is the word segmentation training corpus.
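The merging in step 1022 can be sketched as a greedy longest match over consecutive pre-segmented tokens. This is only an illustration under stated assumptions: the function `merge_by_dictionary`, the `max_span` limit and the date pattern `DATE_RE` are hypothetical, and the patent does not specify how dates are matched against the dictionary.

```python
import re

# Hypothetical pattern for dates such as "2017年3月25日" (an assumption, not from the source).
DATE_RE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")


def merge_by_dictionary(tokens, dictionary, max_span=6):
    """Greedily merge the longest run of consecutive tokens found in the dictionary."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, down to a span of two tokens.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in dictionary or DATE_RE.fullmatch(candidate):
                merged.append(candidate)
                i += span
                break
        else:
            # No merge applies: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


dictionary = {"上海虹桥机场", "上海浦东机场"}
airport = merge_by_dictionary(["上海", "虹桥", "机场", "延误"], dictionary)
date = merge_by_dictionary(["2017", "年", "3", "月", "25", "日"], dictionary)
```

Trying the longest span first matches the stated goal of keeping entities such as airports and dates as long as possible.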

Step 1023, setting labels for the first segmented words to obtain the label sequence.

Step 1024, each first segmented word comprises a plurality of characters; constructing features for each character, constructing the feature vector of each first segmented word from the features of its characters and the dictionary, and constructing the feature vectors of the word segmentation training corpus from the feature vectors of the first segmented words.

Each character is labeled using the BMES label system, i.e., B (beginning), M (middle), E (end) and S (single-character word), and all the labels together constitute the label sequence. For example, the six-character word "上海浦东机场" (Shanghai Pudong Airport) is labeled "BMMMME", and the single-character word "是" ("is") is labeled "S".
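As a minimal sketch of the BMES labeling described above, the following Python function turns a list of segmented words into a per-character label sequence. The function name `bmes_tags` is hypothetical.

```python
def bmes_tags(words):
    """Return one BMES label per character for a list of segmented words."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                                  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


labels = "".join(bmes_tags(["上海浦东机场", "是"]))
```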

The SEG-CRF features consider the current character together with the 2 characters before it and the 2 characters after it. As shown in Fig. 3, the SEG-CRF features include the character itself and information on the character type.

Take the sentence "Flight F3 has flown." as an example; in the original Chinese, "flight" is the two characters "航" and "班". Considering the 2 characters before and after a character and the character itself (5 character positions in total), and the 6 attributes per position shown in Fig. 3 (the character itself and its character type), the feature of each character is a 5 × 6 matrix. According to the SEG-CRF features, a feature is constructed for each character of the sentence:

Feature_vec = {
"航": { -2: [None, None, None, None, None, None],
-1: [None, None, None, None, None, None],
0: ["航", False, False, False, False, True],
1: ["班", False, False, False, False, True],
2: ["F", False, False, True, False, False]
},
"班": { -2: [None, None, None, None, None, None],
-1: ["航", False, False, False, False, True],
0: ["班", False, False, False, False, True],
1: ["F", False, False, True, False, False],
2: ["3", False, True, False, False, False]
},
……

Take the 2nd character "班" above as an example. "-2" is the index of the 2nd character before "班". Since the 1st character before "班" is "航" and there is no 2nd character before it, every attribute at index "-2" is set to None.

The 1st character before "班" is "航", with index "-1". Its feature is built from the attribute dimensions listed in Fig. 3: the character itself is "航"; it is not a space (False), not a digit (False), not a letter (False), not a punctuation mark (False), and it is a Chinese character, i.e. a character other than a space, digit, letter or punctuation mark (True).

Similarly, 0 is the index of "班" itself, 1 is the index of the 1st character after "班", and 2 is the index of the 2nd character after "班".

Taking "Flight F3 has flown." as the example again, the sentence has 7 characters in the original Chinese, each character takes the features of 5 characters (the 2 characters before it, the 2 after it, and itself), and each of these has 6 attributes, so the size of the feature matrix of the sentence is 7 × 5 × 6.

The constructed features are then converted into feature vectors: each character, digit or symbol is converted into its index (word_index) in the dictionary, False is converted to 0, True to 1, and None to -1. The feature of each character thus becomes a 5 × 6 numeric matrix.
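The window features and their numeric encoding can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, the character-type tests approximate the six attributes of Fig. 3, the sentence "航班F3已飞。" is an assumed Chinese form of the example, and `word_index` is a toy dictionary built from the sentence itself.

```python
import unicodedata

OFFSETS = (-2, -1, 0, 1, 2)  # 2 characters before, the character itself, 2 after


def char_attrs(ch):
    """Return [char, is_space, is_digit, is_letter, is_punct, is_chinese]."""
    is_space = ch.isspace()
    is_digit = ch.isdigit()
    is_letter = ch.isascii() and ch.isalpha()
    is_punct = unicodedata.category(ch).startswith("P")
    # As in the text: "Chinese" means none of the four categories above.
    is_chinese = not (is_space or is_digit or is_letter or is_punct)
    return [ch, is_space, is_digit, is_letter, is_punct, is_chinese]


def seg_features(sentence):
    """Build, for every character, a dict mapping each offset to 6 attributes."""
    feats = []
    for i in range(len(sentence)):
        window = {}
        for off in OFFSETS:
            j = i + off
            inside = 0 <= j < len(sentence)
            window[off] = char_attrs(sentence[j]) if inside else [None] * 6
        feats.append(window)
    return feats


def encode(feats, word_index):
    """Convert features to numbers: None -> -1, False -> 0, True -> 1, char -> index."""
    def num(value):
        if value is None:
            return -1
        if isinstance(value, bool):
            return int(value)
        return word_index[value]
    return [[[num(v) for v in window[off]] for off in OFFSETS] for window in feats]


sentence = "航班F3已飞。"  # assumed Chinese form of "Flight F3 has flown."
word_index = {ch: k for k, ch in enumerate(sorted(set(sentence)))}  # toy dictionary
features = seg_features(sentence)
matrix = encode(features, word_index)  # nested lists of size 7 x 5 x 6
```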

Step 103, training the first CRF model with the feature vectors and the label sequence of the word segmentation training corpus to obtain a Chinese word segmentation model, wherein the Chinese word segmentation model is used for segmenting flight change information according to the dictionary to obtain word segmentation information. Once the Chinese word segmentation model has been trained, "Shanghai Hongqiao Airport" in flight change information is predicted as one word rather than being split into three words.
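Assuming the trained segmentation model outputs one BMES label per character, the predicted labels can be turned back into words as sketched below. The function name `tags_to_words` is hypothetical, and the label string in the example is written by hand rather than produced by a model.

```python
def tags_to_words(chars, tags):
    """Group characters into words: a word ends at every E or S label."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


words = tags_to_words("上海虹桥机场是", "BMMMMES")
```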

Step 104, inputting the flight change sample information into the Chinese word segmentation model to obtain an entity training corpus, labeling the entity training corpus according to preset flight entity information to obtain a labeling sequence, and constructing feature vectors for the entity training corpus, wherein the preset flight entity information represents the change information of flights. As shown in Table 1, the preset flight entity information of this embodiment includes the original departure airport, original departure date, original departure time, original arrival airport, original arrival date, original arrival time, protected departure airport, protected departure date, protected departure time, protected arrival airport, protected arrival date and protected arrival time. The specific definition of the entity information can be changed flexibly according to actual requirements.

TABLE 1

For example, the flight change sample information contains the content "Due to weather, MH350 flying from Shanghai to Beijing at 13:40 has been adjusted to 14:50". After word segmentation, with the words separated by spaces, the information is: "13:40 MH350, which was flying from Shanghai to Beijing, has been adjusted to 14:50 for weather reasons". The segmented information is then labeled according to Table 1 to give the labeling sequence "(ODT, DEF, ODP, DEF, OAP, DEF, OF, DEF, PDT)".
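The relationship between segmented words and the labeling sequence can be illustrated with a short Python sketch. The label names come from the sequence above; the token list is a hypothetical segmentation of the example sentence, the function `collect_entities` is not from the source, and treating DEF as the label of non-entity words is an assumption.

```python
def collect_entities(tokens, labels, default_label="DEF"):
    """Group tokens by entity label, skipping tokens with the default label."""
    entities = {}
    for token, label in zip(tokens, labels):
        if label != default_label:
            entities.setdefault(label, []).append(token)
    return entities


# Hypothetical tokens; the real segmentation depends on the trained model.
tokens = ["13:40", "由", "上海", "飞往", "北京", "的", "MH350", "调整为", "14:50"]
labels = ["ODT", "DEF", "ODP", "DEF", "OAP", "DEF", "OF", "DEF", "PDT"]
entities = collect_entities(tokens, labels)
```

The same grouping applies to the labels predicted by the entity analysis model when new flight change information is analyzed.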

The entity training corpus comprises a plurality of second segmented words. Preferably, the step of constructing feature vectors for the entity training corpus specifically comprises: constructing features for the plurality of second segmented words respectively, and constructing the feature vectors of the entity training corpus from the features of the second segmented words and the dictionary.

Step 105, training a second CRF model with the feature vectors and the labeling sequence of the entity training corpus to obtain an entity analysis model, wherein the entity analysis model is used for analyzing the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information.

Take as an example the entity training corpus "13:00 flying from Shanghai Pudong Airport to Beijing", which consists of second segmented words separated by spaces (in order: "13:00", "by", "Shanghai Pudong airport", "fly to", "Beijing"). Features are constructed for all the second segmented words according to the NER-CRF feature attributes shown in Fig. 4, which include the word itself, the word type, the default part of speech, the relative position of keywords and other attributes. The features are constructed as follows:

Feature_vec＝{

“13:00”：{-2：[None,None,None,None,None,None,None,None,None,None,None],

-1：[None,None,None,None,None,None,None,None,None,None,None],

0：[“13:00”,False,True,False,False,False,False,False,False,False,-1],

1: [ "by", False, True, False, -1],

2: [ "Shanghai Pudong airport", False, True, False, True, -1]

}

"by": { -2: [None, None, None, None, None, None, None, None, None, None, None],

-1：[“13:00”,False,True,False,False,False,False,False,False,False,-1],

0: [ "by", False, True, False, -1],

1: [ "Shanghai Pudong airport", False, True, False, True, -1],

2: [ "fly to", False, True, False, -1]

}

……

And step 106, sequentially inputting the flight variation information into the Chinese word segmentation model and the entity analysis model to obtain the content of the preset flight entity information.

Step 104 further comprises:

labeling the entity training corpus according to a role table of name components to obtain a role labeling sequence, and training an HMM model with the role labeling sequence to obtain a name recognition model.

The words used in the context of a person's name are concentrated and highly regular, and their range is limited: the preceding context is generally a form of address, an honorific or a conjunction, such as "respected", and the following context is generally a word such as "passenger", "traveler" or "customer".

Here, all the words in a sentence are classified as internal components of a name, preceding context, following context, irrelevant words and so on. These classes are called the constituent roles of a name, and the specific role classification is shown in Table 2, the name constituent role table:

TABLE 2 name constitution role table

The data in the flight change sample information is pre-segmented to obtain a pre-segmentation corpus in which names are marked; the pre-segmentation corpus is then segmented by the Chinese word segmentation model and labeled with roles. An example of the pre-segmentation corpus is "亲爱的<name>张三</name>和<name>牛二花</name>等旅客" ("Dear <name>Zhang San</name>, <name>Niu Erhua</name> and other passengers").

After segmentation by the Chinese word segmentation model the result is (with "/" between words) "亲/爱/的/张/三/和/牛/二/花/等/旅/客". Because the names are already marked in the pre-segmentation corpus, it is known that "张三" and "牛二花" are names and that the other words are not. Labeling according to Table 2 then gives the role-labeled corpus, in which each word and its role label are separated by "|" and words are separated by spaces: "亲|A 爱|A 的|K 张|B 三|E 和|M 牛|B 二|C 花|D 等|L 旅|A 客|A".

The labeled corpus is used to train the HMM model, i.e. to calculate its 3 parameters: the initial state probability vector π, the state transition probability matrix A and the observation probability matrix B.

Let Q be the set of all possible states (i.e., the set of role labels) and V be the set of all possible observations (i.e., the set of words in the flight change information):

$Q = \{q_1, q_2, \ldots, q_N\}, \quad V = \{v_1, v_2, \ldots, v_M\}$

N is the number of states (the number of role labels). Here N = 10, i.e., as shown in Table 2, Q = {A, B, C, D, E, F, G, K, L, M}.

M is the number of possible observations (words in the flight change information).

A is the state transition probability matrix:

$A = [a_{ij}]_{N \times N}$

where the state transition probability $a_{ij} = P(i_{t+1} = q_j \mid i_t = q_i)$ is the probability of moving from the label $q_i$ of the t-th word to the label $q_j$ of the (t+1)-th word.

B is the observation probability matrix:

$B = [b_j(k)]_{N \times M}$

where the observation probability $b_j(k) = P(o_t = v_k \mid i_t = q_j)$ relates the t-th word $o_t$ to its label $i_t$.

π is the initial state probability vector:

$\pi = (\pi_i)$

where $\pi_i = P(i_1 = q_i)$ is the probability that label $q_i$ appears at the 1st position.

The specific calculation traverses all the training data (a data set of size S) in the pre-segmentation corpus once.

The state transition probabilities are obtained in this traversal. For example, if the label of "爱" ("love") is A and the label of the following word "的" is K, the transition probability from label A to label K is calculated by counting the number of transitions from A to K, #AK, and the number of occurrences of label A, #A; dividing #AK by #A gives the state transition probability from A to K. The other state transition probabilities are obtained in the same way.

The observation probabilities are obtained in the same traversal.

The observation probability is counted for each word. For example, count the total number of occurrences of the word "爱" in the data set, #爱, and the number of occurrences of "爱" with label A, #爱A; dividing #爱A by #爱 gives the observation probability that "爱" has label A.

Likewise, if the number of occurrences of "爱" with label E is #爱E, dividing #爱E by #爱 gives the observation probability that "爱" has label E.

The remaining observation probabilities are obtained by analogy.

In summary, a single traversal of the training data set yields the parameters of the HMM model: the initial state probability vector π, the state transition probability matrix A and the observation probability matrix B.
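The counting procedure can be sketched in Python as a single pass over role-labeled sentences. This is an illustration under assumptions: the function names are hypothetical, the input uses a "word|ROLE" format with words separated by spaces, the observation table is keyed by word and computed as count(word, label) / count(word) as the text describes, and the initial probabilities are estimated as the fraction of sentences that start with each label, which the text does not spell out.

```python
from collections import Counter, defaultdict


def parse_labeled(line):
    """Parse 'word|ROLE word|ROLE ...' into a list of (word, role) pairs."""
    return [tuple(item.rsplit("|", 1)) for item in line.split()]


def estimate_hmm(sentences):
    """Estimate pi, A and B by counting in one traversal of the data."""
    start, label_count = Counter(), Counter()
    trans, word_label = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        if not sent:
            continue
        start[sent[0][1]] += 1                      # label at the 1st position
        for word, role in sent:
            label_count[role] += 1
            word_label[word][role] += 1
        for (_, r1), (_, r2) in zip(sent, sent[1:]):
            trans[r1][r2] += 1                      # e.g. #AK
    total = sum(start.values())
    pi = {r: c / total for r, c in start.items()}
    # #AK / #A, where #A counts every occurrence of label A
    a = {r1: {r2: c / label_count[r1] for r2, c in nxt.items()}
         for r1, nxt in trans.items()}
    # count(word, label) / count(word), following the description in the text
    b = {w: {r: c / sum(roles.values()) for r, c in roles.items()}
         for w, roles in word_label.items()}
    return pi, a, b


corpus = [parse_labeled("亲|A 爱|A 的|K 张|B 三|E 和|M 牛|B 二|C 花|D 等|L 旅|A 客|A")]
pi, a, b = estimate_hmm(corpus)
```

With a single training sentence the estimates are degenerate; a real corpus would also need smoothing for unseen words and transitions, which this sketch omits.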

Correspondingly, step 106 further includes:

inputting flight change information that contains name information into the HMM model, and solving with the Viterbi algorithm to obtain the name information in the flight change information.
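A minimal Viterbi decoder, followed by a simple way to read names off the decoded role sequence, is sketched below. Everything specific here is an assumption for illustration: the toy probabilities are invented, the emission table is keyed by state and then by word, unseen events are floored at a small constant, and the role patterns "BE" and "BCD" are inferred from the labeled example (张|B 三|E, 牛|B 二|C 花|D) rather than taken from Table 2.

```python
import math
import re


def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    def lg(p):
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    score = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p.get(s, {}).get(obs[0], 0.0))
              for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p]
                       + lg(trans_p.get(p, {}).get(s, 0.0)))
            score[t][s] = (score[t - 1][prev]
                           + lg(trans_p.get(prev, {}).get(s, 0.0))
                           + lg(emit_p.get(s, {}).get(obs[t], 0.0)))
            back[t][s] = prev
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):   # follow back-pointers
        path.append(back[t][path[-1]])
    return path[::-1]


def extract_names(words, roles):
    """Join the words whose roles match surname+given-name patterns (assumed)."""
    role_string = "".join(roles)
    return ["".join(words[m.start():m.end()])
            for m in re.finditer(r"BCD|BE", role_string)]


# Toy parameters, invented for illustration only.
states = ["A", "K", "B", "E", "L"]
start_p = {"K": 0.5, "A": 0.5}
trans_p = {"K": {"B": 0.9, "A": 0.1}, "B": {"E": 0.6, "A": 0.05},
           "E": {"L": 0.8, "A": 0.2}, "A": {"A": 0.5, "K": 0.3, "B": 0.2}}
emit_p = {"K": {"的": 0.6}, "A": {"的": 0.2, "三": 0.1, "等": 0.1},
          "B": {"张": 0.5}, "E": {"三": 0.4}, "L": {"等": 0.7}}

words = ["的", "张", "三", "等"]
roles = viterbi(words, states, start_p, trans_p, emit_p)
names = extract_names(words, roles)
```

Working in log space keeps the products of small probabilities numerically stable for longer sentences.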

In summary, the aviation variation information extraction method constructs a dictionary from a plurality of aviation variation sample information; trains a first CRF model on the aviation variation sample information according to the dictionary to obtain a Chinese word segmentation model; segments the aviation variation sample information with the Chinese word segmentation model to obtain an entity training corpus, and trains a second CRF model on the entity training corpus to obtain an entity analysis model; and sequentially inputs the flight variation information into the Chinese word segmentation model and the entity analysis model to obtain the content of the flight entity information. In addition, an HMM model is trained on the entity training corpus to obtain a name recognition model, and the name recognition model and the Viterbi algorithm are used to recognize names in the flight variation information.

With the Chinese word segmentation model and the entity analysis model, the flight variation information extraction method automatically recognizes and extracts flight entity information, and with the name recognition model and the Viterbi algorithm it recognizes names in the flight variation information, which improves the efficiency of flight entity information recognition. Tests on a plurality of aviation variation test sample information show that the name recognition model, the Chinese word segmentation model and the entity analysis model trained on the aviation variation sample information extract flight entity information with much higher accuracy than a template matching method based on regular expressions.

Example 2

This embodiment provides an aeroderivative information extraction system. As shown in fig. 5, the system includes a dictionary construction module 201, a word segmentation feature construction module 202, a Chinese word segmentation model training module 203, an entity feature construction module 204, an entity analysis model training module 205, and an entity analysis module 206.

The dictionary construction module 201 is used for constructing a dictionary according to a plurality of aviation variation sample information. Before the dictionary is constructed, the traditional Chinese coding format of a plurality of aviation change sample information can be converted into the simplified Chinese coding format.
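A minimal sketch of the traditional-to-simplified conversion is given below, assuming the conversion is a character-level mapping. The table `T2S` and the function name `to_simplified` are hypothetical, and the table holds only a handful of characters for illustration; a real system would need a complete mapping, for instance from a dedicated conversion library.

```python
# Tiny illustrative table; a complete one would cover thousands of characters.
T2S = str.maketrans({
    "機": "机",
    "場": "场",
    "變": "变",
    "時": "时",
    "間": "间",
    "東": "东",
})


def to_simplified(text):
    """Replace each traditional character found in T2S; leave the rest as-is."""
    return text.translate(T2S)
```

Normalizing the sample information in this way before the dictionary is built means each term only has to be recorded once, in its simplified form.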

The word segmentation feature construction module 202 is configured to pre-segment the aviation variation sample information according to the dictionary to obtain a word segmentation training corpus, and is further configured to set a label sequence for the word segmentation training corpus and to construct a feature vector for it.

Preferably, the word segmentation feature construction module 202 is further configured to pre-segment the aviation variation sample information to obtain a pre-segmented corpus; to merge the pre-segmented corpus according to the dictionary to obtain the word segmentation training corpus, which comprises a plurality of first participles; and to set a label for each of the first participles to obtain the label sequence, each first participle comprising a plurality of characters. The word segmentation feature construction module 202 is further configured to construct a feature for each character, to construct the feature vector of each first participle from the features of its characters and the dictionary, and to construct the feature vector of the word segmentation training corpus from the feature vectors of the first participles.

Based on the dictionary constructed above, words denoting airports, flight numbers, times, dates and the like are merged and cleaned. For example, the three words "Shanghai", "Hongqiao" and "airport" are merged into the single word "Shanghai Hongqiao airport", and the six words "2017", "year", "3", "month", "25" and "day" are merged into one word representing a specific date.

That is, the pre-segmented corpus is merged (cleaned): if the three words "Shanghai", "Hongqiao" and "airport" appear consecutively in the pre-segmented corpus, they are merged according to the dictionary, so that only "Shanghai Hongqiao airport" appears and never the three separate words. The result of this processing is the word segmentation training corpus.
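The merging step can be sketched as a greedy longest-match over adjacent tokens. The function name `merge_by_dictionary`, the `max_span` limit and the use of a plain set as the dictionary are assumptions for illustration; the example tokens are the Chinese originals of "Shanghai", "Hongqiao" and "airport".

```python
def merge_by_dictionary(tokens, dictionary, max_span=6):
    """Greedily merge runs of adjacent tokens whose concatenation is in `dictionary`."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first so that the longest entry wins.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in dictionary:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token entry starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Example: the three consecutive tokens are merged into one dictionary word.
tokens = ["旅客", "上海", "虹桥", "机场", "起飞"]
print(merge_by_dictionary(tokens, {"上海虹桥机场"}))
```

Dates such as "2017 year 3 month 25 day" would more plausibly be merged by a pattern than by enumerating every date in the dictionary; that part is not shown here.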

Feature_vec = {
    "navigate": {
        -2: [None, None],
        -1: [None, None, None, None, None, None],
        0: ["ship", False, True],
        1: ["class", False, True],
        2: ["F", False, False, True, False, False]
    },
    "class": {
        -2: [None, None],
        -1: ["ship", False, True],
        0: ["class", False, True],
        1: ["F", False, False, True, False, False],
        2: ["3", False, True, False, False, False]
    },

……
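A sliding window of this kind (two positions to the left and right of the current character) could be built roughly as follows. The per-character features used here (is digit, is ASCII letter, is non-ASCII) are hypothetical stand-ins, since the exact feature set is not spelled out in this passage, and the result is returned per position rather than keyed by character.

```python
def char_features(ch):
    """Hypothetical per-character features; positions outside the text are None."""
    if ch is None:
        return [None, None, None, None]
    return [ch, ch.isdigit(), ch.isascii() and ch.isalpha(), not ch.isascii()]


def window_features(text, index, width=2):
    """Map each offset in [-width, width] to the features of the character there."""
    feats = {}
    for offset in range(-width, width + 1):
        j = index + offset
        ch = text[j] if 0 <= j < len(text) else None
        feats[offset] = char_features(ch)
    return feats


# Features for the first character of a short flight string.
for offset, feats in window_features("航班F3", 0).items():
    print(offset, feats)
```

Padding out-of-range positions with None keeps every window the same shape, which is what lets the sequence model treat the start and end of a message uniformly.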

The Chinese word segmentation model training module 203 is configured to train the first CRF model using the feature vectors and the label sequence of the word segmentation training corpus to obtain a Chinese word segmentation model, which segments the aviation change information according to the dictionary to obtain word segmentation information. Once the Chinese word segmentation model has been trained, "Shanghai Hongqiao airport" in flight variation information is predicted as a single word rather than being split into three words.

The entity feature construction module 204 is configured to input the aviation variation sample information to the Chinese word segmentation model to obtain an entity training corpus, to label the entity training corpus according to preset flight entity information to obtain a labeling sequence, and to construct a feature vector for the entity training corpus, where the preset flight entity information represents the variation information of flights. As shown in table 1, the preset flight entity information of this embodiment includes an original departure airport, an original departure date, an original departure time, an original arrival airport, an original arrival date, an original arrival time, a protected departure airport, a protected departure date, a protected departure time, a protected arrival airport, a protected arrival date, and a protected arrival time. The specific definition of the entity information can be changed flexibly according to actual requirements.

TABLE 1

The entity analysis model training module 205 is configured to train a second CRF model using the feature vector and the labeling sequence of the entity training corpus to obtain an entity analysis model, where the entity analysis model is configured to analyze the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information.

The entity analysis module 206 is configured to sequentially input the flight variation information into the Chinese word segmentation model and the entity analysis model to obtain the content of the preset flight entity information.

Feature_vec = {
    "13:00": {
        -2: [None, None, None, None, None, None, None, None, None, None, None],
        -1: [None, None, None, None, None, None, None, None, None, None, None],
        0: ["13:00", False, True, False, False, False, False, False, False, False, -1],
        1: ["by", False, True, False, -1],
        2: ["Shanghai Pudong airport", False, True, False, True, -1]
    },
    "by": {
        -2: [None, None],
        -1: ["13:00", False, True, False, False, False, False, False, False, False, -1],
        0: ["by", False, True, False, -1],
        1: ["Shanghai Pudong airport", False, True, False, True, -1],
        2: ["fly to", False, True, False, -1]
    },

……

The entity feature construction module 204 is further configured to label the entity training corpus according to the name composition role table (table 2) to obtain a role labeling sequence, and to train an HMM model using the role labeling sequence to obtain a name recognition model.

TABLE 2 Name composition role table

Tag	Meaning	Example
A	Other, unrelated words	Lovely Zhang three passengers
B	Surname	Lovely Zhang three passengers
C	First character of a two-character given name	Lovely cattle-Dahua passenger
D	Last character of a two-character given name	Lovely cattle-Dahua passenger
E	Single-character given name	Lovely Zhang three passengers
F	Prefix	Laoliu, Xiaowang
G	Suffix	Wangwu Liu Ma
K	Context preceding the name	Lovely cattle-Dahua passenger
L	Context following the name	Lovely cattle-Dahua passenger
M	Separator between two Chinese names	Zhang Sanhe cattle and Dahua passenger

The HMM model is trained on the resulting labeled corpus, and its three parameters are computed: the initial state probability vector π, the state transition probability matrix A, and the observation probability matrix B.

$Q = \{q_1, q_2, \ldots, q_N\}$, $O = \{o_1, o_2, \ldots, o_M\}$

M is the number of observations (words of the aeroderivative information).

A is the state transition probability matrix:

$A = [a_{ij}]_{N \times N}$

B is the observation probability matrix:

$B = [b_j(k)]_{N \times M}$

π is the initial state probability vector:

$\pi = (\pi_i)$

Specifically, these parameters are obtained by traversing the training data (data set size S) in the pre-segmented corpus once.

The observation probability can be obtained by traversing the data set once.

…

And so on.

The entity analysis module 206 is further configured to input the aeroderivative information to the HMM model, where the aeroderivative information includes name information, and the name information in the aeroderivative information is obtained by solving with a Viterbi algorithm.

In summary, the aeroderivative information extraction system constructs a dictionary from a plurality of aviation variation sample information; trains a first CRF model on the aviation variation sample information according to the dictionary to obtain a Chinese word segmentation model; segments the aviation variation sample information with the Chinese word segmentation model to obtain an entity training corpus, and trains a second CRF model on the entity training corpus to obtain an entity analysis model; and sequentially inputs the flight variation information into the Chinese word segmentation model and the entity analysis model to obtain the content of the flight entity information. In addition, an HMM model is trained on the entity training corpus to obtain a name recognition model, and the name recognition model and the Viterbi algorithm are used to recognize names in the flight variation information.

The flight variant information extraction system can automatically recognize and extract flight entity information through the Chinese word segmentation model and the entity analysis model, and can recognize the names in the flight variant information by using the name recognition model and the Viterbi algorithm, so that the flight entity information recognition efficiency is improved.

While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims

1. A method for extracting aeroderivative information is characterized by comprising the following steps:

S2, carrying out word pre-segmentation on the aviation change sample information according to the dictionary to obtain word segmentation training corpora; setting a label sequence for the participle training corpus, and constructing a feature vector for the participle training corpus;

S3, training a first CRF model by using the feature vectors and the label sequences of the word segmentation training corpus to obtain a Chinese word segmentation model, wherein the Chinese word segmentation model is used for segmenting the aviation change information according to the dictionary to obtain word segmentation information;

S6, inputting flight variation information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the preset flight entity information;

step S2 includes:

2. The method for extracting aviation variation information as claimed in claim 1, wherein the entity corpus includes a plurality of second participles, and the step of constructing the feature vector for the entity corpus in step S4 includes:

3. The method for extracting aviation change information as claimed in claim 1, wherein in step S1, before the dictionary is constructed, a traditional Chinese encoding format of some aviation change sample information is converted into a simplified Chinese encoding format.

4. The method for extracting aviation change information as claimed in claim 1, wherein the step S4 further includes:

marking the entity training corpus according to a role table formed by names to obtain a role marking sequence, and training an HMM model by using the role marking sequence to obtain a name recognition model;

step S6 includes:

inputting the aeroderivative information into the HMM model, wherein the aeroderivative information comprises name information, and solving by using a Viterbi algorithm to obtain the name information in the aeroderivative information.

5. An aeroderivative information extraction system, comprising:

the entity analysis module is used for sequentially inputting the flight variation information into the Chinese word segmentation model and the entity analysis model to obtain the content of the preset flight entity information;

the word segmentation feature construction module is further used for carrying out word pre-segmentation on the aviation variation sample information to obtain a pre-segmented corpus; for merging the pre-segmented corpus according to the dictionary to obtain the word segmentation training corpus, the word segmentation training corpus comprising a plurality of first participles; and for setting labels for the plurality of first participles respectively to obtain the label sequence, wherein each first participle comprises a plurality of characters;

6. The aviation variation information extraction system of claim 5, wherein the entity corpus comprises a plurality of second participles, and the entity feature construction module is further configured to respectively construct features for the plurality of second participles, and construct feature vectors of the entity corpus by using the features of the second participles and the dictionary.

7. The aerovariant information extraction system of claim 5, wherein the dictionary construction module is further configured to convert a traditional Chinese coding format of the plurality of aerovariant sample information into a simplified Chinese coding format before constructing the dictionary.

8. The aviation change information extraction system of claim 5, wherein the entity feature construction module is further configured to label the entity training corpus according to a role table formed by names to obtain a role labeling sequence, and train an HMM model by using the role labeling sequence to obtain a name recognition model;

the entity analysis module is further used for inputting aviation variation information into the HMM model, wherein the aviation variation information comprises name information, and the name information in the aviation variation information is obtained through solving by a Viterbi algorithm.