CN108595430B - Aviation transformer information extraction method and system - Google Patents

Publication number
CN108595430B
Authority
CN
China
Prior art keywords
information
entity
model
word segmentation
aviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810385920.8A
Other languages
Chinese (zh)
Other versions
CN108595430A (en
Inventor
汪政
张勇
金丽丽
苏达鼐
曹媛媛
汪庆
陈之彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Travel Network Technology Shanghai Co Ltd
Original Assignee
Ctrip Travel Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Travel Network Technology Shanghai Co Ltd
Priority to CN201810385920.8A
Publication of CN108595430A
Application granted
Publication of CN108595430B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for extracting flight change information. The extraction method comprises the following steps: S1, constructing a dictionary from a plurality of pieces of flight change sample information; S2, pre-segmenting the flight change sample information into words according to the dictionary to obtain a word segmentation training corpus; S3, training a first CRF model with the word segmentation training corpus to obtain a Chinese word segmentation model; S4, inputting the flight change sample information into the Chinese word segmentation model to obtain an entity training corpus; S5, training a second CRF model with the entity training corpus to obtain an entity analysis model; and S6, inputting flight change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the preset flight entity information. The method improves the efficiency of identifying flight entity information. Tests on a plurality of pieces of flight change test sample information show that the Chinese word segmentation model and the entity analysis model trained on flight change sample information extract flight entity information considerably more accurately than a template matching method based on regular expressions.

Description

Flight change information extraction method and system
Technical Field
The invention relates to the field of text information processing, and in particular to a method and a system for extracting flight change information.
Background
After a traveler books an air ticket through an OTA (online travel agency) server, the original flight may have to be adjusted because of weather, air traffic control, maintenance, flight scheduling and the like. Such adjustments include a change of aircraft type or route, or cancellation, advancement, interruption, delay or postponement of the flight. In that case the OTA server usually sends the related flight change information to the traveler's terminal device as a reminder.
Flight change information usually contains entities and entity relationships: passenger names (conventional Chinese names, transliterated foreign names, ethnic-minority names, and the like), flight numbers, airports, dates, times and other information, together with the relationships between the related information before and after the flight change.
In the prior art, flight change information is analyzed by grouping similar messages into classes and manually configuring a regular-expression matching template for each class. This template matching method has the following problems:
the analysis of flight change information is inaccurate, so that part of the valid information is lost;
the flight change message templates change frequently and are difficult to maintain;
messages that do not fit any template must be maintained and analyzed manually, which involves a heavy workload.
The conventional way to address these problems is to apply Chinese word segmentation and named entity recognition from natural language processing. However, many open-source natural language processing frameworks are trained on open-source corpora and cannot be applied well to flight change information analysis in the OTA field. The existing open-source schemes mainly have the following difficulties:
the word segmentation granularity cannot adapt to the application scenario; and named entity recognition cannot distinguish entity relationships, such as which information belongs before and which after a flight change and how these pieces of information relate to one another.
Disclosure of Invention
The invention aims to overcome the low efficiency and low accuracy with which the prior art identifies flight entity information in flight change information, and provides a flight change information extraction method.
The invention solves the technical problems through the following technical scheme:
A method for extracting flight change information comprises the following steps:
S1, constructing a dictionary from a plurality of pieces of flight change sample information;
S2, pre-segmenting the flight change sample information into words according to the dictionary to obtain a word segmentation training corpus; setting a label sequence for the word segmentation training corpus, and constructing a feature vector for the word segmentation training corpus;
S3, training a first CRF (conditional random field) model with the feature vectors and label sequences of the word segmentation training corpus to obtain a Chinese word segmentation model, wherein the Chinese word segmentation model is used for segmenting flight change information according to the dictionary to obtain word segmentation information;
S4, inputting the flight change sample information into the Chinese word segmentation model to obtain an entity training corpus, labeling the entity training corpus according to preset flight entity information to obtain a labeling sequence, and constructing a feature vector for the entity training corpus, wherein the preset flight entity information represents the change information of a flight;
S5, training a second CRF model with the feature vectors and labeling sequences of the entity training corpus to obtain an entity analysis model, wherein the entity analysis model is used for analyzing the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information;
and S6, inputting flight change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the preset flight entity information.
Preferably, the step S2 includes:
S21, pre-segmenting the flight change sample information to obtain a pre-segmentation corpus;
S22, merging the pre-segmentation corpus according to the dictionary to obtain the word segmentation training corpus, wherein the word segmentation training corpus comprises a plurality of first segmented words;
S23, setting a label for each of the first segmented words to obtain the label sequence;
S24, each first segmented word comprising a plurality of characters, constructing features for each character, constructing a feature vector of each first segmented word from the features of its characters and the dictionary, and constructing the feature vector of the word segmentation training corpus from the feature vectors of the first segmented words.
Preferably, the entity training corpus includes a plurality of second segmented words, and the step of constructing a feature vector for the entity training corpus in step S4 includes:
constructing features for each of the second segmented words, and constructing the feature vector of the entity training corpus from the features of the second segmented words and the dictionary.
Preferably, in step S1, before the dictionary is constructed, the traditional Chinese encoding format of the plurality of pieces of flight change sample information is converted into the simplified Chinese encoding format.
Preferably, the step S4 further includes:
labeling the entity training corpus according to a role table of name components to obtain a role labeling sequence, and training an HMM (hidden Markov model) with the role labeling sequence to obtain a name recognition model;
the step S6 includes:
inputting the flight change information, which comprises name information, into the HMM, and solving with the Viterbi algorithm (a dynamic programming algorithm) to obtain the name information in the flight change information.
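For illustration, the Viterbi decoding mentioned above can be sketched as follows. This is a minimal sketch, not the patented implementation: the function name `viterbi`, the dictionary layout of `pi`, `A` and `B`, and the `floor` value used for unseen events are all assumptions introduced here.

```python
import math

def viterbi(observations, states, pi, A, B, floor=1e-12):
    """Return the most probable state (role label) path for the observations.

    Hypothetical layout: pi[q], A[(p, q)] and B[(q, o)] hold probabilities.
    Missing entries are treated as `floor` so that log() is always defined.
    """
    if not observations:
        return []

    def logp(p):
        return math.log(max(p, floor))

    # Scores for the first observation.
    scores = {q: logp(pi.get(q, 0.0)) + logp(B.get((q, observations[0]), 0.0))
              for q in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for q in states:
            # Best predecessor state for q at this position.
            best_prev = max(states, key=lambda p: scores[p] + logp(A.get((p, q), 0.0)))
            new_scores[q] = (scores[best_prev] + logp(A.get((best_prev, q), 0.0))
                             + logp(B.get((q, obs), 0.0)))
            pointers[q] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda q: scores[q])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy model with made-up probabilities.
toy_states = ["A", "B", "E"]
toy_pi = {"A": 0.6, "B": 0.4}
toy_A = {("A", "A"): 0.5, ("A", "B"): 0.5, ("B", "E"): 1.0, ("E", "A"): 1.0}
toy_B = {("A", "x"): 1.0, ("B", "y"): 1.0, ("E", "z"): 1.0}
print(viterbi(["x", "y", "z"], toy_states, toy_pi, toy_A, toy_B))
```

Working in log space avoids numerical underflow on long messages; the floor value is only one possible way to handle unseen transitions or observations.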
A flight change information extraction system comprises:
a dictionary construction module, used for constructing a dictionary from a plurality of pieces of flight change sample information;
a word segmentation feature construction module, used for pre-segmenting the flight change sample information into words according to the dictionary to obtain a word segmentation training corpus, and further used for setting a label sequence for the word segmentation training corpus and constructing a feature vector for the word segmentation training corpus;
a Chinese word segmentation model training module, used for training a first CRF (conditional random field) model with the feature vectors and label sequences of the word segmentation training corpus to obtain a Chinese word segmentation model, wherein the Chinese word segmentation model is used for segmenting flight change information according to the dictionary to obtain word segmentation information;
an entity feature construction module, used for inputting the flight change sample information into the Chinese word segmentation model to obtain an entity training corpus, labeling the entity training corpus according to preset flight entity information to obtain a labeling sequence, and constructing a feature vector for the entity training corpus, wherein the preset flight entity information represents the change information of a flight;
an entity analysis model training module, used for training a second CRF (conditional random field) model with the feature vectors and labeling sequences of the entity training corpus to obtain an entity analysis model, wherein the entity analysis model is used for analyzing the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information;
and an entity analysis module, used for inputting flight change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the preset flight entity information.
Preferably, the word segmentation feature construction module is further used for pre-segmenting the flight change sample information to obtain a pre-segmentation corpus; for merging the pre-segmentation corpus according to the dictionary to obtain the word segmentation training corpus, which comprises a plurality of first segmented words; and for setting a label for each of the first segmented words to obtain the label sequence, wherein each first segmented word comprises a plurality of characters;
the word segmentation feature construction module is further used for constructing features for each character, constructing a feature vector of each first segmented word from the features of its characters and the dictionary, and constructing the feature vector of the word segmentation training corpus from the feature vectors of the first segmented words.
Preferably, the entity training corpus includes a plurality of second segmented words, and the entity feature construction module is further used for constructing features for each of the second segmented words and constructing the feature vector of the entity training corpus from the features of the second segmented words and the dictionary.
Preferably, the dictionary construction module is further used for converting the traditional Chinese encoding format of the plurality of pieces of flight change sample information into the simplified Chinese encoding format before the dictionary is constructed.
Preferably, the entity feature construction module is further used for labeling the entity training corpus according to a role table of name components to obtain a role labeling sequence, and training an HMM with the role labeling sequence to obtain a name recognition model;
the entity analysis module is further used for inputting the flight change information, which comprises name information, into the HMM, and solving with the Viterbi algorithm to obtain the name information in the flight change information.
The positive effects of the invention are as follows:
The flight change information extraction method constructs a dictionary from a plurality of pieces of flight change sample information; trains a first CRF model on the flight change sample information according to the dictionary to obtain a Chinese word segmentation model; segments the flight change sample information with the Chinese word segmentation model to obtain an entity training corpus, and trains a second CRF model on the entity training corpus to obtain an entity analysis model; and inputs flight change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the flight entity information. The method recognizes and extracts flight entity information automatically through the two models, which improves the efficiency of flight entity information recognition. Tests on a plurality of pieces of flight change test sample information show that the models trained on flight change sample information extract flight entity information considerably more accurately than a template matching method based on regular expressions.
Drawings
Fig. 1 is a flowchart of the flight change information extraction method according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of step 102 of the flight change information extraction method according to embodiment 1 of the present invention.
Fig. 3 is a schematic structural diagram of the SEG-CRF features in the flight change information extraction method according to embodiment 1 of the present invention.
Fig. 4 is a schematic structural diagram of the NER-CRF features in the flight change information extraction method according to embodiment 1 of the present invention.
Fig. 5 is a schematic block diagram of the flight change information extraction system according to embodiment 2 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
A flight change is an adjustment of the original flight, including a change of aircraft type or route, or cancellation, advancement, interruption, delay or postponement of the flight, made because of weather, air traffic control, maintenance, flight scheduling and the like. When a flight change occurs, the information to be notified to the traveler is referred to as flight change information.
Example 1
The present embodiment provides a method for extracting flight change information. As shown in fig. 1, the method includes:
Step 101, constructing a dictionary from a plurality of pieces of flight change sample information.
The text in the flight change sample information of this embodiment is preferably Chinese, which includes traditional Chinese and simplified Chinese. Before the dictionary is constructed, the traditional Chinese encoding format of the flight change sample information may be converted into the simplified Chinese encoding format. First, it is determined whether the Chinese text in the original flight change sample information is traditional; if it is, it is converted into simplified Chinese, and otherwise no conversion is performed. Full-width characters are then converted to half-width characters.
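The full-width to half-width step can be sketched as below. This is an illustrative sketch only; the function name `to_half_width` is an assumption, and the traditional-to-simplified conversion is not shown because it needs a character mapping table or a dedicated library.

```python
def to_half_width(text: str) -> str:
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:              # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width '!' through '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)

print(to_half_width("ＭＨ３５０　１３：４０"))
```

Chinese characters fall outside both ranges and pass through unchanged, so the function can be applied to a whole message.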
Word segmentation granularity in the prior art cannot adapt to the application scenario. For flight change information, entities such as airports and dates should be kept as long as possible when segmented. For example, "Shanghai Pudong Airport" should not be split into the three separate words "Shanghai", "Pudong" and "Airport", so "Shanghai Pudong Airport" is entered in the dictionary as a single word.
Step 102, pre-segmenting the flight change sample information into words according to the dictionary to obtain a word segmentation training corpus; setting a label sequence for the word segmentation training corpus, and constructing a feature vector for the word segmentation training corpus.
Preferably, as shown in fig. 2, step 102 specifically includes:
Step 1021, pre-segmenting the flight change sample information to obtain a pre-segmentation corpus.
In this embodiment, the open-source HanLP (a natural language processing toolkit) word segmentation tool is used to pre-segment the flight change sample information and obtain the pre-segmentation corpus. For example, if the flight change sample information contains "Shanghai Hongqiao Airport", pre-segmentation yields the three words "Shanghai", "Hongqiao" and "Airport".
Step 1022, merging the pre-segmentation corpus according to the dictionary to obtain the word segmentation training corpus, which comprises a plurality of first segmented words.
Words such as airports, flight numbers, times and dates are merged and cleaned according to the entries of the dictionary constructed in the previous step. For example, the words "Shanghai", "Hongqiao" and "Airport" are merged into the single word "Shanghai Hongqiao Airport", and the six words "2017", "year", "3", "month", "25" and "day" are merged into one word that represents a specific date.
As in the previous example, the flight change sample information contains "Shanghai Hongqiao Airport"; pre-segmentation yields the three words "Shanghai", "Hongqiao" and "Airport", whereas the desired result is the merged "Shanghai Hongqiao Airport".
In this step the pre-segmentation corpus from the previous step is merged, that is, cleaned. For example, where the three words "Shanghai", "Hongqiao" and "Airport" appear consecutively in the pre-segmentation corpus, they are merged according to the dictionary, so that only "Shanghai Hongqiao Airport" appears and the three separate words do not. The word segmentation training corpus is obtained after this merging.
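The merging step can be sketched as a greedy longest-match pass over the pre-segmented tokens. This is a sketch under stated assumptions: the function name `merge_by_dictionary`, the `max_span` limit, and the greedy strategy itself are illustrative choices, and the source does not specify how the merge is implemented. The Chinese tokens are assumed originals of "Shanghai", "Hongqiao" and "Airport".

```python
def merge_by_dictionary(tokens, dictionary, max_span=6):
    """Greedily merge runs of adjacent tokens that form a dictionary entry."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in dictionary:
                merged.append(candidate)
                i = j
                break
        else:
            # No multi-token entry starts here; keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged

pre_segmented = ["上海", "虹桥", "机场", "航班"]   # Shanghai / Hongqiao / Airport / flight
dictionary = {"上海虹桥机场"}                       # Shanghai Hongqiao Airport
print(merge_by_dictionary(pre_segmented, dictionary))
```

Trying the longest span first keeps entities as long as possible, which matches the granularity requirement stated for airports and dates.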
Step 1023, setting a label for each of the first segmented words to obtain the label sequence.
Step 1024, each first segmented word comprises a plurality of characters; features are constructed for each character, the feature vector of each first segmented word is constructed from the features of its characters and the dictionary, and the feature vector of the word segmentation training corpus is constructed from the feature vectors of the first segmented words.
Each character is labeled with the BMES tag scheme, that is, B (beginning), M (middle), E (end) and S (single-character word), and all the tags together constitute the label sequence. For example, "Shanghai Pudong Airport" (six characters in Chinese) has the tags "BMMMME", and the single-character word "is" has the tag "S".
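A minimal sketch of BMES tagging for already segmented words follows. The function names are illustrative, and the Chinese strings are assumed originals of the examples in the text.

```python
def bmes_tags(word: str) -> str:
    """Return the BMES tag string for one segmented word."""
    if len(word) == 1:
        return "S"                                  # single-character word
    return "B" + "M" * (len(word) - 2) + "E"        # begin, middles, end

def tag_words(words):
    """Concatenate the tags of all words into one label sequence."""
    return "".join(bmes_tags(w) for w in words)

print(bmes_tags("上海浦东机场"))   # six characters
print(tag_words(["航班", "是"]))
```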
The SEG-CRF features take into account the current character together with the 2 characters before it and the 2 characters after it. As shown in FIG. 3, the SEG-CRF features comprise the character itself and information about the character type.
Take the sentence "Flight F3 has flown." as an example (seven characters in the original Chinese, beginning with 航, 班, F, 3, where 航班 means "flight"). Considering the 2 characters before and after a character plus the character itself (5 characters in total), and the 6 attributes per character shown in fig. 3 (the character itself and its character-type flags), the feature of each character is a 5 × 6 matrix. According to the SEG-CRF features, a feature is constructed for each character of the sentence:
Feature_vec = {
"航": {-2: [None, None, None, None, None, None],
-1: [None, None, None, None, None, None],
0: ["航", False, False, False, False, True],
1: ["班", False, False, False, False, True],
2: ["F", False, False, True, False, False]
},
"班": {-2: [None, None, None, None, None, None],
-1: ["航", False, False, False, False, True],
0: ["班", False, False, False, False, True],
1: ["F", False, False, True, False, False],
2: ["3", False, True, False, False, False]
},
……
Take the 2nd character, "班" (the second character of "flight"), as an example. "-2" is the index of the 2nd character before "班". Because only one character, "航", precedes "班", there is no 2nd preceding character, and all of its attributes are set to None.
The 1st character before "班" is "航", indexed by "-1". Its feature is built from the attributes listed in fig. 3: the character itself is "航"; it is not a space (False), not a digit (False), not a letter (False), not a punctuation mark (False), and it is a Chinese character, that is, a character other than a space, digit, letter or punctuation mark (True).
Similarly, 0 is the index of "班" itself, 1 is the index of the 1st character after "班", and 2 is the index of the 2nd character after "班".
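The window construction described above can be sketched as follows. The attribute order (character, is-space, is-digit, is-letter, is-punctuation, is-Chinese) is inferred from the worked example and is an assumption, as are the function names and the use of `unicodedata` to detect punctuation.

```python
import unicodedata

def char_attributes(ch):
    """Six attributes of one character, in the order assumed from the example."""
    is_space = ch.isspace()
    is_digit = ch.isdigit()
    is_letter = ch.isascii() and ch.isalpha()
    is_punct = unicodedata.category(ch).startswith("P")
    # "Chinese" here means: anything that is none of the above.
    is_chinese = not (is_space or is_digit or is_letter or is_punct)
    return [ch, is_space, is_digit, is_letter, is_punct, is_chinese]

def window_features(sentence, window=2):
    """For each position, map offsets -window..window to attribute lists."""
    features = []
    for i in range(len(sentence)):
        row = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(sentence):
                row[offset] = char_attributes(sentence[j])
            else:
                row[offset] = [None] * 6     # no character at this offset
        features.append(row)
    return features

feats = window_features("航班F3")
print(feats[1])   # features for the 2nd character
```

The result is a list indexed by position rather than a dictionary keyed by character, so that a character occurring twice in a sentence keeps two separate feature rows.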
Again taking "Flight F3 has flown." as an example: the sentence has 7 characters, each character uses the features of 5 characters (the 2 before it, the 2 after it, and itself), and each of those has 6 attributes, so the feature matrix of the sentence has size 7 × 5 × 6.
The constructed features are then converted into feature vectors: each character, digit or symbol is replaced by its index (word_index) in the dictionary, False becomes 0, True becomes 1, and None becomes -1. For "Flight F3 has flown.", with the 2 characters before and after each character plus the character itself (5 characters in total) and the 6 attributes shown in fig. 3, the encoded feature of each character is a 5 × 6 matrix.
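The conversion to numbers can be sketched as below, assuming a window given as a mapping from offset to a six-element attribute list. The `word_index` mapping and both function names are hypothetical, and the index values are made up for the example.

```python
def encode_value(value, word_index):
    """None -> -1, False/True -> 0/1, any other value -> its dictionary index."""
    if value is None:
        return -1
    if isinstance(value, bool):
        return int(value)
    return word_index[value]

def encode_window(window, word_index):
    """Encode one character's window as a matrix ordered by offset."""
    return [[encode_value(v, word_index) for v in window[offset]]
            for offset in sorted(window)]

# Hypothetical dictionary indices for the characters in the example.
word_index = {"航": 0, "班": 1, "F": 2, "3": 3}
window = {
    -2: [None] * 6,
    -1: ["航", False, False, False, False, True],
    0: ["班", False, False, False, False, True],
    1: ["F", False, False, True, False, False],
    2: ["3", False, True, False, False, False],
}
matrix = encode_window(window, word_index)
print(matrix)
```

A character missing from `word_index` raises `KeyError` in this sketch; how unknown characters are handled is not stated in the source.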
Step 103, training the first CRF model with the feature vectors and label sequences of the word segmentation training corpus to obtain a Chinese word segmentation model, wherein the Chinese word segmentation model is used for segmenting flight change information according to the dictionary to obtain word segmentation information. Once the Chinese word segmentation model has been trained, it predicts "Shanghai Hongqiao Airport" in flight change information as a single word rather than splitting it into three words.
Step 104, inputting the flight change sample information into the Chinese word segmentation model to obtain an entity training corpus, labeling the entity training corpus according to preset flight entity information to obtain a labeling sequence, and constructing a feature vector for the entity training corpus, wherein the preset flight entity information represents the change information of a flight. As shown in table 1, the preset flight entity information of this embodiment includes an original departure airport, an original departure date, an original departure time, an original arrival airport, an original arrival date, an original arrival time, a protected departure airport, a protected departure date, a protected departure time, a protected arrival airport, a protected arrival date, and a protected arrival time. The specific definition of the entity information can be changed flexibly according to actual requirements.
TABLE 1
[Table 1 appears as image BDA0001642202890000091 in the original publication.]
For example, the flight change sample information contains the message "13:40 MH350 flying from Shanghai to Beijing has been adjusted to 14:50", sent for weather reasons. After word segmentation the words are separated by spaces, and the segmented message is labeled according to table 1 with the labeling sequence "(ODT, DEF, ODP, DEF, OAP, DEF, OF, DEF, PDT)".
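Once each token carries a label, the entity content can be collected by pairing tokens with labels. The sketch below assumes that "DEF" marks tokens that carry no entity; the function name and the English token glosses are illustrative stand-ins for the segmented Chinese message.

```python
def collect_entities(tokens, labels, default="DEF"):
    """Group tokens by entity label, skipping the default label."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    entities = {}
    for token, label in zip(tokens, labels):
        if label != default:
            entities.setdefault(label, []).append(token)
    return entities

# Illustrative English glosses of the nine segmented tokens.
tokens = ["13:40", "from", "Shanghai", "to", "Beijing", "flight", "MH350",
          "adjusted to", "14:50"]
labels = ["ODT", "DEF", "ODP", "DEF", "OAP", "DEF", "OF", "DEF", "PDT"]
print(collect_entities(tokens, labels))
```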
The entity training corpus comprises a plurality of second segmented words. Preferably, the step of constructing the feature vector for the entity training corpus specifically comprises: constructing features for each of the second segmented words, and constructing the feature vector of the entity training corpus from the features of the second segmented words and the dictionary.
Step 105, training a second CRF model with the feature vectors and labeling sequences of the entity training corpus to obtain an entity analysis model, wherein the entity analysis model is used for analyzing the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information.
Take the entity training corpus entry "13:00 flying from Shanghai Pudong Airport to Beijing" as an example. It comprises a plurality of second segmented words separated by spaces. Features are constructed for all the second segmented words according to the NER-CRF feature attributes shown in fig. 4, which include the word itself, the word type, the default part of speech, the relative positions of keywords, and other attributes. The constructed features are as follows:
Feature_vec = {
"13:00": {-2: [None, None, None, None, None, None, None, None, None, None, None],
-1: [None, None, None, None, None, None, None, None, None, None, None],
0: ["13:00", False, True, False, False, False, False, False, False, False, -1],
1: [ "from", False, True, False, -1],
2: [ "Shanghai Pudong Airport", False, True, False, True, -1]
},
"from": {-2: [None, None, None, None, None, None, None, None, None, None, None],
-1: ["13:00", False, True, False, False, False, False, False, False, False, -1],
0: [ "from", False, True, False, -1],
1: [ "Shanghai Pudong Airport", False, True, False, True, -1],
2: [ "fly to", False, True, False, -1]
},
……
And step 106, sequentially inputting the flight variation information into the Chinese word segmentation model and the entity analysis model to obtain the content of the preset flight entity information.
Step 104 further comprises:
labeling the entity training corpus according to a role table of name components to obtain a role labeling sequence, and training an HMM with the role labeling sequence to obtain a name recognition model.
The characters used in personal names, and the words used around them, are concentrated and highly regular. The range of context words is limited: the word before a name is usually a form of address, a verb or a conjunction, such as "respected", and the word after a name is usually "traveler", "passenger" or "customer".
Here all the words in a sentence are divided into internal components of a name, preceding context, following context, unrelated words and so on, which are called the component roles of a name. The specific role classification is shown in table 2, the name component role table:
TABLE 2 name constitution role table
[Table 2 appears as images BDA0001642202890000101 and BDA0001642202890000111 in the original publication.]
Data from the flight change sample information are pre-segmented to obtain a pre-segmentation corpus in which names are marked, for example "Dear <name>Zhang San</name> and <name>Niu Erhua</name> and other passengers" (亲爱的<name>张三</name>和<name>牛二花</name>等旅客). The pre-segmentation corpus is then segmented by the Chinese word segmentation model and role-labeled.
After segmentation by the Chinese word segmentation model, with "/" marking the word boundaries, the sentence becomes "亲/爱/的/张/三/和/牛/二/花/等/旅/客". Because names are already marked in the pre-segmentation corpus, it is known that "张三" (Zhang San) and "牛二花" (Niu Erhua) are names and the other words are not. Labeling according to table 2 yields a role-labeled corpus in which "|" separates each word from its role label and a space separates adjacent words: "亲|A 爱|A 的|K 张|B 三|E 和|M 牛|B 二|C 花|D 等|L 旅|A 客|A".
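A role-labeled line in this "word|label" format can be parsed and the names read back from the labels. The sketch below uses the Chinese characters that the glosses in the example appear to correspond to, and it assumes, as the worked example suggests, that B starts a name, C continues it, and D or E ends it. Table 2 is not reproduced in the text, so these role meanings and both function names are assumptions.

```python
def parse_labeled(line):
    """Split 'word|label word|label ...' into (word, label) pairs."""
    pairs = []
    for item in line.split():
        word, label = item.rsplit("|", 1)
        pairs.append((word, label))
    return pairs

def extract_names(pairs):
    """Collect names from role labels (assumed: B starts, C continues, D/E ends)."""
    names, current = [], ""
    for word, label in pairs:
        if label == "B":
            current = word                   # start a new name
        elif label == "C" and current:
            current += word                  # continue the name
        elif label in ("D", "E") and current:
            names.append(current + word)     # close the name
            current = ""
        else:
            current = ""                     # any other role resets the state
    return names

line = "亲|A 爱|A 的|K 张|B 三|E 和|M 牛|B 二|C 花|D 等|L 旅|A 客|A"
pairs = parse_labeled(line)
print(extract_names(pairs))
```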
The labeled corpus is used to train the HMM, that is, to calculate its 3 parameters: the initial state probability vector π, the state transition probability matrix A, and the observation probability matrix B.
Let Q be the set of all possible states (the set of role labels) and V the set of all possible observations (the set of words in the flight change information):
$Q = \{q_1, q_2, \ldots, q_N\}, \quad V = \{v_1, v_2, \ldots, v_M\}$
N is the number of states (role labels). Here N = 10, that is, as shown in table 2, $Q = \{A, B, C, D, E, F, G, K, L, M\}$.
M is the number of observations (words of the flight change information).
A is the state transition probability matrix:
$A = [a_{ij}]_{N \times N}$
where the state transition probability $a_{ij} = P(i_{t+1} = q_j \mid i_t = q_i)$ is the probability of moving from label $q_i$ on the t-th word to label $q_j$ on the (t+1)-th word.
B is the observation probability matrix:
$B = [b_j(k)]_{N \times M}$
where the observation probability $b_j(k) = P(o_t = v_k \mid i_t = q_j)$ relates the t-th word $v_k$ to the label $q_j$.
π is the initial state probability vector:
$\pi = (\pi_i)$
where $\pi_i = P(i_1 = q_i)$ is the probability that label $q_i$ appears at the 1st position.
All three parameters are obtained by traversing the training data in the pre-segmented corpus (data set size S) once.
Take the labeled sentence "qin|A ai|A de|K Zhang|B San|E he|M Niu|B Er|C Hua|D deng|L lü|A ke|A" (each pinyin syllable stands for one Chinese character): "qin" is the 1st character and its label is A. To compute the initial state probability of label A, count the number of times the first character in the data set has state A, denoted #A, and divide #A by the data set size S. The initial probabilities of the other labels are obtained in the same way.
The state transition probabilities are also obtained by traversing the data set. For example, "ai" has label A and the following character "de" has label K. To compute the probability of moving from label A to label K, count the number of occurrences of A followed by K, denoted #AK, and the total number of occurrences of label A, denoted #A; the state transition probability from A to K is #AK divided by #A. The other state transition probabilities are obtained by analogy.
The observation probabilities are likewise obtained in one traversal of the data set.
For example, consider "qin|A ai|A de|K Zhang|B Ai|E he|M Niu|B Er|C Hua|D deng|L lü|A ke|A", in which the character "ai" occurs once with label A and once, as a given name, with label E.
For each character, count its total number of occurrences in the data set, e.g. #ai for "ai", and the number of occurrences in which it carries label A, #aiA; dividing #aiA by #ai gives the observation probability of "ai" with label A.
Similarly, if #aiE is the number of occurrences of "ai" with label E, then #aiE divided by #ai gives the observation probability of "ai" with label E.
The remaining observation probabilities are computed in the same way.
In summary, a single traversal of the training data set yields the parameters of the HMM model: the initial state probability vector π, the state transition probability matrix A, and the observation probability matrix B.
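The counting procedure above can be sketched in Python as follows. This is a minimal sketch, not the implementation of this embodiment: the function name `train_hmm` and the dictionary-based parameter layout are assumptions, and the data set size S is taken to be the number of labeled sentences. The observation probability is normalized by the character count, as the text describes; a textbook HMM emission estimate would divide by the label count instead.

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """Estimate (pi, A, B) in one pass over sentences of (character, label) pairs."""
    first, labels, trans = Counter(), Counter(), Counter()
    chars, char_label = Counter(), Counter()
    size = 0
    for sentence in tagged_sentences:
        if not sentence:
            continue
        size += 1
        first[sentence[0][1]] += 1                 # label of the first character
        for ch, lab in sentence:
            labels[lab] += 1
            chars[ch] += 1
            char_label[(ch, lab)] += 1
        for (_, cur), (_, nxt) in zip(sentence, sentence[1:]):
            trans[(cur, nxt)] += 1                 # label bigram cur -> nxt
    if size == 0:
        raise ValueError("no training sentences")
    pi = {lab: n / size for lab, n in first.items()}
    A = {(cur, nxt): n / labels[cur] for (cur, nxt), n in trans.items()}
    # Normalized by the character count, following the description above.
    B = {(ch, lab): n / chars[ch] for (ch, lab), n in char_label.items()}
    return pi, A, B

text = "qin|A ai|A de|K zhang|B ai|E he|M niu|B er|C hua|D deng|L lü|A ke|A"
sentence = [tuple(token.split("|")) for token in text.split()]
pi, A, B = train_hmm([sentence])
print(pi["A"], A[("A", "K")], B[("ai", "A")], B[("ai", "E")])
```

With this single training sentence, label A occurs four times and is followed by K once, and "ai" occurs twice, once with each of the labels A and E.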
Correspondingly, step 106 further includes:
inputting the aviation change information, which contains name information, into the HMM model, and solving with the Viterbi algorithm to obtain the name information in the aviation change information.
For example, for the aviation change information "Dear passengers such as Zhang Wu and Niu Sanhua", the optimal label sequence obtained with the Viterbi algorithm is "qin|A ai|A de|K Zhang|B Wu|E he|M Niu|B San|C Hua|D deng|L lü|A ke|A" (each pinyin syllable stands for one Chinese character). Names are identified from labels B and E of Table 2: B marks the start of a name, E marks its tail, and anything between B and E is part of the name.
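The decoding step can be illustrated with a short Python sketch of the Viterbi algorithm. It is a sketch under assumptions rather than the implementation of this embodiment: the function name `viterbi`, the dictionary layout of π, A and B, and the small probability floor used for unseen events are all hypothetical, and the toy parameters below are hand-written for illustration rather than trained.

```python
import math

def viterbi(observations, states, pi, A, B, floor=1e-12):
    """Return the most probable label sequence for the observed characters."""
    if not observations:
        return []

    def log(p):
        return math.log(max(p, floor))   # avoid log(0) for unseen events

    score = {s: log(pi.get(s, 0.0)) + log(B.get((observations[0], s), 0.0))
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best previous label for reaching label s at this position.
            prev = max(states, key=lambda p: score[p] + log(A.get((p, s), 0.0)))
            new_score[s] = (score[prev] + log(A.get((prev, s), 0.0))
                            + log(B.get((obs, s), 0.0)))
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):   # trace the best path backwards
        path.append(pointer[path[-1]])
    return path[::-1]

# Toy parameters (illustrative only).
states = ["K", "B", "E", "L"]
pi = {"K": 1.0}
A = {("K", "B"): 1.0, ("B", "E"): 1.0, ("E", "L"): 1.0}
B = {("de", "K"): 1.0, ("zhang", "B"): 1.0, ("wu", "E"): 0.6,
     ("wu", "K"): 0.4, ("deng", "L"): 1.0}
print(viterbi(["de", "zhang", "wu", "deng"], states, pi, A, B))
```

Working in log space keeps long sentences from underflowing; the floor value is a design choice of this sketch, not something the description specifies.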
The aviation change information extraction method of this embodiment constructs a dictionary from a plurality of pieces of aviation change sample information; trains a first CRF model on the aviation change sample information according to the dictionary to obtain a Chinese word segmentation model; segments the aviation change sample information with the Chinese word segmentation model to obtain an entity training corpus, and trains a second CRF model on the entity training corpus to obtain an entity analysis model; and inputs the aviation change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the flight entity information. In addition, an HMM model is trained on the entity training corpus to obtain a name recognition model, and the name recognition model together with the Viterbi algorithm is used to recognize names in the aviation change information.
The method can thus recognize and extract flight entity information automatically through the Chinese word segmentation model and the entity analysis model, and can recognize names in the aviation change information with the name recognition model and the Viterbi algorithm, which improves the efficiency of flight entity information recognition. Tests on a plurality of aviation change test samples show that the name recognition model, the Chinese word segmentation model and the entity analysis model trained on aviation change sample information extract flight entity information with considerably higher accuracy than a template matching method based on regular expressions.
Example 2
This embodiment provides an aviation change information extraction system. As shown in FIG. 5, the system includes a dictionary construction module 201, a word segmentation feature construction module 202, a Chinese word segmentation model training module 203, an entity feature construction module 204, an entity analysis model training module 205, and an entity analysis module 206.
The dictionary construction module 201 is used to construct a dictionary from a plurality of pieces of aviation change sample information.
The text in the aviation change sample information of this embodiment is preferably Chinese, which includes traditional Chinese and simplified Chinese. Before the dictionary is constructed, the traditional Chinese encoding format of the aviation change sample information can be converted into the simplified Chinese encoding format. First, it is determined whether the Chinese text in the original aviation change sample information is traditional; if it is, it is converted into simplified Chinese, and otherwise no conversion is performed. Full-width characters are then converted to half-width characters.
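The full-width to half-width step can be sketched in Python as follows. This is a minimal sketch, assuming the usual Unicode mapping: the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII "!" to "~", and U+3000 is the full-width space. The function name `to_half_width` is hypothetical, and the traditional-to-simplified conversion is not shown because it requires a character mapping table.

```python
def to_half_width(text):
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                    # full-width (ideographic) space
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:        # full-width '!' .. '~'
            converted.append(chr(code - 0xFEE0))
        else:
            converted.append(ch)              # everything else is left unchanged
    return "".join(converted)

# Full-width "MH350 13:40" written with Unicode escapes.
print(to_half_width("\uff2d\uff28\uff13\uff15\uff10\u3000\uff11\uff13\uff1a\uff14\uff10"))
```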
In the prior art, word segmentation granularity cannot adapt to the application scenario. Parsing aviation change information requires entities such as airports and dates to be segmented into units that are as long as possible. For example, "Shanghai Pudong Airport" must not be split into the three separate words "Shanghai", "Pudong" and "Airport"; when the dictionary is constructed, "Shanghai Pudong Airport" is therefore entered in the dictionary as a single word.
The word segmentation feature construction module 202 is configured to pre-segment the aviation change sample information according to the dictionary to obtain a word segmentation training corpus, and is further configured to set a label sequence for the word segmentation training corpus and construct a feature vector for it.
Preferably, the word segmentation feature construction module 202 is further configured to: pre-segment the aviation change sample information to obtain a pre-segmented corpus; merge the pre-segmented corpus according to the dictionary to obtain the word segmentation training corpus, which includes a plurality of first segmented words; and set labels for the first segmented words to obtain the label sequence, each first segmented word including a plurality of characters. The word segmentation feature construction module 202 is further configured to construct a feature for each character, construct the feature vector of each first segmented word from the features of its characters and the dictionary, and construct the feature vector of the word segmentation training corpus from the feature vectors of the first segmented words.
In this embodiment, the open-source HanLP (a natural language processing package) word segmentation tool is used to pre-segment the aviation change sample information to obtain the pre-segmented corpus. For example, if the aviation change sample information contains "Shanghai Hongqiao Airport", pre-segmentation yields the three words "Shanghai", "Hongqiao" and "Airport".
According to the entries of the dictionary constructed previously, words denoting airports, flight numbers, times, dates and the like are merged (cleaned): "Shanghai", "Hongqiao" and "Airport" are merged into the single word "Shanghai Hongqiao Airport", and the six words "2017", "year", "3", "month", "25" and "day" are merged into one word representing a specific date.
In other words, pre-segmentation alone produces three words where the desired result is the merged "Shanghai Hongqiao Airport".
Whenever "Shanghai", "Hongqiao" and "Airport" appear consecutively in the pre-segmented corpus, they are merged according to the dictionary, so that only "Shanghai Hongqiao Airport" appears and the three separate words do not. The corpus obtained after this processing is the word segmentation training corpus.
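The merging step can be sketched as a greedy longest-match pass over the pre-segmented words. This is an illustrative sketch only: the function name `merge_by_dictionary` and the `max_span` limit are hypothetical, the example strings are written in Chinese as an assumed reconstruction of the examples above, and the description does not state which matching strategy is actually used.

```python
def merge_by_dictionary(tokens, dictionary, max_span=6):
    """Merge runs of adjacent tokens whose concatenation is a dictionary entry."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that the longest entity wins.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in dictionary:
                merged.append(candidate)
                i += span
                break
        else:
            merged.append(tokens[i])          # no merge possible at this position
            i += 1
    return merged

dictionary = {"上海虹桥机场", "2017年3月25日"}    # Shanghai Hongqiao Airport; a date
tokens = ["上海", "虹桥", "机场", "2017", "年", "3", "月", "25", "日"]
print(merge_by_dictionary(tokens, dictionary))
```

In practice a date would more likely be matched by a pattern than by a literal dictionary entry; the literal entry is used here only to keep the sketch short.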
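Each character is labeled using the BMES tag scheme, i.e., B (beginning of a word), M (middle of a word), E (end of a word) and S (single-character word), and the labels of all characters form the label sequence. For example, a six-character word such as the Chinese for "Shanghai Pudong Airport" is labeled "BMMMME", and a single-character word is labeled "S".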
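The BMES scheme can be illustrated with a few lines of Python. The function name `bmes_tags` is hypothetical, and the example words are written in Chinese characters as an assumption about the original example.

```python
def bmes_tags(words):
    """Return one BMES tag string per segmented word."""
    tags = []
    for word in words:
        if not word:
            continue                          # ignore empty tokens
        if len(word) == 1:
            tags.append("S")                  # single-character word
        else:
            tags.append("B" + "M" * (len(word) - 2) + "E")
    return tags

# Six-character "Shanghai Pudong Airport" followed by a one-character word.
print(bmes_tags(["上海浦东机场", "的"]))
```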
The SEG-CRF features take into account the two characters before and after the current character as well as the character itself. As shown in FIG. 3, the SEG-CRF features comprise the character itself and information about the character type.
Take "Flight F3 has flown." as an example, whose Chinese original consists of 7 characters including the final punctuation mark; its first four characters are written below as "hang", "ban" (two Chinese characters in pinyin, together meaning "flight"), "F" and "3". Considering the two characters before and after each character and the character itself (5 characters in total), and the 6 attributes per character shown in FIG. 3 (the character itself and its character type), the feature of each character is a 5 × 6 matrix. According to the SEG-CRF features, a feature is constructed for each character of the sentence:
Feature_vec={
"hang":{-2:[None,None,None,None,None,None],
-1:[None,None,None,None,None,None],
0:["hang",False,False,False,False,True],
1:["ban",False,False,False,False,True],
2:["F",False,False,True,False,False]
},
"ban":{-2:[None,None,None,None,None,None],
-1:["hang",False,False,False,False,True],
0:["ban",False,False,False,False,True],
1:["F",False,False,True,False,False],
2:["3",False,True,False,False,False]
},
……
}
Take the 2nd character, "ban", as an example. Index -2 refers to the 2nd character before "ban". The 1st character before "ban" is "hang", and there is no 2nd character before it, so every attribute at index -2 is set to None.
Index -1 refers to the 1st character before "ban", i.e., "hang", whose feature is constructed from the attributes listed in FIG. 3: the character itself is "hang"; it is not a space (False), not a digit (False), not a letter (False), not a punctuation mark (False), and it is a Chinese character, i.e., a character other than a space, digit, letter or punctuation mark (True).
Similarly, 0 is the index of "ban" itself, 1 is the index of the 1st character after "ban", and 2 is the index of the 2nd character after "ban".
Since the sentence has 7 characters, each character carries the features of 5 characters (the two before, the two after, and itself), and each of these has 6 attributes, the feature matrix of the whole sentence has size 7 × 5 × 6.
The constructed features are then converted into feature vectors: each character, digit or symbol is replaced by its index in the dictionary (word_index), False becomes 0, True becomes 1, and None becomes -1. The feature of each character remains a 5 × 6 matrix.
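The window-based feature construction and the numeric conversion can be sketched in Python as follows. This is a sketch under assumptions: the function names, the punctuation list, and the Chinese rendering of the example sentence are hypothetical, and the attribute order (character, is space, is digit, is letter, is punctuation, is Chinese character) follows the walkthrough above. Rows are kept in a list rather than in a dictionary keyed by character, so that a character occurring twice in a sentence does not overwrite its earlier entry.

```python
import string

OFFSETS = (-2, -1, 0, 1, 2)
CJK_PUNCTUATION = "。,、;:?!“”()"

def char_attributes(ch):
    """Return the 6 attributes of one character."""
    is_space = ch.isspace()
    is_digit = ch.isdigit()
    is_letter = ch.isascii() and ch.isalpha()   # ASCII letters only
    is_punct = ch in string.punctuation or ch in CJK_PUNCTUATION
    is_chinese = not (is_space or is_digit or is_letter or is_punct)
    return [ch, is_space, is_digit, is_letter, is_punct, is_chinese]

def window_features(sentence):
    """Return one 5 x 6 feature matrix per character of the sentence."""
    rows = []
    for i in range(len(sentence)):
        row = []
        for offset in OFFSETS:
            j = i + offset
            inside = 0 <= j < len(sentence)
            row.append(char_attributes(sentence[j]) if inside else [None] * 6)
        rows.append(row)
    return rows

def encode(rows, char_index):
    """Map characters to indices, False/True to 0/1 and None to -1."""
    def enc(value):
        if value is None:
            return -1
        if isinstance(value, bool):
            return int(value)
        return char_index[value]
    return [[[enc(v) for v in attrs] for attrs in row] for row in rows]

sentence = "航班F3已飞。"                       # assumed 7-character original
rows = window_features(sentence)
char_index = {ch: i for i, ch in enumerate(sorted(set(sentence)))}
print(rows[1][1], encode(rows, char_index)[1][0])
```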
The Chinese word segmentation model training module 203 is configured to train the first CRF model with the feature vectors and label sequence of the word segmentation training corpus to obtain the Chinese word segmentation model, which segments aviation change information according to the dictionary to obtain word segmentation information. Once trained, the Chinese word segmentation model predicts "Shanghai Hongqiao Airport" as a single word rather than splitting it into three words when applied to aviation change information.
The entity feature construction module 204 is configured to input the aviation change sample information into the Chinese word segmentation model to obtain an entity training corpus, label the entity training corpus according to preset flight entity information to obtain a labeling sequence, and construct a feature vector for the entity training corpus, where the preset flight entity information represents the change information of a flight. As shown in Table 1, the preset flight entity information of this embodiment includes the original departure airport, original departure date, original departure time, original arrival airport, original arrival date, original arrival time, protected departure airport, protected departure date, protected departure time, protected arrival airport, protected arrival date, and protected arrival time. The specific definition of the entity information can be changed flexibly according to actual requirements.
TABLE 1
[Table 1 is reproduced as an image in the original publication: Figure BDA0001642202890000171]
For example, suppose the aviation change sample information contains, for weather reasons, the content "the 13:40 flight MH350 from Shanghai to Beijing has been adjusted to 14:50". After word segmentation, with the words separated by spaces and shown here in the Chinese word order, the content reads "13:40 from Shanghai flying-to Beijing of MH350 adjusted-to 14:50", and according to Table 1 it is labeled with the labeling sequence "(ODT, DEF, ODP, DEF, OAP, DEF, OF, DEF, PDT)".
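To show how such a labeling sequence is read, the following Python sketch pairs each segmented word with its label and keeps the words whose label is not DEF. It is an illustration only: the function name `collect_entities` is hypothetical, the alignment of the nine English tokens with the nine labels is an assumption, and the labels are left unexpanded because Table 1 is available only as an image.

```python
def collect_entities(tokens, labels, default_label="DEF"):
    """Group segmented words by their entity label, skipping default labels."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    entities = {}
    for token, label in zip(tokens, labels):
        if label != default_label:
            entities.setdefault(label, []).append(token)
    return entities

tokens = ["13:40", "from", "Shanghai", "flying-to", "Beijing",
          "of", "MH350", "adjusted-to", "14:50"]
labels = ["ODT", "DEF", "ODP", "DEF", "OAP", "DEF", "OF", "DEF", "PDT"]
print(collect_entities(tokens, labels))
```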
The entity training corpus includes a plurality of second segmented words. Preferably, constructing the feature vector for the entity training corpus specifically includes: constructing features for each of the second segmented words, and constructing the feature vector of the entity training corpus from the features of the second segmented words and the dictionary.
The entity analysis model training module 205 is configured to train the second CRF model with the feature vector and labeling sequence of the entity training corpus to obtain the entity analysis model, which parses the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information.
The entity analysis module 206 is configured to input the aviation change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the preset flight entity information.
Take the entity training corpus "13:00 flying from Shanghai Pudong airport to Beijing" as an example. It consists of a plurality of second segmented words separated by spaces. Features are constructed for all second segmented words according to the NER-CRF feature attributes shown in FIG. 4, which include the word itself, the word type, the default part of speech, the position relative to keywords, and other attributes. The constructed features are as follows:
Feature_vec={
“13:00”:{-2:[None,None,None,None,None,None,None,None,None,None,None],
-1:[None,None,None,None,None,None,None,None,None,None,None],
0:[“13:00”,False,True,False,False,False,False,False,False,False,-1],
1: [ "by", False, True, False, -1],
2: [ "Shanghai Pudong airport", False, True, False, True, -1]
}
"by":{-2:[None,None,None,None,None,None,None,None,None,None,None],
-1:[“13:00”,False,True,False,False,False,False,False,False,False,-1],
0: [ "by", False, True, False, -1],
1: [ "Shanghai Pudong airport", False, True, False, True, -1],
2: [ "fly to", False, True, False, -1]
}
……
The entity feature construction module 204 is further configured to label the entity training corpus according to a role table of name components to obtain a role labeling sequence, and to train an HMM model with the role labeling sequence to obtain a name recognition model.
The words surrounding a person's name are concentrated and highly regular, and the range of words used in the context of a name is limited: the preceding context is generally a form of address, an interjection or a conjunction, such as "respected", and the following context is generally a word such as "passenger", "traveler" or "customer".
Accordingly, all characters in a sentence are classified as internal components of a name, preceding context, following context, unrelated characters, and so on. These classes are called the constituent roles of a name, and the specific role classification is shown in Table 2, the name constituent role table:
TABLE 2 Name constituent role table
Label | Meaning | Example
A | Other, unrelated character | Dear passenger Zhang San
B | Surname | Dear passenger Zhang San
C | First character of a two-character given name | Dear passenger Niu Dahua
D | Last character of a two-character given name | Dear passenger Niu Dahua
E | Single-character given name | Dear passenger Zhang San
F | Prefix | Lao Liu, Xiao Wang
G | Suffix | Wangwu Liu Ma
K | Context preceding the name | Dear passenger Niu Dahua
L | Context following the name | Dear passenger Niu Dahua
M | Character separating two names | Passengers Zhang San and Niu Dahua
Pre-segmentation is performed on the data in the aviation change sample information to obtain a pre-segmented corpus, and the pre-segmented corpus is then segmented by the Chinese word segmentation model and labeled with roles. An example of the pre-segmented corpus is "Dear passengers such as <name>Zhang San</name> and <name>Niu Erhua</name>", in which the names are already marked.
After segmentation by the Chinese word segmentation model, the sentence is split into single characters (separated here by "/", with each Chinese character written in pinyin): "qin/ai/de/Zhang/San/he/Niu/Er/Hua/deng/lü/ke". Because the names in the pre-segmented corpus are marked, it is known that "Zhang San" and "Niu Erhua" are names and that the remaining characters are not, so role labels can be assigned according to Table 2. Writing each character and its role label separated by "|", with a space between characters, the labeled corpus is "qin|A ai|A de|K Zhang|B San|E he|M Niu|B Er|C Hua|D deng|L lü|A ke|A".
The HMM model is trained on the labeled corpus, and its three parameters are computed: the initial state probability vector π, the state transition probability matrix A, and the observation probability matrix B.
Let Q be the set of all possible states (i.e., the set of role labels) and V be the set of all possible observations (i.e., the set of characters in the aviation change information):
$Q=\{q_1,q_2,\dots,q_N\},\quad V=\{v_1,v_2,\dots,v_M\}$
N is the number of states (the number of role labels); here N = 10 and, as shown in Table 2, Q = {A, B, C, D, E, F, G, K, L, M}.
M is the number of observations (the number of distinct characters in the aviation change information).
A is the state transition probability matrix:
$A=[a_{ij}]_{N\times N}$
where the state transition probability $a_{ij}=P(i_{t+1}=q_j \mid i_t=q_i)$ is the probability of moving from label $q_i$ at the t-th character to label $q_j$ at the (t+1)-th character.
B is the observation probability matrix:
$B=[b_j(k)]_{N\times M}$
where the observation probability $b_j(k)=P(o_t=v_k \mid i_t=q_j)$ relates the t-th character being $v_k$ to its label being $q_j$.
π is the initial state probability vector:
$\pi=(\pi_i)$
where $\pi_i=P(i_1=q_i)$ is the probability that label $q_i$ appears at the 1st position.
All three parameters are obtained by traversing the training data in the pre-segmented corpus (data set size S) once.
Take the labeled sentence "qin|A ai|A de|K Zhang|B San|E he|M Niu|B Er|C Hua|D deng|L lü|A ke|A" (each pinyin syllable stands for one Chinese character): "qin" is the 1st character and its label is A. To compute the initial state probability of label A, count the number of times the first character in the data set has state A, denoted #A, and divide #A by the data set size S. The initial probabilities of the other labels are obtained in the same way.
The state transition probabilities are also obtained by traversing the data set. For example, "ai" has label A and the following character "de" has label K. To compute the probability of moving from label A to label K, count the number of occurrences of A followed by K, denoted #AK, and the total number of occurrences of label A, denoted #A; the state transition probability from A to K is #AK divided by #A. The other state transition probabilities are obtained by analogy.
The observation probabilities are likewise obtained in one traversal of the data set.
For example, consider "qin|A ai|A de|K Zhang|B Ai|E he|M Niu|B Er|C Hua|D deng|L lü|A ke|A", in which the character "ai" occurs once with label A and once, as a given name, with label E.
For each character, count its total number of occurrences in the data set, e.g. #ai for "ai", and the number of occurrences in which it carries label A, #aiA; dividing #aiA by #ai gives the observation probability of "ai" with label A.
Similarly, if #aiE is the number of occurrences of "ai" with label E, then #aiE divided by #ai gives the observation probability of "ai" with label E.
The remaining observation probabilities are computed in the same way.
In summary, a single traversal of the training data set yields the parameters of the HMM model: the initial state probability vector π, the state transition probability matrix A, and the observation probability matrix B.
The entity analysis module 206 is further configured to input the aviation change information, which contains name information, into the HMM model, and to obtain the name information in the aviation change information by solving with the Viterbi algorithm.
For example, for the aviation change information "Dear passengers such as Zhang Wu and Niu Sanhua", the optimal label sequence obtained with the Viterbi algorithm is "qin|A ai|A de|K Zhang|B Wu|E he|M Niu|B San|C Hua|D deng|L lü|A ke|A" (each pinyin syllable stands for one Chinese character). Names are identified from labels B and E of Table 2: B marks the start of a name, E marks its tail, and anything between B and E is part of the name.
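Once the label sequence has been decoded, reading the names out of it can be sketched in Python as follows. The function name `extract_names` and the `joiner` parameter are hypothetical, and the example sequence is the decoded result above written in pinyin. Closing a name on label D as well as on label E is an assumption based on Table 2, since the text itself only mentions labels B and E.

```python
def extract_names(chars, labels, joiner=""):
    """Collect names from characters and their Table 2 role labels."""
    names, current = [], []
    for ch, lab in zip(chars, labels):
        if lab == "B":                       # a surname opens a new name
            current = [ch]
        elif lab == "C" and current:         # first character of a given name
            current.append(ch)
        elif lab in ("D", "E") and current:  # last character closes the name
            current.append(ch)
            names.append(joiner.join(current))
            current = []
        else:
            current = []                     # any other label ends a partial name
    return names

decoded = "qin|A ai|A de|K Zhang|B Wu|E he|M Niu|B San|C Hua|D deng|L lü|A ke|A"
chars, labels = zip(*(token.split("|") for token in decoded.split()))
print(extract_names(chars, labels, joiner=" "))
```

With Chinese characters as input the default empty `joiner` would be used, so that the characters of a name are simply concatenated.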
The aviation change information extraction system of this embodiment constructs a dictionary from a plurality of pieces of aviation change sample information; trains a first CRF model on the aviation change sample information according to the dictionary to obtain a Chinese word segmentation model; segments the aviation change sample information with the Chinese word segmentation model to obtain an entity training corpus, and trains a second CRF model on the entity training corpus to obtain an entity analysis model; and inputs the aviation change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the flight entity information. In addition, an HMM model is trained on the entity training corpus to obtain a name recognition model, and the name recognition model together with the Viterbi algorithm is used to recognize names in the aviation change information.
The system can thus recognize and extract flight entity information automatically through the Chinese word segmentation model and the entity analysis model, and can recognize names in the aviation change information with the name recognition model and the Viterbi algorithm, which improves the efficiency of flight entity information recognition.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (8)

1. An aviation change information extraction method, characterized by comprising the following steps:
S1, constructing a dictionary from a plurality of pieces of aviation change sample information;
S2, pre-segmenting the aviation change sample information according to the dictionary to obtain a word segmentation training corpus; setting a label sequence for the word segmentation training corpus, and constructing a feature vector for the word segmentation training corpus;
S3, training a first CRF model with the feature vector and the label sequence of the word segmentation training corpus to obtain a Chinese word segmentation model, wherein the Chinese word segmentation model is used for segmenting aviation change information according to the dictionary to obtain word segmentation information;
S4, inputting the aviation change sample information into the Chinese word segmentation model to obtain an entity training corpus, labeling the entity training corpus according to preset flight entity information to obtain a labeling sequence, and constructing a feature vector for the entity training corpus, wherein the preset flight entity information is used for representing change information of a flight;
S5, training a second CRF model with the feature vector and the labeling sequence of the entity training corpus to obtain an entity analysis model, wherein the entity analysis model is used for parsing the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information;
S6, inputting the aviation change information into the Chinese word segmentation model and the entity analysis model in sequence to obtain the content of the preset flight entity information;
wherein step S2 comprises:
S21, pre-segmenting the aviation change sample information to obtain a pre-segmented corpus;
S22, merging the pre-segmented corpus according to the dictionary to obtain the word segmentation training corpus, wherein the word segmentation training corpus comprises a plurality of first segmented words;
S23, setting labels for the first segmented words respectively to obtain the label sequence;
S24, each first segmented word comprising a plurality of characters, constructing a feature for each character, constructing a feature vector of the first segmented word from the features of the characters and the dictionary, and constructing the feature vector of the word segmentation training corpus from the feature vectors of the first segmented words.
2. The aviation change information extraction method according to claim 1, wherein the entity training corpus comprises a plurality of second segmented words, and the step of constructing the feature vector for the entity training corpus in step S4 comprises:
constructing features for the plurality of second segmented words respectively, and constructing the feature vector of the entity training corpus from the features of the second segmented words and the dictionary.
3. The aviation change information extraction method according to claim 1, wherein in step S1, before the dictionary is constructed, the traditional Chinese encoding format of the plurality of pieces of aviation change sample information is converted into the simplified Chinese encoding format.
4. The aviation change information extraction method according to claim 1, wherein step S4 further comprises:
labeling the entity training corpus according to a role table of name components to obtain a role labeling sequence, and training an HMM model with the role labeling sequence to obtain a name recognition model;
and step S6 comprises:
inputting the aviation change information, which contains name information, into the HMM model, and solving with the Viterbi algorithm to obtain the name information in the aviation change information.
5. An aeroderivative information extraction system, comprising:
the dictionary construction module is used for constructing a dictionary according to the plurality of aviation variation sample information;
the word segmentation characteristic construction module is used for carrying out word segmentation on the aviation change sample information according to the dictionary to obtain word segmentation training corpora; the system is also used for setting a label sequence for the participle training corpus and constructing a feature vector for the participle training corpus;
the Chinese word segmentation model training module is used for training a first CRF (learning control parameter) model by using the feature vectors and the label sequences of the word segmentation training corpus to obtain a Chinese word segmentation model, and the Chinese word segmentation model is used for segmenting the aviation change information according to the dictionary to obtain word segmentation information;
the entity feature construction module is used for inputting aviation variation sample information to the Chinese word segmentation model to obtain an entity training corpus, marking the entity training corpus according to preset flight entity information to obtain a marking sequence, and constructing a feature vector for the entity training corpus, wherein the preset flight entity information is used for representing variation information of flights;
the entity analysis model training module is used for training a second CRF (model reference frame) model by utilizing the characteristic vector and the tagging sequence of the entity training corpus to obtain an entity analysis model, and the entity analysis model is used for analyzing the word segmentation information according to the preset flight entity information to obtain the content of the preset flight entity information;
the entity analysis module is used for sequentially inputting the flight variation information into the Chinese word segmentation model and the entity analysis model to obtain the content of the preset flight entity information;
the word segmentation feature construction module is further configured to pre-segment the aviation change sample information to obtain a pre-segmented corpus; to merge the pre-segmented corpus according to the dictionary to obtain the word segmentation training corpus, the word segmentation training corpus comprising a plurality of first segmented words; and to set a label for each of the plurality of first segmented words to obtain the label sequence, wherein each first segmented word comprises a plurality of characters;
the word segmentation feature construction module is further configured to construct features for each character, to construct a feature vector of each first segmented word using the features of its characters and the dictionary, and to construct the feature vectors of the word segmentation training corpus using the feature vectors of the first segmented words.
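As an illustration of the label sequence and character features recited in claim 5, the following Python sketch assigns one label per character of each segmented word and builds a feature dictionary for each character. The B/M/E/S label scheme, the feature names, the `max_len` parameter, and the sample dictionary entries are assumptions made for this example; the claim does not specify them.

```python
def bmes_labels(words):
    """Return one B/M/E/S label per character for a list of segmented words."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


def char_features(chars, i, dictionary, max_len=6):
    """Build a feature dict for the character at position i (hypothetical feature set)."""
    ch = chars[i]
    longest = min(max_len, len(chars) - i)
    # Dictionary feature: does some dictionary word of 2..max_len characters start here?
    starts_dict_word = any(
        "".join(chars[i:i + n]) in dictionary for n in range(2, longest + 1)
    )
    return {
        "char": ch,
        "prev": chars[i - 1] if i > 0 else "<BOS>",
        "next": chars[i + 1] if i + 1 < len(chars) else "<EOS>",
        "is_digit": ch.isdigit(),
        "is_alpha": ch.isascii() and ch.isalpha(),
        "starts_dict_word": starts_dict_word,
    }


# Hypothetical sample: three segmented words and a small dictionary.
words = ["航班", "MU5137", "取消"]
dictionary = {"航班", "MU5137", "取消"}
chars = [c for w in words for c in w]
labels = bmes_labels(words)
features = [char_features(chars, i, dictionary) for i in range(len(chars))]
```

The per-character feature dictionaries and the label list have the same length, so together they form one training sequence of the kind a CRF toolkit typically consumes.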
6. The aviation change information extraction system of claim 5, wherein the entity training corpus comprises a plurality of second segmented words, and the entity feature construction module is further configured to construct features for each of the plurality of second segmented words and to construct the feature vectors of the entity training corpus using the features of the second segmented words and the dictionary.
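The word-level feature construction of claim 6 might be sketched in Python as follows. This is a minimal illustration only: the feature names, the idea that the dictionary maps a word to a category, and the `CHANGE_TYPE` category are hypothetical and are not taken from the claims.

```python
def word_features(words, i, dictionary):
    """Build a feature dict for the segmented word at position i (hypothetical feature set)."""
    word = words[i]
    return {
        "word": word,
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "has_digit": any(c.isdigit() for c in word),
        # Dictionary feature: category of the word if the dictionary lists it.
        "dict_category": dictionary.get(word, "NONE"),
    }


def sentence_features(words, dictionary):
    """Feature dicts for every segmented word of one message."""
    return [word_features(words, i, dictionary) for i in range(len(words))]


# Hypothetical sample: output of the word segmentation model and a tiny dictionary.
segmented = ["航班", "MU5137", "取消"]
category_dictionary = {"取消": "CHANGE_TYPE"}
entity_features = sentence_features(segmented, category_dictionary)
```

Paired with a labeling sequence of the same length, such a list of feature dictionaries is one possible input for training the second CRF model.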
7. The aviation change information extraction system of claim 5, wherein the dictionary construction module is further configured to convert the plurality of pieces of aviation change sample information from a traditional Chinese encoding format into a simplified Chinese encoding format before constructing the dictionary.
8. The aviation change information extraction system of claim 5, wherein the entity feature construction module is further configured to label the entity training corpus according to a role table formed by names to obtain a role labeling sequence, and to train an HMM (hidden Markov model) using the role labeling sequence to obtain a name recognition model;
the entity analysis module is further configured to input aviation change information that contains name information into the HMM, and to obtain the name information in the aviation change information by decoding with the Viterbi algorithm.
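The Viterbi decoding step of claim 8 can be sketched in Python as follows. The role tags (`O` for other, `S` for surname, `G` for given name), all probabilities, and the sample tokens are hypothetical values chosen for this illustration; in the claimed system the HMM parameters would come from training on the role labeling sequence.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable state sequence for the observations (log-space Viterbi)."""
    if not observations:
        return []

    def lg(p):
        # Floor zero or missing probabilities so the logarithm stays finite.
        return math.log(max(p, floor))

    # scores[t][s]: best log-probability of any path that ends in state s at time t.
    scores = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(observations[0], 0.0))
               for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        scores.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, scores[t - 1][p] + lg(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda item: item[1],
            )
            scores[t][s] = best + lg(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical role-tag HMM.
states = ["O", "S", "G"]
start_p = {"O": 0.8, "S": 0.2, "G": 0.0}
trans_p = {
    "O": {"O": 0.7, "S": 0.3, "G": 0.0},
    "S": {"O": 0.0, "S": 0.0, "G": 1.0},
    "G": {"O": 0.6, "S": 0.0, "G": 0.4},
}
emit_p = {
    "O": {"旅客": 0.5, "改签": 0.5},
    "S": {"王": 1.0},
    "G": {"明": 1.0},
}
tokens = ["旅客", "王", "明", "改签"]
roles = viterbi(tokens, states, start_p, trans_p, emit_p)
# Tokens tagged as surname or given name are joined to form the recognized name.
name = "".join(tok for tok, role in zip(tokens, roles) if role in ("S", "G"))
```

Working in log space avoids numerical underflow on long messages; the probability floor is a simple stand-in for whatever smoothing a trained model would use.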
CN201810385920.8A 2018-04-26 2018-04-26 Aviation transformer information extraction method and system Active CN108595430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810385920.8A CN108595430B (en) 2018-04-26 2018-04-26 Aviation transformer information extraction method and system

Publications (2)

Publication Number Publication Date
CN108595430A (en) 2018-09-28
CN108595430B (en) 2022-02-22

Family

ID=63610298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810385920.8A Active CN108595430B (en) 2018-04-26 2018-04-26 Aviation transformer information extraction method and system

Country Status (1)

Country Link
CN (1) CN108595430B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222340B (en) * 2019-06-06 2023-04-18 掌阅科技股份有限公司 Training method of book figure name recognition model, electronic device and storage medium
CN110782380B (en) * 2019-10-30 2022-07-08 北京创鑫旅程网络技术有限公司 Aviation change information management method and device and storage medium
CN111914538B (en) * 2020-07-31 2024-05-31 长江航道测量中心 Channel notification information intelligent space matching method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077164A (en) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
CN103995885A (en) * 2014-05-29 2014-08-20 百度在线网络技术(北京)有限公司 Method and device for recognizing entity names
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN104933152A (en) * 2015-06-24 2015-09-23 北京京东尚科信息技术有限公司 Named entity recognition method and device
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A Chinese electronic medical record word segmentation and named entity recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291700A (en) * 2017-07-17 2017-10-24 广州特道信息科技有限公司 Entity word recognition method and device

Also Published As

Publication number Publication date
CN108595430A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN111104498B (en) Semantic understanding method in task type dialogue system
CN108959242B (en) Target entity identification method and device based on part-of-speech characteristics of Chinese characters
CN107766371B (en) Text information classification method and device
CN110909548B (en) Chinese named entity recognition method, device and computer readable storage medium
CN107291783B (en) Semantic matching method and intelligent equipment
CN104503998B (en) Category identification method and device for user query sentences
CN108595430B (en) Aviation transformer information extraction method and system
CN108664474B (en) Resume analysis method based on deep learning
CN109635288A (en) A resume extraction method based on deep neural network
CN111428480B (en) Resume identification method, device, equipment and storage medium
CN112257452B (en) Training method, training device, training equipment and training storage medium for emotion recognition model
CN109657039B (en) Work history information extraction method based on double-layer BiLSTM-CRF
CN108829823A (en) A text classification method
CN109582788A (en) Spam comment training and recognition method, device, equipment and readable storage medium
CN107133212A (en) A text entailment recognition method based on ensemble learning and combined word and sentence information
CN108763192B (en) Entity relation extraction method and device for text processing
CN113268615A (en) Resource label generation method and device, electronic equipment and storage medium
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN115795056A (en) Method, server and storage medium for constructing knowledge graph by unstructured information
CN113420548A (en) Entity extraction sampling method based on knowledge distillation and PU learning
CN109446523A (en) Entity attribute extraction model based on BiLSTM and condition random field
CN110069771B (en) Control instruction information processing method based on semantic chunk
CN107894976A (en) A mixed-corpus word segmentation method based on Bi-LSTM
CN107943783A (en) A word segmentation method based on LSTM-CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant