CN111914539A - Channel announcement information extraction method and system based on BiLSTM-CRF model - Google Patents
- Publication number
- CN111914539A (application CN202010756216.6A)
- Authority
- CN
- China
- Prior art keywords
- channel
- information
- bilstm
- announcement
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and a system for extracting channel announcement information based on a BiLSTM-CRF model. Chinese word segmentation is performed on channel-related information; for the segmentation, a word segmentation dictionary of electronic channel map object names is constructed from the channel element map layers and used as the user dictionary. Elements of the channel announcement that have practical significance to users are divided into organization O, location L, subject S, event E and time T to construct a text semantic extraction model of the channel announcement; a BiLSTM-CRF model is trained under the constraint of this text semantic extraction model, and the key information is extracted. The extracted channel announcement information can be used for visualizing textual information such as channel announcements and maintenance scales, visualizing key channel areas, navigation assistance reminders, and mobile-terminal-based channel information interaction and pushing.
Description
Technical Field
The invention relates to the field of intelligent channel information processing, and in particular to a channel announcement information extraction method based on a BiLSTM-CRF model.
Background
Channel announcements are notices that channel authorities issue to the public to keep a channel clear and safe. From their content, a ship can plan its navigation route in advance and avoid, as far as possible, the safety hazards and property losses caused by obstructions to navigation.
In current digital channel informatization, no fixed structured template has yet been established, and channel announcement information is presented mainly as unstructured text. As a result, related business units find it difficult to cooperate through information sharing and process integration, and resources are used inefficiently. These texts need to be converted into structured data by natural language processing techniques to promote the integration and sharing of channel resources.
There is therefore a need in the art for a practical technique that converts unstructured channel announcement data into structured data with spatial identifiers, providing a data basis for practical applications such as intelligent spatial matching of channel announcement information with the electronic channel map in the Changjiang (Yangtze River) channel map APP or other real-time tools.
Disclosure of Invention
The invention aims to provide a technical scheme for extracting channel announcement information based on a BiLSTM-CRF model, so as to improve the utilization of channel announcement information and promote the integration and sharing of channel resources.
The technical scheme of the invention provides a channel announcement information extraction method based on a BiLSTM-CRF model, comprising the following steps:
Step 1, performing Chinese word segmentation on channel-related information, wherein a word segmentation dictionary of electronic channel map object names is constructed from the channel element map layers and used as the user dictionary;
Step 2, extracting key information through geographic entity recognition, wherein elements of the channel announcement that have practical significance to users are divided into organization O, location L, subject S, event E and time T to construct a text semantic extraction model of the channel announcement, a BiLSTM-CRF model is trained under the constraint of the text semantic extraction model, and the key information is extracted.
Furthermore, channel information is acquired in advance, which comprises acquiring and storing channel-related information including channel announcements, planned water depths and maintenance scales; the channel-related information is acquired with a focused web crawler.
Moreover, when pages are crawled, the filtered links are placed into the URL queue in the priority order 'important', 'upstream', 'midstream', 'downstream'.
Moreover, the word segmentation dictionary of electronic channel map object names is constructed from the channel element map layers as follows:
Step 1.1, loading the channel element layers in batches;
Step 1.2, reading an element, extracting its name from the attribute field, and saving the result to the list of read attribute names;
Step 1.3, judging whether any unread elements remain; if so, continuing to read elements by returning to step 1.2; if not, ending the reading process and proceeding to step 1.4;
Step 1.4, writing the final name list obtained in step 1.2 to a text file in sequence, in the 'name + line feed' format of a Chinese word segmentation dictionary, and outputting the file as the word segmentation dictionary.
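Steps 1.1 to 1.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each layer is modelled as a plain list of attribute dictionaries, the function name `build_segmentation_dictionary` is hypothetical, the sample names are invented, and skipping blank or duplicate names is an added assumption. The attribute field `NOBJNM` is the example field named in the embodiment.

```python
import os
import tempfile


def build_segmentation_dictionary(layers, name_field, out_path):
    """Collect element names from every layer and write one name per line."""
    names = []
    seen = set()
    for layer in layers:                      # step 1.1: go through the loaded layers
        for element in layer:                 # steps 1.2-1.3: read until no element remains
            name = (element.get(name_field) or "").strip()
            if name and name not in seen:     # assumption: skip blanks and duplicates
                seen.add(name)
                names.append(name)
    with open(out_path, "w", encoding="utf-8") as f:
        for name in names:                    # step 1.4: "name + line feed" format
            f.write(name + "\n")
    return names


# Hypothetical layers; a real system would read them from the electronic channel map.
layers = [
    [{"NOBJNM": "武汉长江大桥"}, {"NOBJNM": "天兴洲水道"}],
    [{"NOBJNM": "天兴洲水道"}, {"NOBJNM": ""}],
]
out_path = os.path.join(tempfile.mkdtemp(), "channel_dict.txt")
names = build_segmentation_dictionary(layers, "NOBJNM", out_path)
```

A file in this one-name-per-line form is the kind of input that Chinese segmentation tools generally accept as a user dictionary.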
Moreover, in the text semantic extraction model of the channel announcement:
organization O identifies the organization issuing the channel announcement;
location L identifies position-related information contained in the channel announcement, including typical channel features with unambiguous spatial positions;
subject S identifies the main content of the channel announcement, including channel-specific element objects and the operating state of the channel;
event E identifies procedural content in the channel announcement, including natural events and man-made events;
time T identifies the release time of the channel announcement.
Moreover, training a BiLSTM-CRF model under the constraint of the text semantic extraction model comprises annotating the data according to the text semantic extraction model with the BIO tag set used in the Bakeoff-3 evaluation, and adding constraints to the final predicted labels at the CRF layer of the BiLSTM-CRF model.
Moreover, the method is used for spatial information visualization, which comprises spatially matching the geographic entities that the BiLSTM-CRF model labels as locations with the electronic channel map, generating a geofence centered on the spatial position, and marking and displaying real-time channel announcement information.
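As a rough illustration of a geofence centered on a matched position, the sketch below tests whether a ship position lies within a circular fence. The circular shape, the radius, the coordinates and the function names are all assumptions made for illustration; the source does not specify the fence geometry.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def inside_geofence(center, radius_m, point):
    """True if point (lon, lat) lies within radius_m of center (lon, lat)."""
    return haversine_m(center[0], center[1], point[0], point[1]) <= radius_m


# Hypothetical fence of 2 km around a matched location.
fence_center = (114.30, 30.60)
near_ship = (114.31, 30.60)
far_ship = (114.30, 31.60)
```

A navigation reminder could then be triggered for `near_ship` but not for `far_ship`.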
Moreover, the method is used for visualizing key channel areas and for navigation assistance reminders.
Moreover, the method is used for mobile-terminal-based channel information interaction and pushing.
The invention also provides a channel announcement information extraction system based on the BiLSTM-CRF model, which is used for executing the above channel announcement information extraction method based on the BiLSTM-CRF model.
The invention provides a channel announcement information extraction technique based on a BiLSTM-CRF model that extracts channel announcement information quickly. First, web crawler technology is used to crawl and store channel-related information from the channel bureau website. The crawled data are then processed intelligently, splitting the unstructured channel announcement text into independent word units with specific meanings. Next, an 'organization-location-subject-event-time' (OLSET) text semantic extraction model of the channel announcement is constructed, and machine learning training is carried out with a bidirectional long short-term memory network and conditional random field (BiLSTM-CRF) model. Finally, the key information of the channel announcement elements is extracted automatically according to the training result. The extracted channel announcement information can be used for visualizing textual information such as channel announcements and maintenance scales, visualizing key channel areas, navigation assistance reminders, and mobile-terminal-based channel information interaction and pushing. Because the word segmentation dictionary is constructed from electronic channel map object names, channel information can be extracted more accurately than with a conventional dictionary. The invention is suitable not only for extracting the information elements of channel announcements but also for the geospatial referencing and visualization of other shipping information, and indicators such as recognition precision and recall improve continuously as the machine learning model is run and refined.
Drawings
FIG. 1 is a system block diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of channel information acquisition according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of key information extraction according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a Chinese segmentation dictionary construction process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Chinese word segmentation process according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a BiLSTM-CRF model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of spatial information visualization according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the drawings and the embodiment.
The embodiment provides a processing flow for the channel announcement information extraction method based on a BiLSTM-CRF model, implemented as follows:
First, channel-related information is acquired and stored in advance.
In the embodiment, focused web crawler (Focused Crawler) technology is preferably used to crawl channel-related information such as channel announcements, planned water depths and maintenance scales from the website of the Yangtze River channel bureau, and the results can be stored in a database. The crawling process of the embodiment is shown in FIG. 2, and the detailed implementation steps are as follows:
Step 1, defining and describing the crawling target: in a focused web crawler, the crawling target and its description are first defined according to the crawling requirements, namely the channel service web pages of the Yangtze River channel bureau, which include channel scale forecasts, channel announcements, water levels, tide levels, safety warnings, comprehensive service information, monthly water depth plans, annual water depth plans and the like;
Step 2, obtaining the initial URL (http://www.cjhdj.com.cn/hdfw/);
Step 3, crawling the page at the initial URL and obtaining new URLs;
Step 4, filtering out links irrelevant to the crawling target from the new URLs; for example, when channel announcements are crawled, the filtering keyword of the URL address is 'channel_node', that is, every web page address must start with 'http://www.cjhdj.com.cn/hdfw/channel_node/';
Step 5, placing the filtered links into the URL queue in sequence:
In specific implementation, following the business divisions of the Yangtze River channel bureau, the channel announcement web page has sub-columns such as important, upstream, midstream, downstream and summary. The important column carries channel information of significant reference value for ship navigation, such as channel opening and closing, channel adjustments and channel emergencies, while the upstream, midstream and downstream columns provide the announcements for the corresponding geographical sections of the channel. It is therefore preferably suggested that the filtered links be placed into the URL queue in the priority order 'important', 'upstream', 'midstream', 'downstream', for example:
(i) "important" (http://www.cjhdj.com.cn/hdfw/channel_notice/hdtgzy/),
(ii) "upstream" (http://www.cjhdj.com.cn/hdfw/channel_node/hdtgsy/),
(iii) "midstream" (http://www.cjhdj.com.cn/hdfw/channel_note/hdtgzy1/),
(iv) "downstream" (http://www.cjhdj.com.cn/hdfw/channel_notice/hdtgxy/);
Step 6, applying a breadth-first crawling strategy to the filtered links to acquire the web page content;
Step 7, taking the next URL to be crawled as the initial URL and repeating steps 3 to 7;
Step 8, stopping the crawl when no further URL to be crawled can be obtained.
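The crawl loop of steps 2 to 8 can be sketched as below. This is a simplified illustration under stated assumptions: the site is replaced by an in-memory link table so the example runs without network access, and the base address `http://example.org/hdfw/notice/`, the column path names and the function names are hypothetical stand-ins rather than the real website structure.

```python
from collections import deque

# Hypothetical URL prefix and column paths, listed in priority order.
BASE = "http://example.org/hdfw/notice/"
PRIORITY = ["important/", "upstream/", "midstream/", "downstream/"]


def column_rank(url):
    """Rank a URL by the priority of the column it belongs to."""
    for rank, column in enumerate(PRIORITY):
        if url.startswith(BASE + column):
            return rank
    return len(PRIORITY)


def crawl(start_url, fetch_links):
    """Breadth-first focused crawl; fetch_links(url) returns the links found on a page."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue:                                   # step 8: stop when nothing is left
        url = queue.popleft()                      # step 7: next URL to crawl
        visited.append(url)                        # step 6: page content would be stored here
        candidates = [u for u in fetch_links(url)  # step 3: new URLs from the page
                      if u.startswith(BASE)]       # step 4: drop irrelevant links
        for link in sorted(candidates, key=column_rank):   # step 5: queue by priority
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# In-memory stand-in for the website (an assumption for illustration only).
SITE = {
    "http://example.org/hdfw/": [
        BASE + "downstream/", BASE + "important/",
        "http://example.org/about/", BASE + "upstream/", BASE + "midstream/",
    ],
    BASE + "important/": [BASE + "important/1.html"],
    BASE + "upstream/": [BASE + "upstream/1.html", BASE + "important/"],
}
order = crawl("http://example.org/hdfw/", lambda u: SITE.get(u, []))
```

Using a first-in, first-out queue gives the breadth-first order, and sorting each batch of filtered links before queueing gives the column priority.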
Second, Chinese word segmentation and geographic entity recognition are performed on the input channel-related information. The extraction process is shown in FIG. 3, and the detailed implementation steps are as follows:
(1) Chinese word segmentation
The electronic channel map contains channel-related place names, the names of channel facilities such as navigation marks and regulation structures, and other proper nouns that conventional dictionaries do not cover. The embodiment therefore constructs the word segmentation dictionary from electronic channel map object names (processing flow shown in FIG. 4) and segments channel announcement titles with the jieba word segmentation tool in a Python environment (processing flow shown in FIG. 5). The detailed implementation steps are as follows:
Step 1, constructing the word segmentation dictionary of electronic channel map object names; referring to FIG. 4, the specific process is as follows:
Step 1.1, loading the channel element layers in batches.
Step 1.2, reading an element, extracting its name from the attribute field (such as NOBJNM), and saving the result to the list of read attribute names.
Step 1.3, judging whether any unread elements remain; if so, continuing to read elements by repeating step 1.2; if not, ending the reading process and proceeding to step 1.4.
Step 1.4, writing the final name list obtained in step 1.2 to a text file in sequence, in the 'name + line feed' format commonly used by Chinese word segmentation dictionaries, and outputting the file as the word segmentation dictionary.
Step 2, cleaning the sentence to be processed: special characters irrelevant to word segmentation, such as UTF-8 encoded Latin symbols, are separated out and tagged with the unknown part of speech.
Step 3, loading the constructed word segmentation dictionary of electronic channel map object names as the user dictionary to build a trie-based word segmentation model (prefix dictionary).
Step 4, scanning the word graph based on the prefix dictionary to generate a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the text;
Step 5, searching for the maximum-probability path (route) by dynamic programming to find the best segmentation combination based on word frequency;
Step 6, tagging the registered words recorded in the word segmentation dictionary according to the dictionary;
Step 7, recognizing words not included in the word segmentation dictionary separately for Chinese and English: combinations of English letters, numbers and time expressions are given corresponding tags, while for Chinese the word formation probability is computed by a hidden Markov model (HMM) based on the word-forming capability of Chinese characters;
Step 8, performing part-of-speech tagging based on the Viterbi algorithm;
Step 9, extracting keywords based on the TF-IDF and TextRank models.
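Steps 4 and 5 can be illustrated with a small pure-Python sketch of DAG construction and maximum-probability routing. It is a simplified illustration of the idea, not the code of the jieba tool: the word frequencies in `FREQ` are invented, the function names are hypothetical, and unknown-word handling with an HMM (step 7) is left out.

```python
import math

# Hypothetical word frequencies; a real system would load them from the dictionaries.
FREQ = {"天兴洲": 5, "水道": 8, "天兴洲水道": 3, "航标": 6, "调整": 7,
        "天": 2, "兴": 1, "洲": 1, "水": 2, "道": 2,
        "航": 1, "标": 1, "调": 1, "整": 1}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Step 4: for each start index, list every end index that closes a dictionary word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]          # fall back to the single character
    return dag


def best_route(sentence, dag):
    """Step 5: right-to-left dynamic programming over log word probabilities."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = math.log(TOTAL)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment(sentence):
    """Follow the best route from left to right and emit the words."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


words = segment("天兴洲水道航标调整")
```

With the channel name present in the dictionary, the route through the single long word scores higher than the route through its two shorter parts, which is the benefit the map-derived dictionary is meant to provide.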
(2) Named entity recognition
Step 1, although current channel announcement information is unstructured, it still contains specific element units such as organizations, locations, subjects, events and times. Geographic entity recognition for channel announcements can therefore be converted into a sequence labeling problem and simplified into structured classification, which lays the groundwork for the subsequent deep learning. Elements of the channel announcement that have practical significance to users are divided into Organization, Location, Subject, Event and Time, thereby constructing the "Organization-Location-Subject-Event-Time" (OLSET) text semantic extraction model of the channel announcement, wherein:
(1) O (Organization): identifies the organization issuing the channel announcement, such as the Changjiang XX channel bureau/division.
(2) L (Location): identifies position-related information contained in the channel announcement, such as XX channel/water area/river reach/shoal (only XX is labeled; suffixes such as channel, water area, river reach and shoal are not), as well as typical channel features with definite spatial positions, such as bridges and wharves.
(3) S (Subject): identifies the main content of the channel announcement, including channel-specific element objects, such as controlled river reaches, shoals, bridge areas, signal stations and dedicated channels/navigation marks, and the operating state of the channel, such as navigation prohibited/not prohibited, end/start of shift, and navigation mark adjustment/removal/recovery/arrangement/malfunction/abnormal operation.
(4) E (Event): identifies procedural content in the channel announcement, such as the natural events of flood peaks, floods, low water, flood seasons and non-flood seasons, or man-made events such as channel maintenance, dredging, sand mining, construction, operations and surveys.
(5) T (Time): identifies the release time of the channel announcement, such as XX year, X month, X day.
Step 2, performing machine learning training with a bidirectional long short-term memory network and conditional random field (BiLSTM-CRF) model and extracting the key information. The model structure is shown in FIG. 6, and the processing flow is as follows:
1) Based on the text semantic extraction model constructed in step 1, the data are annotated with the BIO (Begin, Inside, Outside) tag set used in the Bakeoff-3 evaluation: B-ORG marks the first character of an organization and I-ORG a non-first character of an organization; B-LOC marks the first character of a location and I-LOC a non-first character of a location; B-SUB marks the first character of a subject and I-SUB a non-first character of a subject; B-EVE marks the first character of an event and I-EVE a non-first character of an event; B-TM marks the first character of a time and I-TM a non-first character of a time; O marks a character that is not part of any named entity.
The invention treats geographic entity recognition as essentially a classification problem, so the targets are divided according to business requirements and recognized in the subsequent steps through machine learning. In the embodiment, the crawled 'important' channel announcement information is used as the training data set and annotated according to the text semantic extraction model.
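The BIO annotation scheme can be illustrated with a short sketch that turns labeled character spans into per-character tags. The announcement title, the entity spans and the helper names `bio_tags` and `span` are invented for illustration; only the tag names come from the description.

```python
def bio_tags(text, entities):
    """Convert (start, end_exclusive, type) character spans into BIO tags."""
    tags = ["O"] * len(text)
    for start, end, etype in entities:
        tags[start] = "B-" + etype            # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype            # remaining characters of the entity
    return tags


def span(text, phrase, etype):
    """Locate the first occurrence of phrase and return its labeled span."""
    start = text.index(phrase)
    return (start, start + len(phrase), etype)


# Hypothetical announcement title; only the name part of the location is labeled.
title = "武汉航道局关于天兴洲水道航标调整的通告"
entities = [
    span(title, "武汉航道局", "ORG"),
    span(title, "天兴洲", "LOC"),
    span(title, "航标调整", "SUB"),
]
tags = bio_tags(title, entities)
```

Each character of the title then carries exactly one tag, which is the form of training data a character-level sequence labeling model expects.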
2) Taking the sentence as the unit, a sentence (a sequence of characters) containing n characters is written as:
$x = (x_1, x_2, \ldots, x_n)$
where $x_i$ is the id of the i-th character of the sentence in the dictionary, from which the one-hot vector of each character is obtained, with dimension equal to the dictionary size.
3) Using a pre-trained or randomly initialized embedding matrix, each character $x_i$ of the sentence is mapped from its one-hot vector to a low-dimensional dense character vector $x_i \in \mathbb{R}^d$, where d is the embedding dimension, and dropout is applied to mitigate overfitting. Dropout temporarily drops neural network units from the network with a certain probability during the training of a deep learning network.
4) Sentence features are extracted automatically. The embedding sequence $(x_1, x_2, \ldots, x_n)$ of the characters of a sentence is used as the input at each time step of the bidirectional LSTM. The hidden state sequence $(\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_n)$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_n)$ output by the backward LSTM are concatenated position by position, $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t] \in \mathbb{R}^m$ (m is the dimension of the concatenated hidden state), to obtain the complete hidden state sequence $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$.
5) After dropout, a linear layer maps the hidden state vectors from m dimensions to k dimensions, where k is the number of tags in the tag set. The automatically extracted sentence features are thus obtained and written as the LSTM output matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$.
Here $\mathbb{R}^{n \times k}$ is the space of the dimension-reduced vectors, and $p_i$ is the i-th row of the LSTM output matrix.
Each dimension $p_{ij}$ of $p_i \in \mathbb{R}^k$ can be regarded as the score for classifying character $x_i$ into the j-th tag. Applying Softmax to P would amount to an independent k-class classification at each position. However, labeling each position in this way cannot use the labels already assigned, so a conditional random field (CRF) layer is attached next to perform the labeling.
6) Sentence-level sequence labeling is performed. The parameter of the CRF layer is a $(k+2) \times (k+2)$ matrix A, where $A_{ij}$ is the transition score from the i-th tag to the j-th tag, so that the tags already assigned can be used when labeling a position; the 2 is added because a start state is added at the head of the sentence and an end state at its tail. For a tag sequence $y = (y_1, y_2, \ldots, y_n)$ whose length equals the sentence length, the model's score for sentence x having tag sequence y is:
$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i,y_i} + \sum_{i=1}^{n+1} A_{y_{i-1},y_i}$$
where $P_{i,y_i}$ is the score for classifying the i-th character into tag $y_i$, and $A_{y_{i-1},y_i}$ is the transition score from tag $y_{i-1}$ to tag $y_i$, with $y_0$ and $y_{n+1}$ denoting the start and end states.
It can be seen that the score of the whole sequence equals the sum of the scores of the positions, and that the score of each position has two parts: one determined by $p_i$ output by the LSTM, and the other by the transition matrix A of the CRF. The normalized probability is then obtained with Softmax:
$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$
where $y'$ ranges over the candidate tag sequences, $\mathrm{score}(x, y)$ is the score for sentence x having tag sequence y, and $\mathrm{score}(x, y')$ is the score for sentence x having tag sequence $y'$.
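The sequence score and its Softmax normalization can be checked numerically on a tiny example. The sketch below enumerates every candidate tag sequence, which is only feasible for very short sentences; the three-tag set, the emission scores `P` and the transition scores `A` are invented values, and the function names are hypothetical.

```python
import itertools
import math

TAGS = ["B-LOC", "I-LOC", "O"]
START, END = len(TAGS), len(TAGS) + 1          # the two extra states of the CRF layer


def sequence_score(P, A, y):
    """Emission scores plus transition scores, including start and end transitions."""
    path = [START] + list(y) + [END]
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    return emission + transition


def sequence_probability(P, A, y):
    """Softmax of the score of y over the scores of all candidate tag sequences."""
    k = len(TAGS)
    all_scores = [sequence_score(P, A, cand)
                  for cand in itertools.product(range(k), repeat=len(P))]
    log_z = math.log(sum(math.exp(s) for s in all_scores))
    return math.exp(sequence_score(P, A, y) - log_z)


# Hypothetical scores: P is n x k, A is (k + 2) x (k + 2).
P = [[2.0, 0.0, 1.0],
     [0.0, 2.0, 1.0]]
A = [[0.0] * 5 for _ in range(5)]
A[2][1] = -5.0        # discourage O -> I-LOC
A[START][1] = -5.0    # discourage starting with I-LOC

p_good = sequence_probability(P, A, (0, 1))    # B-LOC, I-LOC
p_bad = sequence_probability(P, A, (2, 1))     # O, I-LOC
```

The negative transition scores lower the probability of the sequence that places I-LOC after O, even though its emission scores alone are fairly high.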
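7) The log-likelihood is maximized. For one training sample $(x, y)$, the log-likelihood is:
$$\log P(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y'} \exp(\mathrm{score}(x, y'))$$
where $y'$ ranges over the possible tag sequences.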
8) The predicted tag of each word is obtained. The optimal path is solved with the Viterbi algorithm, which uses dynamic programming:
$$y^{*} = \arg\max_{y'} \mathrm{score}(x, y')$$
The Viterbi algorithm is a classical dynamic-programming algorithm for finding the optimal path, and its details are not repeated here.
9) The CRF layer imposes rule constraints. The BiLSTM can produce a tag for each word in the sentence, but there is no guarantee that every tag is predicted correctly. The CRF layer adds constraints to the final predicted labels so that they conform to the labeling rules, and these constraints are learned automatically by the CRF layer from the training data. Connecting a CRF layer makes label prediction sentence-level: the labeling process no longer classifies each word independently, the transition probabilities of the sequence are introduced, and finally the loss function is computed and fed back to the network. Under the action of the CRF, the sequence is regularized according to the transition probabilities.
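The scoring and likelihood formulas of steps 6) and 7) can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the function names, the list-of-lists layout of `P` and `A`, and the convention that two extra indices of `A` hold the start and end states are assumptions made for the example. The normalizer is computed with the standard forward recursion instead of enumerating every tag sequence; the brute-force helper exists only to check the recursion on small inputs.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(P, A, y, start, end):
    """score(x, y): emission scores P[i][y_i] plus transition scores,
    including the start -> y_1 and y_n -> end transitions."""
    path = [start] + list(y) + [end]
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    return emission + transition


def log_partition(P, A, start, end):
    """log of the Softmax denominator, via the forward recursion."""
    k = len(P[0])
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [
            logsumexp([alpha[a] + A[a][j] for a in range(k)]) + P[i][j]
            for j in range(k)
        ]
    return logsumexp([alpha[j] + A[j][end] for j in range(k)])


def log_likelihood(P, A, y, start, end):
    """log P(y | x) = score(x, y) - log sum_{y'} exp(score(x, y'))."""
    return sequence_score(P, A, y, start, end) - log_partition(P, A, start, end)


def brute_force_log_partition(P, A, start, end):
    """Same quantity by enumerating all k**n tag sequences (small inputs only)."""
    k = len(P[0])
    scores = [sequence_score(P, A, y, start, end)
              for y in product(range(k), repeat=len(P))]
    return logsumexp(scores)


# Toy example (made-up numbers): n = 3 words, k = 2 labels,
# index 2 is the assumed start state and index 3 the assumed end state.
P = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
A = [[0.1, 0.4, 0.0, 0.2],
     [0.3, 0.2, 0.0, 0.1],
     [0.5, 0.6, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
START, END = 2, 3
loss = -log_likelihood(P, A, [0, 1, 0], START, END)  # value to minimize in training
```

Minimizing `loss` over the training samples is equivalent to maximizing the log-likelihood in step 7); in practice the gradient would be obtained through an automatic differentiation framework, which this sketch does not cover.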
In the embodiment, after the training and learning of the model are finished, the crawled 'upstream', 'midstream' and 'downstream' channel announcement information is used as a test data set to verify and evaluate the model processing result.
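When the trained model is applied to test announcements, the tag path of step 8) is decoded with the Viterbi algorithm. The following is a hedged sketch of that decoding, under the same assumed data layout as the scoring formula (a list-of-lists emission matrix `P`, a transition matrix `A` with two extra indices for the start and end states); the function names are illustrative and do not come from the patent.

```python
def path_score(P, A, y, start, end):
    """Score of one tag path: emissions plus transitions, start and end included."""
    path = [start] + list(y) + [end]
    return (sum(P[i][tag] for i, tag in enumerate(y))
            + sum(A[a][b] for a, b in zip(path, path[1:])))


def viterbi_decode(P, A, start, end):
    """Return (best_path, best_score) maximizing path_score by dynamic programming."""
    n, k = len(P), len(P[0])
    # best[j]: highest score of any partial path ending in label j at the current word
    best = [A[start][j] + P[0][j] for j in range(k)]
    backpointers = []
    for i in range(1, n):
        step_best, step_back = [], []
        for j in range(k):
            prev = max(range(k), key=lambda a: best[a] + A[a][j])
            step_back.append(prev)
            step_best.append(best[prev] + A[prev][j] + P[i][j])
        best = step_best
        backpointers.append(step_back)
    last = max(range(k), key=lambda j: best[j] + A[j][end])
    best_score = best[last] + A[last][end]
    # Follow the back-pointers from the last word to the first.
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, best_score


# Toy example (made-up numbers): 3 words, 2 labels, start index 2, end index 3.
P = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
A = [[0.1, 0.4, 0.0, 0.2],
     [0.3, 0.2, 0.0, 0.1],
     [0.5, 0.6, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
best_path, best_score = viterbi_decode(P, A, 2, 3)
```

The recursion visits each word once and each label pair once per word, so its cost grows as n·k² instead of the kⁿ paths a direct search would examine.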
In specific implementation, a person skilled in the art can implement the method of this technical scheme as an automatically running process using computer software technology, and a system device that runs the method also falls within the protection scope of the invention. Referring to fig. 1, the embodiment further provides a channel announcement information extraction system based on the BiLSTM-CRF model, which includes a Chinese word segmentation module (10) and a named entity recognition module (20).
The Chinese word segmentation module (10) is used for performing Chinese word segmentation on the channel-related information; during segmentation, an electronic channel map object-name word segmentation dictionary is constructed from the channel element layers and used as a login dictionary;
the named entity recognition module (20) is used for extracting key information through geographic entity recognition, which comprises dividing the elements of the channel announcement information that have practical significance to users into mechanism O, place L, subject S, event E and time T, constructing a text semantic extraction model of the channel announcement, training with a BiLSTM-CRF model under the constraint of the text semantic extraction model, and extracting the key information.
For the implementation of each module, refer to the description of the corresponding method step; it is not repeated here.
The technical scheme of the invention can be applied to various subsequent applications, such as:
(1) Visualization of text information such as channel announcements and maintenance scales. The Yangtze River channel map APP obtains the channel announcement text published by the Yangtze River channel bureau and extracts information such as geographic position, range, key content and start and end times through natural language processing techniques such as template matching, word segmentation and named entity recognition. After this information is structured, the spatial information in the text, such as geographic position and range, is matched with elements such as water channels and mileage on the electronic channel map. A geo-fence is constructed from a coordinate point and a buffer area, or from a polygon; the key content of the channel announcement and the maintenance scale is displayed on the geo-fence, and the start and end times control when the information is displayed and removed.
(2) Visualization of key channel areas and navigation assistance reminders. Geo-fences are constructed for areas such as anchorages, warning areas and restricted navigation areas and overlaid on the electronic channel map; when a ship's positioning signal is near or within a geo-fence, the related early-warning and alert information can be pushed to the ship. Geo-fences are also constructed for the water channel ranges of the Yangtze River electronic channel map; when a ship's positioning signal is near or within the geo-fence of a water channel, the channel maintenance scale, navigation mark and water level information, channel announcements, surrounding meteorological information and related geographic points of interest (POI) for that water channel can be pushed to the ship.
(3) Channel information interaction and pushing based on the mobile terminal. The Yangtze River channel map APP provides a user reporting function: the user clicks on the electronic channel map to obtain the geographic coordinates of the position, a geo-fence is created from those coordinates, and a report is generated. The user can attach a text description or on-site photos to the report and submit the on-site navigation conditions to the relevant management department, which helps the department confirm the situation promptly and update and release information quickly.
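The "near or within the geo-fence" test that triggers pushing in application (2) can be illustrated as follows. This is only a sketch under stated assumptions: the fence is taken to be circular (a center point plus a buffer radius), distances are computed with the haversine formula on a spherical Earth of radius 6,371 km, and the `approach_margin_m` parameter that defines "near" is hypothetical; the patent does not specify any of these details.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def fence_status(ship, fence_center, radius_m, approach_margin_m):
    """Classify a ship position relative to a circular geo-fence.

    Returns "inside", "near" (within approach_margin_m of the boundary),
    or "outside". Positions are (lat, lon) tuples in degrees.
    """
    distance = haversine_m(ship[0], ship[1], fence_center[0], fence_center[1])
    if distance <= radius_m:
        return "inside"
    if distance <= radius_m + approach_margin_m:
        return "near"
    return "outside"


def should_push(ship, fence_center, radius_m, approach_margin_m):
    """Push information when the ship is near or within the fence."""
    return fence_status(ship, fence_center, radius_m, approach_margin_m) != "outside"
```

A polygonal fence would need a point-in-polygon test in place of the distance comparison; the circular case is shown because it needs only a center and a radius.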
For reference, the embodiment describes the channel information visualization in detail as follows:
The identified geographic entities, namely the entities labeled as "location" in the named entity recognition step, are spatially matched with the electronic channel map, a geo-fence is generated centered on the spatial position, and real-time channel notification information is labeled on it. The visualization process is shown in fig. 7, and the detailed implementation steps are as follows:
Step 1, parsing and obtaining the longitude and latitude of the current position from AIS data or mobile terminal GPS data, and judging whether the current position lies within the range of the relevant APP map; if not, roaming to the map where the current position is located.
Step 2, extracting the centers of channel element features so that the notification information can be drawn at the center of each feature: overlay analysis is performed within the current map range to obtain typical channel features with definite spatial positions, such as water channels, navigation marks, bridges and wharves, and their center positions are calculated in turn so that the channel notification information can be drawn centered on them. For point features such as navigation marks and obstructions, the center position is the actual position; for linear or areal features such as bridges, wharves and water channels, the center position can be expressed as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
where $x_i$ and $y_i$ are the coordinates of point element $i$ making up the linear or areal feature, and $n$ is the total number of such point elements.
Step 3, calculating a suitable buffer radius (for example, one third of the screen width) or polygon range according to the resolution of the current mobile device and the center positions obtained in step 2, and constructing the geo-fences in turn.
Step 4, checking whether the geo-fences constructed in step 3 cover one another. For simple polygonal geo-fences the ray method is efficient: starting from each vertex of geo-fence A, a ray is drawn along the X axis, its intersections with each edge of geo-fence B are determined, and the number of intersections is counted. If the number is even, geo-fences A and B do not cover each other; otherwise they do, and the geo-fence range must then be adjusted or offset.
Step 5, obtaining the corresponding key information in turn through WebService requests, based on the feature names obtained in step 2.
Step 6, based on the feature center positions obtained in step 2 and the key information obtained in step 5, organizing the channel notification information into a preset format (for example, feature name + event + time) to simplify it, and drawing the label within the geo-fence range determined in step 4.
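Steps 2, 3 and 6 above involve only simple arithmetic and string assembly, which the following sketch illustrates. It is not taken from the patent: the function names, the default separator between the label fields, and the pixel-based radius are assumptions chosen for the example, and the center is computed as the mean of the constituent point coordinates as described in step 2.

```python
def feature_center(points):
    """Center of a linear or areal feature: the mean of its point coordinates.

    points is a non-empty list of (x, y) tuples.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return mean_x, mean_y


def buffer_radius_px(screen_width_px, fraction=1 / 3):
    """Buffer radius in pixels as a fraction of the screen width.

    One third is the example value given in step 3; other fractions are possible.
    """
    return screen_width_px * fraction


def format_notice(feature_name, event, time, sep=" "):
    """Assemble the simplified label 'feature name + event + time'.

    The separator is an assumption; the text only names the three fields.
    """
    return sep.join([feature_name, event, time])


# Hypothetical usage with made-up values.
center = feature_center([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])
radius = buffer_radius_px(1080)
label = format_notice("Bridge A", "maintenance", "2020-07-31")
```

Note that the coordinate mean is not the area centroid of a polygon; for strongly irregular shapes the mean can fall outside the feature, in which case the label position may need adjusting.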
In specific implementation, the above applications can also be automatically run in a software manner.
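As one example of running these steps in software, the ray method of step 4 can be sketched as below. The code follows the description (a ray along the X axis from each vertex of one fence, counting crossings with the edges of the other, an odd count meaning covered) but makes two assumptions that are not in the text: the ray points in the positive X direction, and the test is also run from the vertices of B against A so that a fence lying entirely inside the other is detected in either order. Fences whose edges cross without any vertex lying inside the other are not detected by a vertex-only test.

```python
def point_in_polygon(point, polygon):
    """Ray casting: True if point lies inside the simple polygon.

    polygon is a list of (x, y) vertices; the ray runs from the point
    toward +X and the crossings with polygon edges are counted.
    """
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge crosses the horizontal line through the point
        # only if its endpoints lie on opposite sides of that line.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > px:
                inside = not inside  # odd number of crossings so far
    return inside


def fences_overlap(fence_a, fence_b):
    """True if any vertex of one fence lies inside the other fence."""
    return (any(point_in_polygon(p, fence_b) for p in fence_a)
            or any(point_in_polygon(p, fence_a) for p in fence_b))


# Hypothetical fences: A and B overlap, C is far from both.
fence_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
fence_b = [(2, 2), (6, 2), (6, 6), (2, 6)]
fence_c = [(10, 10), (12, 10), (12, 12), (10, 12)]
```

When `fences_overlap` returns true, the caller would adjust or offset one of the fences as described in step 4; how the adjustment is made is left open here.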
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Claims (10)
1. A channel announcement information extraction method based on a BiLSTM-CRF model is characterized by comprising the following steps:
step 1, Chinese word segmentation is carried out according to relevant information of a navigation channel, and when Chinese word segmentation is carried out, an electronic navigation channel graph target name word segmentation dictionary is constructed according to a navigation channel element graph layer and is used as a login dictionary;
step 2, extracting key information through geographic entity recognition, wherein the key information extraction is realized by dividing elements which have practical significance to users in the channel notice information according to a mechanism O, a place L, a subject S, an event E and time T, constructing a text semantic extraction model of the channel notice, training by adopting a BiLSTM-CRF model under the constraint of the text semantic extraction model, and extracting the key information.
2. The method for extracting information of a channel notice based on a BiLSTM-CRF model as claimed in claim 1, wherein: channel information acquisition is carried out in advance, and comprises the steps of acquiring and storing channel related information, wherein the channel related information comprises channel announcements, planned water depth and maintenance scale; and acquiring relevant information of the navigation channel by adopting a focused web crawler mode.
3. The method for extracting information of a channel notice based on a BiLSTM-CRF model as claimed in claim 2, wherein: and when crawling the page, putting the filtered links into a URL queue in sequence according to the priorities of 'important', 'upstream', 'midstream' and 'downstream'.
4. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: the implementation mode of constructing the electronic channel map object name word segmentation dictionary according to the channel element map layer is as follows,
step 1.1, loading channel element layers in batches;
step 1.2, reading the element, extracting the element name according to the attribute field, and storing the result to a read attribute name list;
step 1.3, judging whether unread elements exist at present, if so, continuing to read the elements, returning to the step 1.2, and if not, ending the reading process and entering the step 1.4;
and step 1.4, according to the final name list obtained in the step 1.2, writing the final name list into the text file in sequence according to the format of 'name + line feed' of the Chinese word segmentation dictionary, and outputting the final file as the word segmentation dictionary.
5. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: in the text semantic extraction model of the channel announcement,
a mechanism O for identifying a channel announcement issuing mechanism;
a location L for identifying position-related information contained in the channel announcements, including typical channel features with unambiguous spatial location characteristics;
the theme S is used for identifying the main content contained in the channel announcement, wherein the main content comprises channel special element objects and the running state of a channel;
event E, used for identifying the procedural content in the channel announcement, including natural events and artificial events;
and the time T is used for identifying the release time of the channel announcement.
6. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: and training by adopting a BiLSTM-CRF model under the constraint of the text semantic extraction model, wherein the training comprises the steps of marking the text semantic extraction model by using a BIO marking set adopted in Bakeoff-3 evaluation, and adding constraint for a finally predicted label on a CRF layer of the BiLSTM-CRF model.
7. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: the method is used for spatial information visualization, and comprises the steps of carrying out spatial matching on a geographic entity with a place as a label identified by a BiLSTM-CRF model and an electronic channel map, generating a geographic fence by taking a spatial position as a center, and marking and displaying real-time channel notification information.
8. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: the method is used for channel key area visualization and navigation auxiliary reminding.
9. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: the method is used for channel information interaction and pushing based on the mobile terminal.
10. A channel announcement information extraction system based on a BiLSTM-CRF model is characterized in that: for carrying out the method of extracting information of a channel announcement based on a BiLSTM-CRF model as claimed in claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010756216.6A CN111914539B (en) | 2020-07-31 | 2020-07-31 | Channel notification information extraction method and system based on BiLSTM-CRF model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111914539A true CN111914539A (en) | 2020-11-10 |
CN111914539B CN111914539B (en) | 2024-09-10 |
Family
ID=73288173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010756216.6A Active CN111914539B (en) | 2020-07-31 | 2020-07-31 | Channel notification information extraction method and system based on BiLSTM-CRF model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111914539B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018028077A1 (en) * | 2016-08-11 | 2018-02-15 | 中兴通讯股份有限公司 | Deep learning based method and device for chinese semantics analysis |
CN108595430A (en) * | 2018-04-26 | 2018-09-28 | 携程旅游网络技术(上海)有限公司 | Boat becomes information extracting method and system |
CN109697285A (en) * | 2018-12-13 | 2019-04-30 | 中南大学 | Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness |
CN110990565A (en) * | 2019-11-20 | 2020-04-10 | 广州商品清算中心股份有限公司 | Extensible text analysis system and method for public sentiment analysis |
CN111274804A (en) * | 2020-01-17 | 2020-06-12 | 珠海市新德汇信息技术有限公司 | Case information extraction method based on named entity recognition |
Non-Patent Citations (1)
Title |
---|
王红;李浩飞;邸帅;: "民航突发事件实体识别方法研究", 计算机应用与软件, no. 03, 12 March 2020 (2020-03-12) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113127503A (en) * | 2021-03-18 | 2021-07-16 | 中国科学院国家空间科学中心 | Automatic information extraction method and system for aerospace information |
CN113011183A (en) * | 2021-03-23 | 2021-06-22 | 北京科东电力控制系统有限责任公司 | Unstructured text data processing method and system in electric power regulation and control field |
CN113011183B (en) * | 2021-03-23 | 2023-09-05 | 北京科东电力控制系统有限责任公司 | Unstructured text data processing method and system in electric power regulation and control field |
CN112861540A (en) * | 2021-04-25 | 2021-05-28 | 成都索贝视频云计算有限公司 | Broadcast television news keyword automatic extraction method based on deep learning |
CN113282767A (en) * | 2021-04-30 | 2021-08-20 | 武汉大学 | Text-oriented relative position information extraction method |
CN113282767B (en) * | 2021-04-30 | 2022-08-30 | 武汉大学 | Text-oriented relative position information extraction method |
CN114819771A (en) * | 2022-06-28 | 2022-07-29 | 北京中海住梦科技有限公司 | Task allocation method and device, storage medium and electronic equipment |
CN118055175A (en) * | 2024-04-16 | 2024-05-17 | 南京莱斯信息技术股份有限公司 | Message analysis processing method combining rule engine and deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN111914539B (en) | 2024-09-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |