CN111914539A - Channel announcement information extraction method and system based on BiLSTM-CRF model

Info

Publication number
CN111914539A
Authority
CN
China
Prior art keywords
channel
information
bilstm
announcement
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010756216.6A
Other languages
Chinese (zh)
Other versions
CN111914539B (en)
Inventor
杨保岑
朱剑华
何明宪
张秋实
李�赫
李莉
徐硕
朱楠
周冠男
吕霖
徐乐
李伟凡
李艳芳
彭洋
刘思鹏
杨传波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHANGJIANG WATERWAY SURVEY CENTER
Original Assignee
CHANGJIANG WATERWAY SURVEY CENTER
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHANGJIANG WATERWAY SURVEY CENTER
Priority to CN202010756216.6A
Publication of CN111914539A
Application granted
Publication of CN111914539B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for extracting channel announcement information based on a BiLSTM-CRF model. Chinese word segmentation is performed on channel-related information; during segmentation, a word segmentation dictionary of electronic channel map object names is constructed from the channel element layers and used as the registered-word dictionary. The elements of a channel announcement that are of practical significance to users are divided into organization O, location L, subject S, event E and time T to construct a text semantic extraction model of the channel announcement; a BiLSTM-CRF model is trained under the constraint of this text semantic extraction model, and key information is extracted. The extracted channel announcement information can be used for visualizing textual information such as channel announcements and channel dimensions, visualizing key channel areas, providing navigation assistance reminders, and interacting with and pushing channel information on mobile terminals.

Description

Channel announcement information extraction method and system based on BiLSTM-CRF model
Technical Field
The invention relates to the field of intelligent channel information processing, and in particular to a channel announcement information extraction method based on a BiLSTM-CRF model.
Background
Channel announcements are notices issued to the public by channel authorities to keep the channel open and safe. With the content of these announcements, a ship can plan its navigation route in advance and thereby avoid, as far as possible, safety hazards and property losses caused by obstacles.
In current digital channel informatization, no fixed structured template has yet been established, and channel announcement information is mainly presented as unstructured text. As a result, related business units have difficulty cooperating through information sharing and process integration, and resources are used inefficiently. These texts therefore need to be converted into structured data by natural language processing techniques to promote the integration and sharing of channel resources.
There is therefore a need in the art for a practical technique that converts unstructured channel announcement data into structured data with spatial identifiers, providing a data basis for practical applications such as intelligent spatial matching of channel announcement information with the electronic channel map in the Yangtze River channel map app or other real-time tools.
Disclosure of Invention
The invention aims to provide a technical scheme for extracting channel announcement information based on a BiLSTM-CRF model, to improve the utilization of channel announcement information, and to promote the integration and sharing of channel resources.
The technical scheme of the invention provides a channel announcement information extraction method based on a BiLSTM-CRF model, comprising the following steps:
step 1, performing Chinese word segmentation on channel-related information, wherein during segmentation a word segmentation dictionary of electronic channel map object names is constructed from the channel element layers and used as the registered-word dictionary;
step 2, extracting key information through geographic entity recognition, wherein the elements of a channel announcement that are of practical significance to users are divided into organization O, location L, subject S, event E and time T to construct a text semantic extraction model of the channel announcement, a BiLSTM-CRF model is trained under the constraint of the text semantic extraction model, and the key information is extracted.
Furthermore, channel information is acquired in advance, which comprises acquiring and storing channel-related information including channel announcements, planned water depths and maintained channel dimensions; the channel-related information is acquired with a focused web crawler.
Furthermore, when pages are crawled, the filtered links are placed into the URL queue in the priority order "important", "upstream", "midstream" and "downstream".
Furthermore, the word segmentation dictionary of electronic channel map object names is constructed from the channel element layers as follows:
step 1.1, loading the channel element layers in batches;
step 1.2, reading an element, extracting the element name from the attribute field, and saving the result to the list of names read so far;
step 1.3, judging whether any unread elements remain; if so, continuing to read elements by returning to step 1.2; if not, ending the reading process and proceeding to step 1.4;
step 1.4, writing the final name list obtained in step 1.2 to a text file in order, in the "name + line feed" format of a Chinese word segmentation dictionary, and outputting the final file as the word segmentation dictionary.
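Steps 1.1 to 1.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each layer is assumed to be already loaded as an iterable of attribute dictionaries (a real system would read them with a GIS library), the function name `build_segmentation_dictionary` is hypothetical, the default attribute field `NOBJNM` is the example field named in the embodiment, and skipping empty or duplicate names is an added convenience that the source does not specify.

```python
from pathlib import Path
from typing import Iterable, List, Mapping


def build_segmentation_dictionary(
    layers: Iterable[Iterable[Mapping[str, object]]],
    out_path: str,
    name_field: str = "NOBJNM",
) -> List[str]:
    """Collect element names from all layers and write one name per line."""
    names: List[str] = []
    seen = set()
    for layer in layers:                      # step 1.1: go through the loaded layers
        for element in layer:                 # steps 1.2-1.3: read until no element is left
            name = str(element.get(name_field) or "").strip()
            if name and name not in seen:     # assumption: drop empty and repeated names
                seen.add(name)
                names.append(name)
    # step 1.4: "name + line feed" format, in reading order
    Path(out_path).write_text("".join(n + "\n" for n in names), encoding="utf-8")
    return names
```

The resulting text file has the one-entry-per-line layout that Chinese word segmentation tools commonly accept as a custom dictionary.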
Furthermore, in the text semantic extraction model of the channel announcement,
organization O identifies the organization issuing the channel announcement;
location L identifies position-related information contained in the channel announcement, including typical channel features with definite spatial positions;
subject S identifies the main content of the channel announcement, including special channel element objects and the operating state of the channel;
event E identifies procedural content in the channel announcement, including natural events and human-caused events;
time T identifies the release time of the channel announcement.
Furthermore, training the BiLSTM-CRF model under the constraint of the text semantic extraction model comprises annotating according to the text semantic extraction model with the BIO tag set used in the Bakeoff-3 evaluation, and adding constraints to the final predicted labels in the CRF layer of the BiLSTM-CRF model.
Furthermore, the method is used for spatial information visualization, which comprises spatially matching the geographic entities labeled as locations by the BiLSTM-CRF model with the electronic channel map, generating a geofence centered on the spatial position, and marking and displaying real-time channel announcement information.
Furthermore, the method is used for visualizing key channel areas and for navigation assistance reminders.
Furthermore, the method is used for mobile-terminal-based channel information interaction and pushing.
The invention also provides a channel announcement information extraction system based on the BiLSTM-CRF model, which is used to execute the above channel announcement information extraction method.
The invention provides a channel announcement information extraction technique based on a BiLSTM-CRF model that extracts channel announcement information quickly. First, web crawler technology is used to crawl and store channel-related information from the channel bureau website. The crawled data are then processed so that the unstructured channel announcement information is split into independent word units with specific meanings. Next, an "organization-location-subject-event-time" (OLSET) text semantic extraction model of the channel announcement is constructed, and machine learning training is performed with a bidirectional long short-term memory network and conditional random field (BiLSTM-CRF) model. Finally, the key information of the channel announcement elements is extracted automatically according to the training result. The extracted channel announcement information can be used for visualizing textual information such as channel announcements and channel dimensions, visualizing key channel areas, providing navigation assistance reminders, and interacting with and pushing channel information on mobile terminals. Because the word segmentation dictionary is constructed from electronic channel map object names, channel information can be extracted more accurately than with a conventional dictionary. The invention is suitable not only for extracting the information elements of channel announcements but also for geospatial processing and visualization of other shipping information, and metrics such as recognition accuracy and recall for electronic channel map object names continue to improve as the machine learning model is operated and refined.
Drawings
FIG. 1 is a system block diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of channel information acquisition according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of key information extraction according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a Chinese segmentation dictionary construction process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Chinese word segmentation process according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a BiLSTM-CRF model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of spatial information visualization according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the drawings and the embodiment.
The embodiment provides the processing flow of a channel announcement information extraction method based on a BiLSTM-CRF model, which is implemented as follows:
Firstly, channel-related information is acquired and stored in advance.
In the embodiment, focused web crawler (Focused Crawler) technology is preferably used to crawl channel-related information such as channel announcements, planned water depths and maintained channel dimensions from the website of the Yangtze River Channel Bureau, and the results can be stored in a database. The crawling process of the embodiment is shown in FIG. 2, and the detailed steps are as follows:
step 1, defining and describing the crawling target: in the focused web crawler, the crawling target and its description are first defined according to the crawling requirements, namely the channel service web pages of the Yangtze River Channel Bureau, which include channel dimension forecasts, channel announcements, water levels, tide levels, safety warnings, comprehensive service information, monthly water depth plans and annual water depth plans;
step 2, obtaining the initial URL (http://www.cjhdj.com.cn/hdfw/);
step 3, crawling the page at the initial URL and obtaining new URLs;
step 4, filtering out of the new URLs the links irrelevant to the crawling target; for example, when channel announcements are crawled, the filtering keyword of the URL address is "channel_notice", that is, all web page addresses must start with "http://www.cjhdj.com.cn/hdfw/channel_notice/";
step 5, placing the filtered links into the URL queue in order:
In a specific implementation, following the business divisions of the Yangtze River Channel Bureau, the channel announcement web page has sub-columns such as important, upstream, midstream, downstream and summary. The "important" column contains channel information of significant reference value for ship navigation, such as channel opening and closing, channel adjustment and channel emergencies, while the upstream, midstream and downstream columns provide the announcements for the corresponding geographical sections of the channel. It is therefore preferably suggested that the filtered links be placed into the URL queue in the priority order "important", "upstream", "midstream" and "downstream", for example:
(i) "important" (http://www.cjhdj.com.cn/hdfw/channel_notice/hdtgzy/),
(ii) "upstream" (http://www.cjhdj.com.cn/hdfw/channel_notice/hdtgsy/),
(iii) "midstream" (http://www.cjhdj.com.cn/hdfw/channel_notice/hdtgzy1/),
(iv) "downstream" (http://www.cjhdj.com.cn/hdfw/channel_notice/hdtgxy/);
step 6, acquiring the web page contents of the filtered links with a breadth-first crawling strategy;
step 7, taking the next URL address to be crawled as the initial URL address and repeating steps 3 to 7;
step 8, stopping crawling when no URL address to be crawled can be obtained.
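The crawling loop of steps 2 to 8 can be sketched as follows. This is a simplified illustration, not the embodiment's crawler: the page download is injected as a `fetch` callable so that the sketch runs without network access, links are assumed to be absolute `href` values, and the names `crawl` and `priority_of` as well as the column identifiers passed in `columns` are hypothetical. A priority queue with an insertion counter gives the column priority order of step 5 and first-in-first-out (breadth-first) order within one priority level.

```python
import heapq
import re
from typing import Callable, Dict, Sequence

HREF = re.compile(r'href="([^"]+)"')


def priority_of(url: str, prefix: str, columns: Sequence[str]) -> int:
    """Rank a URL by the sub-column that follows the prefix; unknown columns go last."""
    column = url[len(prefix):].split("/", 1)[0]
    return columns.index(column) if column in columns else len(columns)


def crawl(start_url: str, prefix: str, columns: Sequence[str],
          fetch: Callable[[str], str]) -> Dict[str, str]:
    """Return {url: html} for the start page and every reachable page under prefix."""
    pages: Dict[str, str] = {}
    seen = {start_url}
    queue = [(0, 0, start_url)]               # (priority, insertion order, url)
    counter = 1
    while queue:                              # step 8: stop when the queue is empty
        _, _, url = heapq.heappop(queue)      # step 7: next URL to crawl
        html = fetch(url)                     # steps 3 and 6: download the page
        pages[url] = html
        for link in HREF.findall(html):
            if link in seen or not link.startswith(prefix):
                continue                      # step 4: drop links outside the target
            seen.add(link)
            heapq.heappush(queue, (priority_of(link, prefix, columns), counter, link))
            counter += 1                      # step 5: queue by column priority
    return pages
```

In use, `fetch` would wrap an HTTP client, `prefix` would be the announcement URL prefix from step 4, and `columns` would list the sub-column path segments in priority order.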
Secondly, Chinese word segmentation and geographic entity recognition are performed on the input channel-related information. The extraction process is shown in FIG. 3, and the detailed steps are as follows:
(1) Chinese word segmentation
The electronic channel map contains channel-related place names, the names of channel facilities such as navigation marks and regulation structures, and other proper nouns that are not covered by a conventional dictionary. The embodiment therefore constructs the word segmentation dictionary from the electronic channel map object names (processing flow shown in FIG. 4) and segments the channel announcement titles with the jieba word segmentation tool in a Python environment (processing flow shown in FIG. 5). The detailed steps are as follows:
step 1, constructing the word segmentation dictionary of electronic channel map object names, referring to FIG. 4, with the following specific process:
step 1.1, loading the channel element layers in batches;
step 1.2, reading an element, extracting the element name from the attribute field (such as NOBJNM), and saving the result to the list of names read so far;
step 1.3, judging whether any unread elements remain; if so, continuing to read elements by repeating step 1.2; if not, ending the reading process and proceeding to step 1.4;
step 1.4, writing the final name list obtained in step 1.2 to a text file in order, in the "name + line feed" format commonly used by Chinese word segmentation dictionaries, and outputting the final file as the word segmentation dictionary;
step 2, cleaning the sentence to be processed: special characters that are irrelevant to word segmentation, such as UTF-8 encoded Latin symbols, are separated out and marked with the unknown part of speech;
step 3, loading the constructed word segmentation dictionary of electronic channel map object names as the registered-word dictionary and building a trie-based word segmentation model (prefix dictionary);
step 4, scanning the word graph based on the prefix dictionary to generate a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the text;
step 5, searching for the maximum-probability path (route) by dynamic programming to find the most probable segmentation based on word frequency;
step 6, tagging the registered words recorded in the word segmentation dictionary according to the dictionary;
step 7, handling words not included in the word segmentation dictionary separately for Chinese and non-Chinese text: combinations of English letters, digits and time expressions are given corresponding labels, while for Chinese the word formation probability is computed with a Hidden Markov Model (HMM) based on the word-forming capability of Chinese characters;
step 8, performing part-of-speech tagging based on the Viterbi algorithm;
step 9, extracting keywords based on the TF-IDF and TextRank models.
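To make the prefix dictionary, the DAG and the maximum-probability path concrete, the following pure-Python sketch reproduces the idea in miniature. It is an illustration only and not jieba's code: the function names and the word frequencies in the usage example are hypothetical, characters that are not in the dictionary are treated as single-character words with frequency 1, and the later stages (HMM handling of unregistered words, part-of-speech tagging, keyword extraction) are omitted.

```python
import math
from typing import Dict, List


def build_prefix_dict(freqs: Dict[str, int]) -> Dict[str, int]:
    """Map every word to its frequency and every proper prefix to 0."""
    prefix: Dict[str, int] = {}
    for word, freq in freqs.items():
        prefix[word] = freq
        for end in range(1, len(word)):
            prefix.setdefault(word[:end], 0)
    return prefix


def build_dag(sentence: str, prefix: Dict[str, int]) -> Dict[int, List[int]]:
    """For each start index, list the end indices of dictionary words starting there."""
    dag: Dict[int, List[int]] = {}
    for start in range(len(sentence)):
        ends, end = [], start
        fragment = sentence[start]
        while end < len(sentence) and fragment in prefix:
            if prefix[fragment] > 0:
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        dag[start] = ends or [start]          # fall back to a single character
    return dag


def segment(sentence: str, freqs: Dict[str, int]) -> List[str]:
    """Pick the segmentation with the highest total log-probability."""
    prefix = build_prefix_dict(freqs)
    dag = build_dag(sentence, prefix)
    log_total = math.log(sum(freqs.values()))
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):        # dynamic programming from the right
        route[start] = max(
            (math.log(prefix.get(sentence[start:end + 1]) or 1) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    words, start = [], 0
    while start < n:                          # follow the best path from the left
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words
```

With a bridge name present in the dictionary, `segment("武汉长江大桥水域", {"长江": 50, "武汉": 40, "大桥": 30, "长江大桥": 5, "武汉长江大桥": 20})` keeps "武汉长江大桥" as one word, whereas a dictionary without that entry splits it into "武汉", "长江" and "大桥". This is the effect the object-name dictionary is meant to achieve. With jieba itself, a dictionary file would typically be loaded through `jieba.load_userdict` before segmenting.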
(2) Named entity recognition
Step 1, although current channel announcement information is unstructured, it still contains specific element units such as organization, location, subject, event and time. Geographic entity recognition for channel announcements can therefore be converted into a sequence labeling problem, which reduces it to structured classification and lays the groundwork for the subsequent deep learning. The elements of a channel announcement that are of practical significance to users are divided into Organization, Location, Subject, Event and Time, thereby constructing the "Organization-Location-Subject-Event-Time" (OLSET) text semantic extraction model of the channel announcement, wherein:
(1) O (Organization): identifies the organization issuing the channel announcement, such as the Changjiang XX channel bureau/division.
(2) L (Location): identifies position-related information contained in the channel announcement, such as XX channel/water area/river reach/shoal (only XX is labeled; suffixes such as channel, water area, river reach and shoal are not labeled), as well as typical channel features with definite spatial positions, such as bridges and wharves.
(3) S (Subject): identifies the main content of the channel announcement, including special channel element objects, such as controlled river reaches, shoals, bridge areas, signal stations and dedicated channels/navigation marks, and the operating state of the channel, such as navigation prohibition/lifting of prohibition, end of duty/start of duty, and navigation mark adjustment/removal/recovery/arrangement/malfunction/abnormal operation.
(4) E (Event): identifies procedural content in the channel announcement, such as natural events (flood peaks, floods, low-water periods, flood seasons, non-flood seasons) or human-caused events (channel maintenance, dredging, sand mining, construction, operations, surveys).
(5) T (Time): identifies the release time of the channel announcement, such as XX year X month X day.
Step 2, performing machine learning training with a bidirectional long short-term memory network and conditional random field (BiLSTM-CRF) model and extracting key information. The model structure is shown in FIG. 6, and the processing flow is as follows:
1) Based on the text semantic extraction model constructed in step 1, the BIO (Begin, Inside, Outside) tag set used in the Bakeoff-3 evaluation is used for annotation: B-ORG marks the first character of an organization and I-ORG a non-first character of an organization; B-LOC marks the first character of a location and I-LOC a non-first character of a location; B-SUB marks the first character of a subject and I-SUB a non-first character of a subject; B-EVE marks the first character of an event and I-EVE a non-first character of an event; B-TM marks the first character of a time and I-TM a non-first character of a time; and O marks a character that is not part of any named entity.
The invention treats geographic entity recognition as a classification problem: the targets are divided according to business requirements, and recognition in the subsequent steps is performed by machine learning. In the embodiment, the crawled "important" channel announcement information is used as the training data set and annotated according to the text semantic extraction model.
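The character-level BIO annotation can be illustrated with a short sketch. The helper name `bio_tags`, the span format (start index, exclusive end index, label) and the sample sentence are assumptions made for illustration; only the label names ORG, LOC, SUB, EVE and TM come from the text above. Under the BIO convention, the first character of an entity receives a `B-` tag, the remaining characters of that entity receive `I-` tags, and all other characters receive `O`.

```python
from typing import List, Sequence, Tuple


def bio_tags(text: str, entities: Sequence[Tuple[int, int, str]]) -> List[str]:
    """Turn entity spans (start, end exclusive, label) into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in entities:
        tags[start] = "B-" + label            # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label            # remaining characters of the entity
    return tags


# Hypothetical example: the first five characters form an organization name.
print(bio_tags("长江航道局通告", [(0, 5, "ORG")]))
```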
2) Taking a sentence as the unit, a sentence (a sequence of characters) containing $n$ characters is written as:
$$x = (x_1, x_2, \ldots, x_n)$$
where $x_i$ denotes the id in the dictionary of the $i$-th character of the sentence, from which a one-hot vector for each character is obtained whose dimension is the dictionary size.
3) Using an embedding matrix that is either pre-trained or randomly initialized, each character $x_i$ in the sentence is mapped from its one-hot vector to a low-dimensional dense vector $\mathbf{x}_i \in \mathbb{R}^d$, where $d$ is the embedding dimension, and dropout is applied to mitigate overfitting. Dropout means temporarily discarding neural network units from the network with a certain probability during the training of a deep learning network.
4) Sentence features are extracted automatically. The sequence of character embeddings $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ of a sentence is used as the input at each time step of the bidirectional LSTM. The hidden state sequence $(\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_n)$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_n)$ output by the backward LSTM are concatenated position by position, $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t] \in \mathbb{R}^m$ (where $m$ is the dimension of the concatenated hidden state), giving the complete hidden state sequence $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$.
5) After dropout, a linear layer is attached that maps the hidden state vectors from $m$ dimensions to $k$ dimensions, where $k$ is the number of labels in the label set. The automatically extracted sentence features are thus obtained and recorded as the LSTM output matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$,
where $\mathbb{R}^{n \times k}$ is the space of the dimension-reduced vectors and $p_i$ is the $i$-th row of the LSTM output matrix.
Each dimension $p_{ij}$ of $p_i \in \mathbb{R}^k$ can be regarded as the score for classifying character $x_i$ into the $j$-th label. Applying Softmax to $P$ would amount to an independent $k$-class classification at each position. However, such per-position labeling cannot use the labels already assigned to other positions, so a conditional random field (CRF) layer is attached next to perform the labeling.
6) Sentence-level sequence labeling is performed. The parameter of the CRF layer is a $(k+2) \times (k+2)$ matrix $A$, where $A_{ij}$ is the transition score from the $i$-th label to the $j$-th label, so that labels assigned earlier can be used when labeling a position; the 2 is added because a start state is added at the head of the sentence and an end state at its tail. For a label sequence $y = (y_1, y_2, \ldots, y_n)$ whose length equals the sentence length, the model's score for sentence $x$ having the label sequence $y$ is:
$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i}$$
where $P_{i, y_i}$ is the score for classifying the $i$-th character into label $y_i$, $A_{y_{i-1}, y_i}$ is the transition score from label $y_{i-1}$ to label $y_i$, and $y_0$ and $y_{n+1}$ denote the start and end states.
It can be seen that the score of the whole sequence equals the sum of the scores of the positions, and that the score of each position comes from two parts: one determined by $p_i$ output by the LSTM, the other by the transition matrix $A$ of the CRF. The normalized probability is then obtained with Softmax:
$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$
where $y'$ ranges over all possible label sequences for the sentence, $\mathrm{score}(x, y)$ is the score for sentence $x$ having the label sequence $y$, and $\mathrm{score}(x, y')$ is the score for sentence $x$ having the label sequence $y'$.
7) The log-likelihood is maximized. The log-likelihood of one training sample $(x, y)$, with $y'$ ranging over all candidate label sequences, is:
$$\log P(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y'} \exp(\mathrm{score}(x, y'))$$
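The sequence score and the log-likelihood can be checked numerically with a small pure-Python sketch. It assumes a particular index layout that the source does not fix: labels are numbered 0 to k-1, the start state is index k and the end state is index k+1 of the transition matrix. The function names are hypothetical. The normalizing term is computed with the standard forward recursion instead of enumerating every label sequence.

```python
import math
from typing import List, Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(P: List[List[float]], A: List[List[float]], y: Sequence[int]) -> float:
    """Sum of emission scores P[i][y_i] plus the n+1 transition scores."""
    k = len(P[0])
    path = [k] + list(y) + [k + 1]            # assumed layout: start = k, end = k + 1
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    return emission + transition


def log_partition(P: List[List[float]], A: List[List[float]]) -> float:
    """log of the sum of exp(score) over all label sequences (forward algorithm)."""
    k = len(P[0])
    alpha = [A[k][j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[a] + A[a][j] for a in range(k)]) + P[i][j]
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][k + 1] for j in range(k)])


def log_likelihood(P: List[List[float]], A: List[List[float]], y: Sequence[int]) -> float:
    """log P(y | x) = score(x, y) - log sum over y' of exp(score(x, y'))."""
    return sequence_score(P, A, y) - log_partition(P, A)
```

Training maximizes `log_likelihood` over the annotated sentences; in practice this is done inside a deep learning framework with automatic differentiation rather than with plain lists.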
8) A predicted label for each character is obtained. The optimal path is solved with the Viterbi dynamic programming algorithm, where $y'$ ranges over all candidate label sequences:
$$y^{*} = \arg\max_{y'} \mathrm{score}(x, y')$$
The Viterbi algorithm is a classical dynamic programming algorithm for finding the optimal path, and its details are not repeated here.
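For readers who want to see the decoding step spelled out, a minimal Viterbi sketch follows. As an assumption for illustration, `P` is an n-by-k list of per-position label scores, `A` is a (k+2)-by-(k+2) list of transition scores, and the start and end states occupy indices k and k+1; the function name `viterbi` is hypothetical.

```python
from typing import List, Tuple


def viterbi(P: List[List[float]], A: List[List[float]]) -> Tuple[List[int], float]:
    """Return the highest-scoring label sequence and its score."""
    k = len(P[0])
    start, end = k, k + 1                      # assumed layout of the extra states
    best = [A[start][j] + P[0][j] for j in range(k)]
    back: List[List[int]] = []
    for i in range(1, len(P)):
        pointers, scores = [], []
        for j in range(k):
            # best previous label for reaching label j at position i
            prev = max(range(k), key=lambda a: best[a] + A[a][j])
            pointers.append(prev)
            scores.append(best[prev] + A[prev][j] + P[i][j])
        back.append(pointers)
        best = scores
    last = max(range(k), key=lambda j: best[j] + A[j][end])
    score = best[last] + A[last][end]
    path = [last]
    for pointers in reversed(back):            # follow the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score
```

The recursion keeps, for each position and label, only the best-scoring way of arriving there, so decoding costs on the order of n times k squared operations instead of enumerating all k to the power n sequences.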
9) Rule constraints of the CRF layer. The BiLSTM yields a label for each character in the sentence, but there is no guarantee that the labels are predicted correctly every time. The CRF layer can add constraints on the final predicted labels to ensure that they conform to the rules, and these constraints are learned automatically by the CRF layer from the training data. With the CRF layer attached for sentence-level label prediction, the labeling process no longer classifies each character independently; the transition probabilities of the sequence are introduced, and finally the loss function is computed and propagated back to the network. Under the action of the CRF, the sequence is regularized according to the transition probabilities.
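The kind of rule the CRF layer is expected to learn can be written down explicitly for the BIO scheme: an `I-X` tag may only follow `B-X` or `I-X` of the same entity type, so a sentence cannot begin with an `I-` tag. The sketch below hard-codes that rule purely for illustration, with hypothetical function names; in the model described here the constraint is not programmed but emerges from the learned transition scores.

```python
from typing import Sequence


def allowed(prev: str, curr: str) -> bool:
    """True if tag curr may directly follow tag prev under the BIO scheme."""
    if curr.startswith("I-"):
        label = curr[2:]
        return prev in ("B-" + label, "I-" + label)
    return True                                # "O" and any "B-" tag can follow anything


def is_valid_bio(tags: Sequence[str]) -> bool:
    """Check a whole tag sequence; the sentence start behaves like an "O" tag."""
    previous = "O"
    for tag in tags:
        if not allowed(previous, tag):
            return False
        previous = tag
    return True
```

A check of this kind can also serve as a sanity test on decoded outputs: a well-trained CRF layer should rarely, if ever, emit a sequence for which `is_valid_bio` returns `False`.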
In the embodiment, after the training and learning of the model are finished, the crawled 'upstream', 'midstream' and 'downstream' channel announcement information is used as a test data set to verify and evaluate the model processing result.
In specific implementation, the method provided by the technical scheme of the invention can be implemented by a person skilled in the art by adopting a computer software technology to realize an automatic operation process, and a system device for operating the method also needs to be in the protection scope of the invention. Referring to fig. 1, the embodiment further provides a channel announcement information extraction system based on the BiLSTM-CRF model, which includes a chinese word segmentation module (10) and a named entity identification module (20).
The Chinese word segmentation module (10) performs Chinese word segmentation on the channel-related information; for the segmentation, a word segmentation dictionary of electronic channel map object names is constructed from the channel element layers and used as the registered-word dictionary;
the named entity recognition module (20) extracts key information through geographic entity recognition: the elements of the channel announcement information that have practical significance to users are divided into organization O, place L, subject S, event E and time T, a text semantic extraction model of the channel announcement is constructed, a BiLSTM-CRF model is trained under the constraint of the text semantic extraction model, and the key information is extracted.
For the implementation of each module, refer to the description of the corresponding method step; it is not repeated here.
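As an illustration of how the word segmentation module (10) could use a dictionary of channel object names, the sketch below writes a list of names to a text file, one name per line, and then segments a sentence by forward maximum matching against that dictionary. Forward maximum matching is a stand-in chosen for brevity, not the segmentation algorithm of the invention. The feature names, the sample sentence, the file name and the function names are all illustrative assumptions.

```python
import os
import tempfile


def write_dictionary(names, path):
    """Write unique, non-empty names to a text file, one name per line."""
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for name in names:
            if name and name not in seen:
                seen.add(name)
                f.write(name + "\n")


def load_dictionary(path):
    """Read the dictionary file back into a set of words."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation that prefers the longest dictionary word.

    Characters not covered by any dictionary word are emitted one by one.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Illustrative object names (one duplicate) as they might be read from element layers.
names = ["武桥水道", "白沙洲大桥", "武桥水道"]
with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "channel_names.txt")
    write_dictionary(names, dict_path)
    dictionary = load_dictionary(dict_path)

tokens = forward_max_match("武桥水道白沙洲大桥附近施工", dictionary)
print(tokens)
```

Because the object names are in the dictionary, they are kept as single tokens instead of being split into shorter words, which is the reason for building the dictionary from the channel element layers before entity recognition.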
The technical scheme of the invention can support a variety of downstream applications, for example:
(1) Visualization of text information such as channel announcements and maintenance scales. The Yangtze River channel map APP obtains the text of channel announcements published by the Yangtze River channel bureau; extracts information such as geographic position, range, key content and start and end times from the text using natural language processing techniques such as template matching, word segmentation and named entity recognition; and structures the extracted information. Spatial information such as geographic position and range is matched with elements such as water channels and mileage on the electronic channel map, and a geo-fence is constructed from a coordinate point plus a buffer area, or from a polygon. The key content of the channel announcement or maintenance scale is displayed on the geo-fence, and the start and end times control when the information is displayed and when it is withdrawn.
(2) Visualization of key channel areas and navigation assistance reminders. Geo-fences are constructed for areas such as anchorages, warning areas and restricted navigation areas and overlaid on the electronic channel map; when a ship's positioning signal approaches or lies within a geo-fence, the relevant early-warning and alert information can be pushed to that ship. Geo-fences are likewise constructed for the water channel extents of the Yangtze River electronic channel map; when a ship's positioning signal approaches or lies within the geo-fence of a water channel, the channel maintenance scale, navigation mark and water level information, channel announcements, surrounding meteorological information and related points of interest (POI) for that water channel can be pushed to the ship.
(3) Channel information interaction and pushing based on the mobile terminal. The Yangtze River channel map APP provides a user reporting function: the user clicks on the electronic channel map to obtain the geographic coordinates of the position, a geo-fence is created from those coordinates, and a report is generated. The user can attach a text description or on-site photos to the report and submit the on-site navigation conditions to the relevant management department, which helps the site to be confirmed promptly and the information to be updated and released quickly.
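The geo-fence reminders in application (2) can be sketched for the simplest case, a circular fence defined by a center point and a buffer radius. The sketch classifies a ship position as inside, near, or away from the fence using the haversine great-circle distance. The coordinates, the radius, the "near" margin and all names are hypothetical, and a polygonal fence would need a point-in-polygon test instead.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


@dataclass
class CircularFence:
    name: str
    lon: float       # center longitude in degrees
    lat: float       # center latitude in degrees
    radius_m: float  # buffer radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def fence_status(fence, lon, lat, near_margin_m=500.0):
    """Classify a position as 'inside', 'near' or 'away' relative to the fence."""
    distance = haversine_m(fence.lon, fence.lat, lon, lat)
    if distance <= fence.radius_m:
        return "inside"
    if distance <= fence.radius_m + near_margin_m:
        return "near"
    return "away"


# Hypothetical anchorage fence and three ship positions.
anchorage = CircularFence("anchorage (example)", lon=114.30, lat=30.55, radius_m=1000.0)
for ship_lat in (30.55, 30.561, 30.60):
    status = fence_status(anchorage, 114.30, ship_lat)
    if status != "away":
        print(f"push alert for {anchorage.name}: ship is {status} the fence")
```

The separate "near" band reflects the wording that information is pushed when a ship is close to or within the fence; the width of that band is a tuning choice that the text does not specify.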
For reference, the embodiment describes channel information visualization in detail as follows:
The identified geographic entities, namely the entities labeled as "location" in the named entity recognition step, are spatially matched with the electronic channel map; a geo-fence is generated centered on the spatial position and labeled with real-time channel notification information. The visualization process is shown in Fig. 7, and the detailed implementation steps are as follows:
Step 1, parsing AIS data or mobile terminal GPS data to obtain the longitude and latitude of the current position, and judging whether the current position lies within the map range of the APP; if not, roaming to the map that contains the current position.
Step 2, extracting the centers of the channel element features so that the notification information can be drawn at the feature centers: overlay analysis is performed within the current map range to obtain typical channel features with definite spatial positions, such as water channels, navigation marks, bridges and wharves, and their center positions are computed in turn so that the channel notification information can be drawn centered on them. For point features such as navigation marks and obstructions, the center position is the actual position; for linear or areal features such as bridges, wharves and water channels, the center position can be expressed as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
where $(\bar{x}, \bar{y})$ is the center position, $x_i$ and $y_i$ are the coordinates of point element $i$ making up the linear or areal feature, and $n$ is the total number of point elements making up that feature.
Step 3, computing a suitable buffer radius (for example, one third of the screen width) or polygon extent from the resolution of the current mobile device and the center positions obtained in step 2, and constructing the geo-fences in turn.
Step 4, checking whether the geo-fences constructed in step 3 overlap one another. For simple polygonal geo-fences, the ray method is efficient: from each vertex of geo-fence A, a ray is cast along the X axis, its intersections with each edge of geo-fence B are found, and the intersections are counted. If the count is even, the vertex lies outside B; if it is odd, the vertex lies inside B and geo-fences A and B overlap, in which case the geo-fence extent must be adjusted or offset. If every vertex gives an even count, A and B are treated as not overlapping.
Step 5, requesting the corresponding key information in turn through the WebService, based on the feature names obtained in step 2.
Step 6, using the feature center positions obtained in step 2 and the key information obtained in step 5, condensing the channel notification information into a preset format (for example, feature name + event + time), and drawing it as a label within the geo-fence range determined in step 4.
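Steps 2 and 4 can be sketched as follows. The center of a linear or areal feature is taken as the arithmetic mean of its point coordinates, following the definitions of the point coordinates and n in step 2, and two simple polygonal geo-fences are tested for overlap by casting a ray along the X axis from each vertex of one fence and counting its crossings with the edges of the other. The function names and coordinates are hypothetical. The vertex test follows the text as written, so it does not detect two fences whose edges cross while no vertex of the first lies inside the second.

```python
def center_of(points):
    """Arithmetic mean of the (x, y) points that make up a line or area feature."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def point_in_polygon(px, py, polygon):
    """Ray casting: True if a ray from (px, py) toward +X crosses an odd number of edges."""
    crossings = 0
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # The edge straddles the horizontal line y = py; the half-open
        # comparison avoids counting a shared vertex twice.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > px:
                crossings += 1
    return crossings % 2 == 1


def fences_overlap(fence_a, fence_b):
    """True if any vertex of fence_a lies inside fence_b."""
    return any(point_in_polygon(x, y, fence_b) for x, y in fence_a)


# Hypothetical square geo-fences in screen coordinates.
fence_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
fence_b = [(2, 2), (6, 2), (6, 6), (2, 6)]
fence_c = [(10, 10), (12, 10), (12, 12), (10, 12)]

print(center_of(fence_a))
print(fences_overlap(fence_a, fence_b))
print(fences_overlap(fence_a, fence_c))
```

When an overlap is reported, step 4 calls for adjusting or offsetting the fence extent; how far to shift is left open in the text, so the sketch stops at detection.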
In specific implementation, the above applications can also be run automatically in software.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A channel announcement information extraction method based on a BiLSTM-CRF model, characterized by comprising the following steps:
step 1, performing Chinese word segmentation on channel-related information, wherein, for the segmentation, a word segmentation dictionary of electronic channel map object names is constructed from the channel element layers and used as the registered-word dictionary;
step 2, extracting key information through geographic entity recognition, comprising dividing the elements of the channel announcement information that have practical significance to users into organization O, place L, subject S, event E and time T, constructing a text semantic extraction model of the channel announcement, training a BiLSTM-CRF model under the constraint of the text semantic extraction model, and extracting the key information.
2. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, wherein: channel information is collected in advance, comprising acquiring and storing channel-related information, the channel-related information comprising channel announcements, planned water depths and maintenance scales; and the channel-related information is acquired by means of a focused web crawler.
3. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 2, wherein: when pages are crawled, the filtered links are put into a URL queue in order of the priorities 'important', 'upstream', 'midstream' and 'downstream'.
4. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: the implementation mode of constructing the electronic channel map object name word segmentation dictionary according to the channel element map layer is as follows,
step 1.1, loading channel element layers in batches;
step 1.2, reading the element, extracting the element name according to the attribute field, and storing the result to a read attribute name list;
step 1.3, judging whether any unread elements remain; if so, returning to step 1.2 to continue reading; if not, ending the reading process and proceeding to step 1.4;
step 1.4, writing the final name list obtained in step 1.2 into a text file in sequence, in the 'name + line feed' format of the Chinese word segmentation dictionary, and outputting the resulting file as the word segmentation dictionary.
5. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: in the text semantic extraction model of the channel announcement,
organization O is used for identifying the organization that issued the channel announcement;
place L is used for identifying the position-related information contained in the channel announcement, including typical channel features with definite spatial positions;
subject S is used for identifying the main content of the channel announcement, including channel-specific element objects and the operating state of the channel;
event E is used for identifying the process-related content of the channel announcement, including natural events and man-made events;
time T is used for identifying the release time of the channel announcement.
6. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: training with the BiLSTM-CRF model under the constraint of the text semantic extraction model comprises annotating according to the text semantic extraction model with the BIO tag set used in the Bakeoff-3 evaluation, and adding constraints to the final predicted labels in the CRF layer of the BiLSTM-CRF model.
7. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: the method is used for spatial information visualization, comprising spatially matching the geographic entities labeled as place by the BiLSTM-CRF model with the electronic channel map, generating a geo-fence centered on the spatial position, and labeling and displaying real-time channel notification information.
8. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: the method is used for channel key area visualization and navigation auxiliary reminding.
9. The method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claim 1, 2 or 3, wherein: the method is used for channel information interaction and pushing based on the mobile terminal.
10. A channel announcement information extraction system based on a BiLSTM-CRF model, characterized in that: the system is used for carrying out the method for extracting BiLSTM-CRF model-based channel announcement information as claimed in claims 1 to 9.
CN202010756216.6A 2020-07-31 2020-07-31 Channel notification information extraction method and system based on BiLSTM-CRF model Active CN111914539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010756216.6A CN111914539B (en) 2020-07-31 2020-07-31 Channel notification information extraction method and system based on BiLSTM-CRF model

Publications (2)

Publication Number Publication Date
CN111914539A (en) 2020-11-10
CN111914539B (en) 2024-09-10

Family

ID=73288173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010756216.6A Active CN111914539B (en) 2020-07-31 2020-07-31 Channel notification information extraction method and system based on BiLSTM-CRF model

Country Status (1)

Country Link
CN (1) CN111914539B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018028077A1 (en) * 2016-08-11 2018-02-15 中兴通讯股份有限公司 Deep learning based method and device for chinese semantics analysis
CN108595430A (en) * 2018-04-26 2018-09-28 携程旅游网络技术(上海)有限公司 Boat becomes information extracting method and system
CN109697285A (en) * 2018-12-13 2019-04-30 中南大学 Enhance the hierarchical BiLSTM Chinese electronic health record disease code mask method of semantic expressiveness
CN110990565A (en) * 2019-11-20 2020-04-10 广州商品清算中心股份有限公司 Extensible text analysis system and method for public sentiment analysis
CN111274804A (en) * 2020-01-17 2020-06-12 珠海市新德汇信息技术有限公司 Case information extraction method based on named entity recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王红;李浩飞;邸帅;: "民航突发事件实体识别方法研究", 计算机应用与软件, no. 03, 12 March 2020 (2020-03-12) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127503A (en) * 2021-03-18 2021-07-16 中国科学院国家空间科学中心 Automatic information extraction method and system for aerospace information
CN113011183A (en) * 2021-03-23 2021-06-22 北京科东电力控制系统有限责任公司 Unstructured text data processing method and system in electric power regulation and control field
CN113011183B (en) * 2021-03-23 2023-09-05 北京科东电力控制系统有限责任公司 Unstructured text data processing method and system in electric power regulation and control field
CN112861540A (en) * 2021-04-25 2021-05-28 成都索贝视频云计算有限公司 Broadcast television news keyword automatic extraction method based on deep learning
CN113282767A (en) * 2021-04-30 2021-08-20 武汉大学 Text-oriented relative position information extraction method
CN113282767B (en) * 2021-04-30 2022-08-30 武汉大学 Text-oriented relative position information extraction method
CN114819771A (en) * 2022-06-28 2022-07-29 北京中海住梦科技有限公司 Task allocation method and device, storage medium and electronic equipment
CN118055175A (en) * 2024-04-16 2024-05-17 南京莱斯信息技术股份有限公司 Message analysis processing method combining rule engine and deep learning

Also Published As

Publication number Publication date
CN111914539B (en) 2024-09-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant